
WebEngine


Publication in the ACM Digital Library:
The WebEngine: A Fully Integrated, Decentralised Web Search Engine

Proceedings of the "NLPIR 2018 - 2nd International Conference on Natural Language Processing and Information Retrieval" in Bangkok/Thailand, September 07 - 09, 2018:

This paper presents a basic, new concept for decentralized web search which addresses major shortcomings of current web search engines. Its methods are characterised by their local working principles, making it possible to employ them on diverse hardware configurations. The concept's implementation in form of an interactive, librarian-inspired peer-to-peer software client, called 'WebEngine', is elaborated on in detail. This software extends and interconnects common web servers creating and forming a decentralised web search system on top of the existing web structure while, for the first time, combining modern text analysis techniques with novel and efficient search functions as well as approaches for the semantically induced P2P-network construction and its flexible management. This way, an alternative, fully integrated and powerful web search engine under the motto 'The Web is its own search engine.' is built making the web searchable without any central authority.

URL: https://dl.acm.org/citation.cfm?id=3278294


Centroid Terms (Bedeutungsschwerpunkte)

After reading only a few lines, human readers are able to determine which thematic category a given document belongs to. This strikingly demonstrates how well and how quickly the human brain, in particular the human cortex, can process and interpret data. It is able to understand not only the meaning of individual words (as representations of real-world entities) but also certain relationships between them. In addition, it serves as a knowledge base when thematically classifying previously unknown content. It tries to match the concepts (i.e. the meanings of words) in such documents against previously learned technical terms and can thus classify them roughly, immediately and unconsciously.

Centroid terms (Bedeutungsschwerpunkte) represent a completely new method and technology, inspired by physics and by processes in the brain, for solving these tasks better than all conventional approaches, most of which are based on the bag-of-words model or on term frequency-inverse document frequency (TF-IDF).

Further information


WebEngine Search Results

In the following, we briefly present the WebEngine's user interface and explain four exemplary search results. The focus is on describing the visible components of each result page (presented as screenshots). The documents searched stem from an offline English Wikipedia corpus downloaded from http://www.kiwix.org.
The individual co-occurrence graphs (one on each WebEngine peer) used for query evaluation and centroid determination have been constructed from these articles as well.
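
The role of a co-occurrence graph in centroid determination can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the WebEngine implementation: terms co-occur when they share a sentence, edge distances are the reciprocal of the Dice coefficient, the centroid is the term with the smallest average shortest-path distance to all query terms, and the function names and the toy corpus are hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Return {term: {neighbour: distance}} from tokenised sentences.

    Assumption: distance = 1 / Dice coefficient of the two terms.
    """
    term_freq = defaultdict(int)
    pair_freq = defaultdict(int)
    for sentence in sentences:
        terms = sorted(set(sentence))
        for term in terms:
            term_freq[term] += 1
        for a, b in combinations(terms, 2):
            pair_freq[(a, b)] += 1
    graph = defaultdict(dict)
    for (a, b), n_ab in pair_freq.items():
        distance = (term_freq[a] + term_freq[b]) / (2 * n_ab)
        graph[a][b] = graph[b][a] = distance
    return dict(graph)


def shortest_distances(graph, source):
    """Dijkstra's algorithm: shortest distances from source to reachable terms."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for neighbour, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


def centroid_term(graph, query_terms):
    """Term with the minimal average distance to all known query terms.

    Returns None if no term is reachable from every query term.
    Ties are broken alphabetically, which is an arbitrary choice.
    """
    known = [t for t in query_terms if t in graph]
    if not known:
        return None
    distances = [shortest_distances(graph, t) for t in known]
    best, best_avg = None, float("inf")
    for candidate in sorted(graph):
        if all(candidate in d for d in distances):
            avg = sum(d[candidate] for d in distances) / len(distances)
            if avg < best_avg:
                best, best_avg = candidate, avg
    return best


# Hypothetical toy corpus (not the Wikipedia corpus used by the WebEngine).
toy_sentences = [
    ["bird", "flu", "virus"],
    ["bird", "flu", "h5n1"],
    ["h5n1", "virus", "flu"],
    ["cat", "pet"],
    ["crop", "farm"],
]
toy_graph = build_cooccurrence_graph(toy_sentences)
print(centroid_term(toy_graph, ["bird", "flu", "virus"]))
print(centroid_term(toy_graph, ["cat", "crop"]))
```

On this toy corpus the terms 'cat' and 'crop' lie in disconnected parts of the graph, so no common centroid exists. The centroids reported for the real queries below depend on the Wikipedia-based graphs and are not reproduced by this sketch.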


Query 1: 'bird flu'

Fig. 1: WebEngine search results for the query 'bird flu'

Fig. 1 depicts the results of the query ‘bird flu’. The ‘query quality’ indicator underneath the query input field tells the user that the query, at 87%, is of high quality. The bar’s green color reinforces this evaluation with meaningful visual feedback. The evaluation is based on the diversity measure of a given document or query (see Chapter 3), which indicates how general (rather unspecific) or topic-oriented (rather specific) the analysed content is. The centroid term of the query is ‘bird’. Therefore, the search for documents whose centroid term is ‘bird’ returns, under ‘Centroid Results’, two links to the highly relevant articles ‘bird’ and ‘bird flu’. The list of top-ranked full-text results is shown as well; it also contains those two results. Furthermore, the user can expand and collapse the two result categories.

Query 2: 'bird flu virus'

Fig. 2: WebEngine search results for the query 'bird flu virus'

For query 2, query 1 has been expanded with the term ‘virus’ and thereby made more precise. Fig. 2 depicts the results of this expanded query, ‘bird flu virus’. The query quality stayed almost the same, but the centroid changed to ‘H5N1’. Accordingly, the article ‘H5N1’ has been returned as the only, highly relevant centroid search result (because the query is topically more focused, the number of centroid search results is expected to decrease). The list of top-ranked full-text results still contains links to matching and likewise relevant articles.

Query 3: 'cat crop'

Fig. 3: WebEngine search results for the query 'cat crop'

Fig. 3 depicts the results of the topically unspecific query ‘cat crop’. As its intentionally chosen terms are semantically unrelated, the determined query quality, at only 43%, is rather low. The bar’s red color intuitively suggests that the user should reformulate the query. Accordingly, no centroid search results have been found, and the list of full-text results merely contains a mixture of links to articles that are relevant to either ‘cat’ or ‘crop’.

Query 4: 'khmer rouge'

Fig. 4: WebEngine search results for the query 'khmer rouge'

Fig. 4 depicts the results of the query ‘khmer rouge’. At 92%, the query quality is very high. The only centroid search result (the article ‘Cambodia’) suggests that the query’s determined centroid is likewise ‘Cambodia’. Taking into account the historical origins of the Khmer Rouge, this centroid is a fitting representative for both the query and the returned article. The list of top-ranked full-text results contains links to topically relevant articles as well.
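
The centroid-based part of these four result pages can be replayed with a small lookup sketch in Python. It assumes that each article's centroid term has already been determined and that the ‘Centroid Results’ category simply lists the articles whose centroid equals the query's centroid. The article-to-centroid pairs are taken from the descriptions of Figs. 1, 2 and 4; the function names and the index structure are hypothetical and not part of the WebEngine.

```python
from collections import defaultdict

# Article title -> centroid term, as described for Figs. 1, 2 and 4.
ARTICLE_CENTROIDS = {
    "bird": "bird",
    "bird flu": "bird",
    "H5N1": "H5N1",
    "Cambodia": "Cambodia",
}

# Query -> centroid term of the query, as reported in the text
# (for 'khmer rouge' the text only suggests this centroid).
QUERY_CENTROIDS = {
    "bird flu": "bird",
    "bird flu virus": "H5N1",
    "khmer rouge": "Cambodia",
}


def build_centroid_index(article_centroids):
    """Invert the mapping: centroid term -> list of article titles."""
    index = defaultdict(list)
    for title, centroid in article_centroids.items():
        index[centroid].append(title)
    return dict(index)


def centroid_results(index, query_centroid):
    """Articles sharing the query's centroid; empty list if there are none."""
    return sorted(index.get(query_centroid, []))


index = build_centroid_index(ARTICLE_CENTROIDS)
for query, centroid in QUERY_CENTROIDS.items():
    print(f"{query!r} (centroid {centroid!r}): {centroid_results(index, centroid)}")
```

A query whose centroid matches no indexed article, as with ‘cat crop’ in Fig. 3, yields an empty centroid result list in this sketch, so only the full-text results would remain.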

Jutta Düring | 21.01.2019
FernUniversität in Hagen, Lehrgebiet Kommunikationsnetze, 58084 Hagen, Tel.: +49 2331 987-1141