Semantic Web Tokenizing

Tokenizing

            Search engines are designed so that, when they process a string of words, each word can be treated as a token. When converting RDF triples into swangled terms that look like indexing terms to a web search engine, two major issues arise: which triples are the proper candidates for swangling, and what techniques should be used to swangle them.
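            As a rough illustration of what it means for each word to be treated as a token, the minimal Python sketch below splits a string into word tokens. The regular expression and the example sentence are only illustrative; a real engine would also stem, drop stop words, and so on.

```python
import re

def tokenize(text):
    # A very rough sketch of word-level tokenization; real engines also
    # stem, remove stop words, apply language-specific rules, etc.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Tiger Woods plays golf"))
# -> ['tiger', 'woods', 'plays', 'golf']
```

            An RDF triple such as (ex:TigerWoods, ex:playsSport, ex:Golf) does not break into useful word tokens this way, which is what motivates swangling.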

 What to Swangle?

            Some search engines, such as Google, limit the size of the queries humans can write, so queries must be composed carefully to find the relevant documents. OWL and RDF also contain information about anonymous nodes, also referred to as blank nodes, which are used to represent entities asserted existentially. Triples with blank nodes can still be processed in special ways. A statistical model can also be developed for OWL, using statistical language models similar to those applied to annotations on documents.
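            As a minimal sketch of setting aside triples that mention blank nodes before swangling, the snippet below assumes the N-Triples convention of writing blank-node labels with a "_:" prefix; the example triples are hypothetical.

```python
def is_blank(node):
    # Blank (anonymous) nodes are conventionally written with a "_:" prefix
    # in N-Triples-style serializations; this is only a rough heuristic.
    return node.startswith("_:")

def swangling_candidates(triples):
    # Keep only ground triples; triples that mention blank nodes need the
    # special handling described above before they can be swangled.
    return [t for t in triples if not any(is_blank(c) for c in t)]

triples = [
    ("ex:TigerWoods", "ex:playsSport", "ex:Golf"),
    ("_:b0", "ex:hasWinner", "ex:TigerWoods"),   # triple with a blank node
]
print(swangling_candidates(triples))
# -> [('ex:TigerWoods', 'ex:playsSport', 'ex:Golf')]
```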

 How to Swangle?

            The OWLIR system uses the swangling approach to explore triples. To reduce a set of triples to a set of tokens that the information retrieval system will accept, further experiments are needed to find the most effective techniques.

         The simplest approach is to decompose each triple into its three components and swangle them separately, but most of the information is lost by such processing. OWLIR follows an approach that preserves more information: each triple is transformed into seven patterns by replacing zero or more of its components with a special "don't care" token. Each of the seven resulting patterns is then reduced to a single word-like token for indexing.
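         A minimal sketch of this seven-pattern step is shown below. The "*" marker, the MD5 hash, and the token prefix are assumptions made for illustration, not necessarily what OWLIR itself uses.

```python
import hashlib
from itertools import product

DONT_CARE = "*"   # stand-in for the special "don't care" token

def swangle(triple):
    # Each component (subject, predicate, object) is either kept or replaced
    # by the "don't care" marker; dropping the all-wildcard pattern leaves
    # 2**3 - 1 = 7 patterns.  Each pattern is then hashed so it can be
    # indexed as a single word-like token.
    tokens = []
    for keep_mask in product([True, False], repeat=3):
        if not any(keep_mask):
            continue   # skip the pattern with no real components
        pattern = [c if keep else DONT_CARE
                   for c, keep in zip(triple, keep_mask)]
        digest = hashlib.md5("|".join(pattern).encode("utf-8")).hexdigest()
        tokens.append("swangle" + digest)
    return tokens

triple = ("ex:TigerWoods", "ex:playsSport", "ex:Golf")
print(len(swangle(triple)))   # -> 7
```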

Reasoning and Trust

            For reasoning over Semantic Web markup, a choice has to be made about where to apply it. One option is to reason over the markup in a document before it is indexed, so that a larger set of triples is indexed and can be retrieved. Another is to reason over a query, prior to submitting its RDF triples to the retrieval system. Retrieved documents that contain markup can also be reasoned over after retrieval. With OWLIR the choice is ours: we can reason over documents while indexing them, over the queries to be submitted, or both.

 How much to reason?

            A similar problem arises when considering how much reasoning to do: whether to rely largely on forward chaining, as OWLIR does, or to combine both forward and backward reasoning.

Trustable knowledge

            A variety of information is available on the Semantic Web, and its reliability is not guaranteed. A newly found, relevant document cannot simply be injected as fact into our reasoning; we have to be very careful to avoid an inconsistent knowledge base. Much of the information contained in a document comes from different sources. Identifying the ultimate source of an item in a database or document is termed data provenance, and it is used for modeling and reasoning about trust. Systems that extract and reason about facts and knowledge available on the Semantic Web will need such provenance (a minimal sketch follows this list) to:

i.     Inform our trust model and make better decisions about the trustworthiness of each fact, and

ii.   Remove duplicate facts from our semantic model.
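            The following minimal sketch illustrates both points, assuming a hypothetical table of per-source trust scores; the class name, the trust policy, and the example URLs are all illustrative.

```python
from collections import defaultdict

# Hypothetical trust scores per source; a real system would derive these
# from a trust model rather than hard-code them.
SOURCE_TRUST = {
    "http://trusted.example.org/": 0.9,
    "http://unknown.example.net/": 0.3,
}

class ProvenanceStore:
    def __init__(self):
        self._facts = defaultdict(set)   # triple -> set of source URIs

    def add(self, triple, source):
        # Identical triples collapse into one key, so recording provenance
        # also removes duplicate facts (point ii above).
        self._facts[triple].add(source)

    def trust(self, triple):
        # One simple policy (point i above): trust a fact as much as its
        # most trusted source.
        return max((SOURCE_TRUST.get(s, 0.0) for s in self._facts[triple]),
                   default=0.0)

store = ProvenanceStore()
fact = ("ex:TigerWoods", "ex:playsSport", "ex:Golf")
store.add(fact, "http://trusted.example.org/")
store.add(fact, "http://unknown.example.net/")   # same fact from a second source
print(store.trust(fact))   # -> 0.9
```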

Dealing with the Control of Search Engines

            The basic steps were elaborated earlier: forming a query, retrieving documents, processing the results, and repeating. The question is how to decide, while analyzing the ranked results, whether to pay full attention to them or to reformulate the query and generate a new result set. This is similar to the choice faced by an agent in a multi-agent system that must decide when to start reasoning with the information it possesses and when to ask other agents for more information.
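            A minimal sketch of this query-retrieve-analyze-reformulate loop is shown below; the search, good_enough, and reformulate callables are placeholders for the retrieval back end and the agent's own decision procedures, not part of any published interface.

```python
def search_loop(initial_query, search, good_enough, reformulate, max_rounds=3):
    # `search`, `good_enough` and `reformulate` are placeholders supplied by
    # the caller; `max_rounds` bounds how often the query is reformulated.
    query = initial_query
    results = []
    for _ in range(max_rounds):
        results = search(query)              # retrieve ranked documents
        if good_enough(results):             # reason with what we already have...
            break
        query = reformulate(query, results)  # ...or reformulate and try again
    return results
```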

 Spiders

          Web search engines do not normally process markup. While indexing a document or web page, a search engine can instead be given a spidered, preprocessed, swangled version of the page. The HTTP server is configured to make this work: it checks whether the requesting agent is a spider and, if so, returns the swangled version of the page; otherwise it returns the original source page. The preprocessing of the document can be done in advance or at caching time.
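          As a minimal sketch of the server-side check, assuming the spider is recognized by its User-Agent string (the list of crawler names is purely illustrative):

```python
KNOWN_SPIDERS = ("Googlebot", "Bingbot", "Slurp")   # purely illustrative list

def choose_page(user_agent, original_html, swangled_html):
    # Inside the HTTP server's request handler: crawlers get the
    # preprocessed, swangled page; ordinary browsers get the original.
    if any(bot in user_agent for bot in KNOWN_SPIDERS):
        return swangled_html
    return original_html

print(choose_page("Mozilla/5.0 (compatible; Googlebot/2.1)",
                  "original page", "swangled page"))
# -> swangled page
```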

 Offsite annotation

         The technique above requires control over the HTTP servers associated with the Semantic Web pages. If this control is not available, an alternative is to mirror the pages on a server where automatic swangling can be applied. The mirrored pages must carry annotations in RDF that record the links between the source and mirrored pages.
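         A minimal sketch of such an annotation is shown below, assuming the rdflib library and rdfs:seeAlso as the property that links the swangled mirror back to its source; both the URLs and the choice of property are illustrative rather than prescribed.

```python
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

# Hypothetical URLs for the original page and its swangled mirror.
source = URIRef("http://example.org/page.html")
mirror = URIRef("http://mirror.example.net/page-swangled.html")

g = Graph()
# rdfs:seeAlso is one plausible property for recording the link between
# the mirrored, swangled page and its source.
g.add((mirror, RDFS.seeAlso, source))

print(g.serialize(format="turtle"))
```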

Limitations of Search Engines

            There are limitations in using web-based search engines, including how they tokenize text and what constraints they place on queries. We would like swangled terms to be accepted as indexable terms by traditional search engines. The two retrieval systems used by OWLIR were very flexible in processing text and accepting tokens: a token could be of any length and could include almost any number of non-whitespace characters. Commercial systems impose stricter constraints. With Google, the most widely used search engine, tokens are limited to about 50 characters and should consist only of upper- and lowercase alphabetic characters. Some commercial systems also limit the number of terms in a query; Google limits a query to ten terms.
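            As a minimal sketch of squeezing swangled tokens into such constraints, the snippet below re-encodes each token as a letters-only string of at most 50 characters and truncates the query to ten terms; the hashing and encoding scheme is an assumption made for illustration.

```python
import hashlib

MAX_TOKEN_LEN = 50    # limits quoted in the discussion above
MAX_QUERY_TERMS = 10
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def letters_only(token):
    # Re-encode the token's hash using letters only, so it survives engines
    # that reject digits and punctuation in indexable terms.
    n = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
    out = []
    while n and len(out) < MAX_TOKEN_LEN:
        n, r = divmod(n, 26)
        out.append(ALPHABET[r])
    return "".join(out)

def constrain_query(swangled_tokens):
    # Trim each token to the length limit and the query to the term limit.
    return [letters_only(t) for t in swangled_tokens][:MAX_QUERY_TERMS]

print(constrain_query(["swangleabc123", "swangledef456"]))
```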

Conclusion

            There are two types of documents on the Semantic Web. One type is the plain text document carrying annotations that provide metadata as well as machine-interpretable statements capturing some of the meaning of the document's contents. Using and developing information retrieval systems over such documents presents new challenges. A framework has been presented for integrating search and inference that supports both retrieval-driven and inference-driven processing. It allows today's text-based search engines to index both text and markup as indexing terms. Other challenges remain to be resolved before the benefits of this approach can be fully realized. Documents whose content is encoded in RDF are also part of the Semantic Web. Swangling techniques can be used to enrich documents so that some of their meaning is captured in terms that search engines index. Finally, there is also a role for specialized search engines designed to work over collections of RDF documents.

             The approach discussed above can be seen as an evolution of the classic vector space model, in which keyword-based document files are replaced by an ontology-based knowledge base. The semi-automatic annotation and weighting procedure is likewise the counterpart of the indexing and keyword-extraction process.

            It has been shown that it is possible to develop a ranking algorithm whose results are consistent. Compared with keyword-based search, measurable improvements can be achieved given a critical mass of metadata of sufficient quality.

            Building and sharing well-defined ontologies raises many problems, and these are inherited by almost all such IR systems. Semantic Web searches also map keywords to concepts.

            This research domain is yielding promising results. The aim of this paper is to review and present the most consistent model of recent advances on the problems connected with improving semantic search.

            Much further improvement is possible, since the annotation weighting scheme does not yet take advantage of the differing relevance of structured and unstructured document fields. Experimenting with documents whose annotating statements are instances is another interesting possibility.
