Tokenizing
Search
engines are designed so that, when processing a string of words, each word can
be treated as a token. When converting RDF triples into swangled terms, two
major issues arise: how to make RDF triples look like ordinary indexing terms
to a web search engine, and which triples are proper candidates for swangling
and what techniques should be used.
Some
search engines, such as Google, limit the size of a query, so queries must be
composed carefully to find the relevant documents. OWL and RDF documents also
contain anonymous nodes, known as blank nodes, which are used to assert the
existence of entities existentially; triples containing blank nodes can be
processed with special handling. A statistical model could also be developed
for OWL by applying statistical language models to annotations on documents.
The OWLIR
system uses the swangling approach to process triples. Further experiments are
needed to identify the most effective techniques for reducing a set of triples
to a set of tokens that an information retrieval system will accept.
The simplest
approach is to decompose each triple into its three components and swangle
each separately, but most of the structural information is lost by such
processing. OWLIR follows an approach that preserves more information: each
triple is transformed into seven patterns by replacing zero, one, or two of
its components with a special “don’t care” token. Each of the seven resulting
patterns is then reduced to a single word-like token for indexing.
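The following is a minimal sketch of this seven-pattern transformation, assuming a “*” wildcard marker and MD5 hashing; the exact marker and hash function are illustrative choices, not prescribed by the text.

```python
import hashlib
from itertools import product

WILDCARD = "*"  # assumed "don't care" marker

def swangle(subject, predicate, obj):
    """Generate the seven swangled tokens for one RDF triple.

    Each component is either kept or replaced by the wildcard; the
    all-wildcard pattern is dropped, leaving 2^3 - 1 = 7 patterns.
    Each pattern is hashed into a single word-like token that a
    conventional search engine will accept as an indexing term.
    """
    tokens = []
    for mask in product([True, False], repeat=3):
        if not any(mask):  # skip the pattern with no fixed component
            continue
        parts = [c if keep else WILDCARD
                 for c, keep in zip((subject, predicate, obj), mask)]
        pattern = " ".join(parts)
        # Hash the pattern to a fixed-length alphanumeric token.
        tokens.append("sw" + hashlib.md5(pattern.encode()).hexdigest())
    return tokens

print(swangle("ex:JohnSmith", "ex:worksFor", "ex:UMBC"))  # 7 tokens
```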
Reasoning and Trust
When
reasoning over Semantic Web markup, we must make a choice about where
inference takes place. One option is to reason over a document's markup before
it is indexed, so that a larger set of triples is indexed and can be
retrieved. Another is to reason over the RDF triples of a query prior to
submitting it to the retrieval system. The markup of retrieved documents can
also be considered for further reasoning. OWLIR leaves the choice to us: it
can reason both over documents while indexing them and over queries before
they are submitted.
A similar
question arises when deciding how much reasoning to do: whether to rely
largely on forward chaining, as OWLIR does, or to employ both forward and
backward reasoning.
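As an illustration of index-time forward chaining, here is a minimal closure computation; the rule representation and the ex:subOrganizationOf transitivity rule are assumptions made for the example, not OWLIR's actual reasoner.

```python
def forward_closure(facts, rules):
    """Apply every rule to the fact set until no new triples
    appear (a naive forward-chaining fixpoint)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for fact in rule(facts):
                if fact not in facts:
                    facts.add(fact)
                    changed = True
    return facts

def transitive(facts):
    # Hypothetical rule: ex:subOrganizationOf is transitive.
    return {(a, "ex:subOrganizationOf", c)
            for (a, p, b) in facts if p == "ex:subOrganizationOf"
            for (b2, q, c) in facts
            if q == "ex:subOrganizationOf" and b2 == b}

facts = {("ex:lab", "ex:subOrganizationOf", "ex:dept"),
         ("ex:dept", "ex:subOrganizationOf", "ex:univ")}
print(forward_closure(facts, [transitive]))  # derives lab -> univ
```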
Trustable Knowledge
A wide
variety of information is available on the Semantic Web, and its reliability
is not authenticated. The facts asserted in a newly found relevant document
cannot simply be injected into our reasoning; we must be very careful to avoid
an inconsistent knowledge base. Much of the information contained in a
document comes from different sources, and identifying the ultimate source of
data in a database or document is termed data provenance, which is used for
modeling and reasoning. For systems that extract and reason about facts and
knowledge available on the Semantic Web, provenance will be important to:
i. inform our trust model and make better
decisions about the trustworthiness of each fact, and
ii. remove duplicate facts from our
semantic model.
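A toy sketch of such a provenance-aware store follows; the class, its API, and the corroboration-counting trust score are assumptions made for illustration.

```python
from collections import defaultdict

class FactStore:
    """Provenance-aware triple store (illustrative only).

    Duplicate triples from different sources collapse into one fact
    whose provenance set records every origin, so a trust model can
    weigh corroboration when deciding what to believe.
    """
    def __init__(self):
        self.provenance = defaultdict(set)  # triple -> set of sources

    def add(self, triple, source):
        self.provenance[triple].add(source)

    def trust(self, triple):
        # Naive assumption: trust grows with independent corroboration.
        return len(self.provenance[triple])

store = FactStore()
fact = ("ex:Bob", "ex:worksFor", "ex:UMBC")
store.add(fact, "http://a.example/doc1")
store.add(fact, "http://a.example/doc1")  # duplicate is absorbed
store.add(fact, "http://b.example/doc2")
print(store.trust(fact))  # 2 independent sources
```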
Dealing with the Control of Search Engines
The
basic steps were elaborated earlier: forming a query, retrieving documents,
processing the results, and repeating. The question is how to decide, while
analyzing the ranked results, whether to devote full attention to them or to
reformulate the query and generate a new result set. This is similar to the
choice faced by an agent in a multi-agent system that must decide when to
start reasoning with the information it already possesses and when to ask
other agents for more information.
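The query-retrieve-refine cycle can be sketched as below; retrieve, good_enough, and reformulate are hypothetical callables standing in for the search engine, the stopping criterion, and the reformulation strategy.

```python
def search_loop(query, retrieve, good_enough, reformulate, max_rounds=3):
    """Repeatedly query, inspect the ranked results, and either
    commit to them or reformulate and try again."""
    results = []
    for _ in range(max_rounds):
        results = retrieve(query)
        if good_enough(results):  # commit: analyze these results
            return results
        query = reformulate(query, results)  # else ask again
    return results  # budget exhausted: settle for the last result set
```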
Web search
engines do not normally process markup. However, a search engine's spider can
be given a preprocessed, swangled version of a web page when it indexes the
document. Accomplishing this requires control over the HTTP server: the server
checks whether the requesting agent is a spider and, if so, returns the
swangled version of the page; otherwise the original source page is returned.
The preprocessing of the document can be done in advance or at caching time.
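A minimal sketch of such a server follows, assuming spiders are recognized by User-Agent signatures; the signature list and the page-loading helpers are placeholders for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

SPIDER_SIGNATURES = ("Googlebot", "bingbot")  # assumed signature list

def load_original(path):
    return "<html><!-- original markup --></html>"  # stub

def load_swangled(path):
    return "<html><!-- swangled tokens --></html>"  # stub

class SwangleHandler(BaseHTTPRequestHandler):
    """Serve the swangled page to crawlers, the original otherwise."""
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if any(sig in agent for sig in SPIDER_SIGNATURES):
            body = load_swangled(self.path)  # preprocessed copy
        else:
            body = load_original(self.path)  # page as authored
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

# HTTPServer(("", 8080), SwangleHandler).serve_forever()
```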
The above
technique requires control over the HTTP servers associated with the Semantic
Web pages. If this control is not available, an alternative can be adopted:
the pages can be mirrored on a server that swangles them automatically. The
mirrored pages must carry RDF annotations that record the links between the
source and mirrored pages.
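Such an annotation might be emitted as below; the ex:mirrorOf predicate is an invented name, since the text only requires that the annotation relate the two pages.

```python
def mirror_annotation(source_url, mirror_url):
    """Emit an RDF statement (N-Triples syntax) linking a mirrored,
    swangled page back to its source page."""
    return f"<{mirror_url}> <http://example.org/mirrorOf> <{source_url}> ."

print(mirror_annotation("http://origin.example/page",
                        "http://mirror.example/page"))
```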
Limitations of Search Engines
There are
some limitations in using web-based search engines, including how they
tokenize text and what constraints they apply to queries. We would like
swangled terms to be accepted as indexable terms by traditional search
engines. The two retrieval systems used by OWLIR were very flexible in their
text processing and acceptance of tokens: a token could be of any length and
could include almost any number of non-whitespace characters. Commercial
systems have stricter constraints. Google, the most widely used search engine,
advises tokens of up to 50 characters composed only of lowercase and uppercase
alphabetic characters. Some commercial systems also limit the number of terms
in a query; Google allows at most ten terms.
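One way to fit swangled terms within such constraints is sketched below: the term's hash is re-encoded in letters only and truncated, and over-long queries are clipped. The limits and the base-26 encoding are assumptions for the example.

```python
import hashlib
import string

MAX_TOKEN_LEN = 50    # Google-style token limit from the text
MAX_QUERY_TERMS = 10  # Google-style query limit from the text

def to_alpha_token(term):
    """Map an arbitrary swangled term to a letters-only token by
    re-encoding its MD5 hash in base-26 (a-z)."""
    n = int(hashlib.md5(term.encode()).hexdigest(), 16)
    letters = []
    while n and len(letters) < MAX_TOKEN_LEN:
        n, r = divmod(n, 26)
        letters.append(string.ascii_lowercase[r])
    return "".join(letters)

def clip_query(terms):
    # Keep only the first MAX_QUERY_TERMS terms of an over-long query.
    return terms[:MAX_QUERY_TERMS]

print(to_alpha_token("ex:JohnSmith ex:worksFor *"))  # alphabetic, <= 50 chars
```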
Conclusion
The
Semantic Web contains two types of documents. One type is the simple text
document with annotations that provide metadata as well as capture some of the
meaning of the document's contents as machine-interpretable statements. Using
and developing information retrieval systems for these documents raises new
challenges. A framework has been presented for integrating search and
inference that supports both retrieval-driven and inference-driven processing;
by using both text and markup as indexing terms at the same time, it is
supported by today's text-based search engines. Other challenges remain to be
resolved to obtain the full benefits of pursuing this approach. Documents
whose content is encoded entirely in RDF are also available on the Semantic
Web. Swangling techniques can be used to enrich documents so that some of
their meaning is captured in the terms indexed by search engines. Finally,
there is also a role for specialized search engines designed to work over
collections of RDF documents.
It
has been shown that it is possible to develop a ranking algorithm whose
results are consistent. Given a critical mass of metadata of sufficient
quality, measurable improvements over keyword-based search can be achieved.
There
are also many problems in building and sharing well-defined ontologies, and
these are inherited by almost all semantic IR systems. In Semantic Web search,
keywords must also be mapped to concepts.
This
research domain is yielding promising results. The aim of this paper has been
to review the problems connected with improving semantic search and to present
the most consistent model of recent advances on them.
Further
improvements are possible, since the annotation weighting scheme does not yet
exploit the differing relevance of structured and unstructured document
fields. Annotating documents with statements about instances is also an
interesting possibility to experiment with.