Efficient Fuzzy
Type-Ahead Search in XML Data
ABSTRACT:
In a traditional keyword-search system
over XML data, a user composes a keyword query, submits it to the system, and retrieves
relevant answers. In the case where the user has limited knowledge about the
data, often the user feels “left in the dark” when issuing queries, and has to
use a try-and-see approach for finding information. In this paper, we study
fuzzy type-ahead search in XML data, a new information-access paradigm in which
the system searches XML data on the fly as the user types in query keywords. It
allows users to explore data as they type, even in the presence of minor errors
of their keywords. Our proposed method has the following features: 1) Search as
you type: It extends Autocomplete by supporting queries with multiple keywords
in XML data. 2) Fuzzy: It can find high-quality answers that have keywords
matching query keywords approximately. 3) Efficient: Our effective index structures
and searching algorithms can achieve a very high interactive speed. We study
research challenges in this new search framework. We propose effective index
structures and top-k algorithms to achieve a high interactive speed. We examine
effective ranking functions and early termination techniques to progressively
identify the top-k relevant answers. We have implemented our method on real
data sets, and the experimental results show that our method achieves high
search efficiency and result quality.
ARCHITECTURE:
EXISTING SYSTEM:
TRADITIONAL methods use query languages
such as XPath and XQuery to query XML data. These methods are powerful but
unfriendly to nonexpert users. First, these query languages are hard to
comprehend for nondatabase users. For example, XQuery is fairly complicated to
grasp. Second, these languages require the queries to be posed against the
underlying, sometimes complex, database schemas.
In a traditional keyword-search system
over XML data, a user composes a query, submits it to the system, and retrieves
relevant answers from XML data. This information- access paradigm requires the
user to have certain knowledge about the structure and content of the
underlying data repository. In the case where the user has limited knowledge
about the data, often the user feels “left in the dark” when issuing queries,
and has to use a try-and-see approach for finding information. He tries a few
possible keywords, goes through the returned results, modifies the keywords,
and reissues a new query. He needs to repeat this step multiple times to find
the information, if lucky enough. This search interface is neither efficient
nor user friendly. Many systems are introducing various features to solve this problem.
DISADVANTAGES:
One limitation of Autocomplete is that
the system treats a query with multiple keywords as a single string; thus, it
does not allow these keywords to appear at different places. For instance,
consider the search box on Apple.com, which allows Autocomplete search on Apple
products. Although a keyword query “iphone” can find a record “iphone has some
great new features,” a query with keywords “iphone features” cannot find this
record (as of February 2010), because these two keywords appear at different
places in the answer.
PROPOSED
SYSTEM:
In this paper, we propose TASX
(pronounced “task”), a fuzzy type-ahead search method in XML data. TASX
searches the XML data on the fly as users type in query keywords, even in the
presence of minor errors of their keywords. TASX provides a friendly interface
for users to explore XML data, and can significantly save users typing effort.
In this paper, we study research challenges that arise naturally in this
computing paradigm. The main challenge is search efficiency. Each query with
multiple keywords needs to be answered efficiently. To make search really
interactive, for each keystroke on the client browser, from the time the user presses
the key to the time the results computed from the server are displayed on the
browser, the delay should be as small as possible. An interactive speed
requires this delay should be within milliseconds. Notice that this time includes
the network transfer delay, execution time on the server, and the time for the
browser to execute its Java- Script. This low-running-time requirement is
especially challenging when the backend repository has a large amount of data.
To achieve our goal, we propose effective index structures and algorithms to
answer keyword queries in XML data.
MODULES:
Index Structures
Answering Queries with a Single Keyword
Fuzzy Search
Answering Queries with Multiple Keywords
Minimal-Cost Tree
MODULES DESCRIPTION:
Index Structures
We use a trie structure to index the words in the
underlying XML data. Each word w corresponds to a unique path from the root of
the trie to a leaf node. Each node on the path has a label of a character in w.
For each leaf node, we store an inverted list of IDs of XML elements that
contain the word of the leaf node. For instance, consider the XML document in
Fig. 1. The trie structure for the tokenized words is shown in Fig. 2. The word
“mich” has a node ID of 10. Its inverted list includes XML elements 18 and 26.
Answering
Queries with a Single Keyword
We first study how to answer a query
with a single keyword using the trie structure. Each keystroke that a user types
invokes a query of the current string, and the client browser sends the query
string to the server. We first consider the case of exact search. One naive way
to process such a query on the server is to answer the query from scratch as
follows: we first find the trie node corresponding to this keyword by
traversing the trie from the root. Then, we locate the leaf descendants of this
node, and retrieve the corresponding predicted words and the predicted XML
elements on the inverted lists.
Fuzzy
Search
Obviously, for exact search, given a partial
keyword, there exists at most one trie node for the keyword. We retrieve the leaf
descendants of this trie node as the predicted words. However, for fuzzy
search, there could be multiple trie nodes that are similar to the partial
keyword within a given edit-distance threshold, called active nodes. For
example, both nodes “mices” and “mich” on the trie
Answering
Queries with Multiple Keywords
Now, we consider how to do fuzzy
type-ahead search in the case of a query with multiple keywords. For a
keystroke that invokes a query, we first tokenize the query string into keywords,
k1; k2; . . . ; k‘. For each keyword ki (1 _ i _ ‘), we compute its
corresponding active nodes, and for each such active node, we retrieve its leaf
descendants and corresponding inverted lists Minimal-Cost Tree. We use the
semantics of ELCA to compute the corresponding answers. We use the
binary-search-based method to compute ELCAs. We will introduce an effective
ranking function in considering fuzzy search
Minimal-Cost
Tree
In this section, we introduce a new
framework to find relevant answers to a keyword query over an XML document. In
the framework, each node on the XML tree is potentially relevant to the query
with different scores. For each node, we define its corresponding answer to the
query as its subtree with paths to nodes that include the query keywords. This
subtree is called the “minimal-cost tree” for this node. Different nodes
correspond to different answers to the query, and we will study how to quantify
the relevance of each answer to the query for ranking
SYSTEM CONFIGURATION:-
HARDWARE REQUIREMENTS:-
ü Processor -Pentium –III
ü Speed - 1.1 Ghz
ü RAM - 256 MB(min)
ü Hard
Disk - 20 GB
ü Floppy
Drive - 1.44 MB
ü Key
Board - Standard Windows Keyboard
ü Mouse - Two or Three Button Mouse
ü Monitor - SVGA
SOFTWARE REQUIREMENTS:-
v Operating System : Windows95/98/2000/XP
v Front End : Java
REFERENCE:
Jianhua Feng, Senior Member, IEEE, and
Guoliang Li, Member, IEEE, “Efficient Fuzzy Type-Ahead Search in XML Data”, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA
ENGINEERING, VOL. 24, NO. 5, MAY 2012.