Efficient Fuzzy Type-Ahead Search in XML Data

ABSTRACT:

In a traditional keyword-search system over XML data, a user composes a keyword query, submits it to the system, and retrieves relevant answers. In the case where the user has limited knowledge about the data, often the user feels “left in the dark” when issuing queries, and has to use a try-and-see approach for finding information. In this paper, we study fuzzy type-ahead search in XML data, a new information-access paradigm in which the system searches XML data on the fly as the user types in query keywords. It allows users to explore data as they type, even in the presence of minor errors of their keywords. Our proposed method has the following features: 1) Search as you type: It extends Autocomplete by supporting queries with multiple keywords in XML data. 2) Fuzzy: It can find high-quality answers that have keywords matching query keywords approximately. 3) Efficient: Our effective index structures and searching algorithms can achieve a very high interactive speed. We study research challenges in this new search framework. We propose effective index structures and top-k algorithms to achieve a high interactive speed. We examine effective ranking functions and early termination techniques to progressively identify the top-k relevant answers. We have implemented our method on real data sets, and the experimental results show that our method achieves high search efficiency and result quality.



ARCHITECTURE:


 

EXISTING SYSTEM:

TRADITIONAL methods use query languages such as XPath and XQuery to query XML data. These methods are powerful but unfriendly to nonexpert users. First, these query languages are hard to comprehend for nondatabase users. For example, XQuery is fairly complicated to grasp. Second, these languages require the queries to be posed against the underlying, sometimes complex, database schemas.

In a traditional keyword-search system over XML data, a user composes a query, submits it to the system, and retrieves relevant answers from XML data. This information- access paradigm requires the user to have certain knowledge about the structure and content of the underlying data repository. In the case where the user has limited knowledge about the data, often the user feels “left in the dark” when issuing queries, and has to use a try-and-see approach for finding information. He tries a few possible keywords, goes through the returned results, modifies the keywords, and reissues a new query. He needs to repeat this step multiple times to find the information, if lucky enough. This search interface is neither efficient nor user friendly. Many systems are introducing various features to solve this problem.

DISADVANTAGES:
One limitation of Autocomplete is that the system treats a query with multiple keywords as a single string; thus, it does not allow these keywords to appear at different places. For instance, consider the search box on Apple.com, which allows Autocomplete search on Apple products. Although a keyword query “iphone” can find a record “iphone has some great new features,” a query with keywords “iphone features” cannot find this record (as of February 2010), because these two keywords appear at different places in the answer.

PROPOSED SYSTEM:
In this paper, we propose TASX (pronounced “task”), a fuzzy type-ahead search method in XML data. TASX searches the XML data on the fly as users type in query keywords, even in the presence of minor errors of their keywords. TASX provides a friendly interface for users to explore XML data, and can significantly save users typing effort. In this paper, we study research challenges that arise naturally in this computing paradigm. The main challenge is search efficiency. Each query with multiple keywords needs to be answered efficiently. To make search really interactive, for each keystroke on the client browser, from the time the user presses the key to the time the results computed from the server are displayed on the browser, the delay should be as small as possible. An interactive speed requires this delay should be within milliseconds. Notice that this time includes the network transfer delay, execution time on the server, and the time for the browser to execute its Java- Script. This low-running-time requirement is especially challenging when the backend repository has a large amount of data. To achieve our goal, we propose effective index structures and algorithms to answer keyword queries in XML data.

 

MODULES:

Index Structures

Answering Queries with a Single Keyword
Fuzzy Search
Answering Queries with Multiple Keywords
Minimal-Cost Tree

MODULES DESCRIPTION:

Index Structures

We use a trie structure to index the words in the underlying XML data. Each word w corresponds to a unique path from the root of the trie to a leaf node. Each node on the path has a label of a character in w. For each leaf node, we store an inverted list of IDs of XML elements that contain the word of the leaf node. For instance, consider the XML document in Fig. 1. The trie structure for the tokenized words is shown in Fig. 2. The word “mich” has a node ID of 10. Its inverted list includes XML elements 18 and 26.
Answering Queries with a Single Keyword
We first study how to answer a query with a single keyword using the trie structure. Each keystroke that a user types invokes a query of the current string, and the client browser sends the query string to the server. We first consider the case of exact search. One naive way to process such a query on the server is to answer the query from scratch as follows: we first find the trie node corresponding to this keyword by traversing the trie from the root. Then, we locate the leaf descendants of this node, and retrieve the corresponding predicted words and the predicted XML elements on the inverted lists.

Fuzzy Search
Obviously, for exact search, given a partial keyword, there exists at most one trie node for the keyword. We retrieve the leaf descendants of this trie node as the predicted words. However, for fuzzy search, there could be multiple trie nodes that are similar to the partial keyword within a given edit-distance threshold, called active nodes. For example, both nodes “mices” and “mich” on the trie

Answering Queries with Multiple Keywords
Now, we consider how to do fuzzy type-ahead search in the case of a query with multiple keywords. For a keystroke that invokes a query, we first tokenize the query string into keywords, k1; k2; . . . ; k‘. For each keyword ki (1 _ i _ ‘), we compute its corresponding active nodes, and for each such active node, we retrieve its leaf descendants and corresponding inverted lists Minimal-Cost Tree. We use the semantics of ELCA to compute the corresponding answers. We use the binary-search-based method to compute ELCAs. We will introduce an effective ranking function in considering fuzzy search

Minimal-Cost Tree
In this section, we introduce a new framework to find relevant answers to a keyword query over an XML document. In the framework, each node on the XML tree is potentially relevant to the query with different scores. For each node, we define its corresponding answer to the query as its subtree with paths to nodes that include the query keywords. This subtree is called the “minimal-cost tree” for this node. Different nodes correspond to different answers to the query, and we will study how to quantify the relevance of each answer to the query for ranking

SYSTEM CONFIGURATION:-

HARDWARE REQUIREMENTS:-


ü Processor             -Pentium –III

ü Speed                             -    1.1 Ghz
ü RAM                    -    256 MB(min)
ü Hard Disk            -   20 GB
ü Floppy Drive       -    1.44 MB
ü Key Board            -    Standard Windows Keyboard
ü Mouse                  -    Two or Three Button Mouse
ü Monitor                -    SVGA

 

SOFTWARE REQUIREMENTS:-


v   Operating System          : Windows95/98/2000/XP
v   Front End                      : Java

REFERENCE:
Jianhua Feng, Senior Member, IEEE, and Guoliang Li, Member, IEEE, “Efficient Fuzzy Type-Ahead Search in XML Data”, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012.