Annotating Search Results from Web Databases
ABSTRACT:
An increasing number of databases have become web-accessible through HTML
form-based search interfaces. The data units returned from the underlying
database are usually encoded into the result pages dynamically for human
browsing. For the encoded data units to be machine processable, which is
essential for many applications such as deep web data collection and Internet
comparison shopping, they need to be extracted and assigned meaningful labels.
In this paper, we present an automatic annotation approach that first aligns
the data units on a result page into different groups such that the data in the
same group have the same semantics. Then, we annotate each group from different
aspects and aggregate the different annotations to predict a final annotation
label for it. An annotation wrapper for the search site is automatically
constructed and can be used to annotate new result pages from the same web
database. Our experiments indicate that the proposed approach is highly
effective.
EXISTING SYSTEM:
A data unit is a piece of text that semantically represents one concept of an
entity. It corresponds to the value of a record under an attribute. It differs
from a text node, which is a sequence of text surrounded by a pair of HTML
tags. The paper describes the relationships between text nodes and data units
in detail and performs annotation at the data unit level. There is a high
demand for collecting data of interest from multiple web databases (WDBs). For
example, once a book comparison shopping system collects multiple search result
records (SRRs) from different book sites, it needs to determine whether any two
SRRs refer to the same book.
DISADVANTAGES OF EXISTING SYSTEM:
A comparison shopping system can match two book SRRs by their ISBNs. If ISBNs
are not available, the titles and authors could be compared instead. The system
also needs to list the prices offered by each site. Thus, the system needs to
know the semantics of each data unit. Unfortunately, the semantic labels of
data units are often not provided in result pages. For instance, no semantic
labels for the values of title, author, publisher, etc., are given. Having
semantic labels for data units is important not only for the above record
linkage task, but also for storing collected SRRs into a database table.
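The record linkage step described above can be sketched in Java as follows.
This is a minimal illustration under stated assumptions, not the system's
actual code: the class BookRecord, its fields, and the normalization rule
(lowercase, keep only letters and digits) are hypothetical. The sketch only
works once each data unit carries a semantic label such as title, author, or
ISBN, which is exactly what annotation is meant to provide.

```java
import java.util.Locale;

/** Hypothetical annotated book record; field names are assumptions for illustration. */
final class BookRecord {
    final String isbn;    // may be null when the site does not show an ISBN
    final String title;
    final String author;
    final double price;

    BookRecord(String isbn, String title, String author, double price) {
        this.isbn = isbn;
        this.title = title;
        this.author = author;
        this.price = price;
    }

    /** Lowercases the text and removes everything except ASCII letters and digits. */
    static String normalize(String s) {
        return s == null ? "" : s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Compares ISBNs when both are present; otherwise falls back to title and author. */
    static boolean sameBook(BookRecord a, BookRecord b) {
        if (a.isbn != null && b.isbn != null) {
            return normalize(a.isbn).equals(normalize(b.isbn));
        }
        return !normalize(a.title).isEmpty()
                && normalize(a.title).equals(normalize(b.title))
                && normalize(a.author).equals(normalize(b.author));
    }
}
```

Exact matching after normalization is the simplest possible choice. A real
comparison shopping system would likely need approximate string matching for
titles and author names.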
PROPOSED SYSTEM:
In this paper, we consider how to automatically assign labels to the data
units within the SRRs returned from WDBs. Given a set of SRRs that have been
extracted from a result page returned from a WDB, our automatic annotation
solution consists of three phases: aligning the data units into groups,
annotating each group, and constructing an annotation wrapper for the WDB.
ADVANTAGES OF PROPOSED SYSTEM:
This paper has the following contributions:
· While most existing approaches simply assign labels to each HTML text node,
we thoroughly analyze the relationships between text nodes and data units. We
perform data unit level annotation.
· We propose a clustering-based shifting technique to align data units into
different groups so that the data units inside the same group have the same
semantics. Instead of using only the DOM tree or other HTML tag tree structures
of the SRRs to align the data units (as most current methods do), our approach
also considers other important features shared among data units, such as their
data types (DT), data contents (DC), presentation styles (PS), and adjacency
(AD) information.
· We utilize the integrated interface schema (IIS) over multiple WDBs in the
same domain to enhance data unit annotation. To the best of our knowledge, we
are the first to utilize IIS for annotating SRRs.
· We employ six basic annotators; each annotator can independently assign
labels to data units based on certain features of the data units. We also
employ a probabilistic model to combine the results from different annotators
into a single label. This model is highly flexible, so the existing basic
annotators may be modified and new annotators may be added easily without
affecting the operation of other annotators.
· We construct an annotation wrapper for any given WDB. The wrapper can be
applied to efficiently annotate the SRRs retrieved from the same WDB with new
queries.
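The summary above does not give the formula of the probabilistic model that
combines the annotators. As a hedged sketch, the Java code below assumes one
common choice: each annotator has a success rate p_i, annotators are treated as
independent, and a label is scored by the probability that at least one of the
annotators proposing it is correct, 1 - (1 - p_1)(1 - p_2)...(1 - p_k). The
names LabelCombiner and Vote and all success rates are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of combining labels from independent annotators (assumed model, see lead-in). */
final class LabelCombiner {

    /** One annotator's proposal: a label plus that annotator's assumed success rate in [0, 1]. */
    static final class Vote {
        final String label;
        final double successRate;

        Vote(String label, double successRate) {
            this.label = label;
            this.successRate = successRate;
        }
    }

    /** Returns, per label, 1 minus the product of (1 - successRate) over its proposers. */
    static Map<String, Double> combine(List<Vote> votes) {
        // First accumulate the probability that every proposer of a label is wrong.
        Map<String, Double> allWrong = new LinkedHashMap<String, Double>();
        for (Vote v : votes) {
            Double soFar = allWrong.get(v.label);
            allWrong.put(v.label, (soFar == null ? 1.0 : soFar) * (1.0 - v.successRate));
        }
        Map<String, Double> combined = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Double> e : allWrong.entrySet()) {
            combined.put(e.getKey(), 1.0 - e.getValue());
        }
        return combined;
    }

    /** Picks the label with the highest combined probability, or null if there are no votes. */
    static String bestLabel(List<Vote> votes) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Double> e : combine(votes).entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Under this assumed model, two annotators with success rates 0.6 and 0.5 that
both propose Title give a combined score of 1 - 0.4 x 0.5 = 0.8, which beats a
single annotator proposing Author with rate 0.7. A new annotator only
contributes additional votes, which fits the flexibility described in the
list above.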
ALGORITHMS USED:
Alignment algorithm
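The source names the alignment algorithm without describing it. As a rough
sketch of the idea behind feature-based alignment, the Java code below scores
how similar two data units are using three of the features mentioned earlier:
data type, data content, and presentation style. Everything here is an
assumption made for illustration: the class and method names, the coarse
regular-expression type detector, the token-overlap content measure, and the
weights. Adjacency information and the clustering and shifting steps are
omitted.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Sketch of a feature-based similarity between two data units (weights are assumptions). */
final class DataUnitSimilarity {

    enum DataType { INTEGER, DECIMAL, CURRENCY, DATE, STRING }

    /** Hypothetical data unit: its text plus a presentation style key such as "bold". */
    static final class DataUnit {
        final String text;
        final String style;

        DataUnit(String text, String style) {
            this.text = text;
            this.style = style;
        }
    }

    /** Coarse data type detection with regular expressions. */
    static DataType typeOf(String text) {
        String t = text.trim();
        if (t.matches("\\$\\s?\\d+(\\.\\d+)?")) return DataType.CURRENCY;
        if (t.matches("\\d{1,2}/\\d{1,2}/\\d{4}")) return DataType.DATE;
        if (t.matches("\\d+")) return DataType.INTEGER;
        if (t.matches("\\d+\\.\\d+")) return DataType.DECIMAL;
        return DataType.STRING;
    }

    /** Jaccard overlap of lowercase word tokens, used here as the data content similarity. */
    static double contentSimilarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<String>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<String>(ta);
        union.addAll(tb);
        return (double) inter.size() / union.size();
    }

    private static Set<String> tokens(String s) {
        Set<String> out = new HashSet<String>();
        for (String tok : s.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    /** Weighted sum of data type, content, and style agreement; result is in [0, 1]. */
    static double similarity(DataUnit a, DataUnit b) {
        double dt = typeOf(a.text) == typeOf(b.text) ? 1.0 : 0.0;
        double dc = contentSimilarity(a.text, b.text);
        double ps = a.style.equals(b.style) ? 1.0 : 0.0;
        return 0.4 * dt + 0.2 * dc + 0.4 * ps; // illustrative weights, not from the source
    }
}
```

With a score like this, two prices shown in the same style come out as more
similar than a price and a book title, so a clustering step could place the
prices in the same group.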
SYSTEM CONFIGURATION:
HARDWARE CONFIGURATION:
· Processor : Pentium IV
· Speed : 1.1 GHz
· RAM : 256 MB (min)
· Hard Disk : 20 GB
· Keyboard : Standard Windows Keyboard
· Mouse : Two or Three Button Mouse
· Monitor : SVGA
SOFTWARE CONFIGURATION:
· Operating System : Windows XP
· Programming Language : Java
· Java Version : JDK 1.6 and above
REFERENCE:
Yiyao Lu, Hai He, Hongkun Zhao, Weiyi Meng, Member, IEEE, and Clement Yu,
Senior Member, IEEE, "Annotating Search Results from Web Databases," IEEE
Transactions on Knowledge and Data Engineering, vol. 25, no. 3, March 2013.