Evaluating Data Reliability: An Evidential Answer with Application to a Web-Enabled Data Warehouse
Abstract
There are many available methods to integrate information-source reliability into an uncertainty representation, but only a few works focus on the problem of evaluating this reliability. However, data reliability and confidence are essential components of a data warehousing system, as they influence subsequent retrieval and analysis. In this paper, we propose a generic method to assess data reliability from a set of criteria using the theory of belief functions. Customizable criteria and insightful decision support are provided. The chosen illustrative example comes from real-world data drawn from the Sym’Previus predictive-microbiology data warehouse.
Existing System
The growth of the web and the emergence of dedicated data warehouses offer great opportunities to collect additional data, be it to build models or to make decisions. During collection, data reliability is mostly ensured by measurement device calibration, by adapted experimental design, and by statistical repetition. However, full traceability is no longer ensured when data are reused at a later time by other scientists, even though these data are then used in further inferences. Estimating their reliability is especially important in areas where data are scarce and difficult to obtain, as is the case, for example, in the Life Sciences. The reliability of these data depends on many different aspects and pieces of meta-information: data source, experimental protocol, and so on. Developing generic tools to evaluate this reliability represents a true challenge for the proper use of distributed data.
Disadvantages
· Different criteria may provide conflicting information about data reliability, and this conflict has to be handled.
· Interval-valued evaluations based on lower and upper expectation notions are used to numerically summarize the results, for their capacity to reflect the imprecision in the final knowledge (see the sketch after this list).
· The method addresses the question of ordering data by groups of decreasing reliability and, subsequently, the presentation of informative results to end users.
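To make the interval-valued summary concrete, here is a minimal sketch, assuming a discrete reliability scale {1,...,4} and a hypothetical merged mass function; it computes the lower and upper expectations that bound the final reliability score.

```python
# Sketch: interval-valued reliability summary via lower/upper expectations.
# Assumptions: a discrete reliability scale {1,...,4} (1 = least reliable)
# and a hypothetical merged mass function over subsets of that scale.

def lower_upper_expectation(mass, score=lambda x: x):
    """For a mass function m over focal sets A:
    E_lower = sum_A m(A) * min_{x in A} score(x),
    E_upper = sum_A m(A) * max_{x in A} score(x)."""
    lower = sum(m * min(score(x) for x in A) for A, m in mass.items())
    upper = sum(m * max(score(x) for x in A) for A, m in mass.items())
    return lower, upper

# Hypothetical merged mass function on the reliability scale {1, 2, 3, 4}.
mass = {
    frozenset({3, 4}): 0.6,        # evidence pointing to "reliable"
    frozenset({2, 3}): 0.3,        # partially conflicting evidence
    frozenset({1, 2, 3, 4}): 0.1,  # remaining ignorance
}

low, high = lower_upper_expectation(mass)
print(f"reliability score lies in [{low:.2f}, {high:.2f}]")  # [2.50, 3.70]
```

The width of the resulting interval directly reflects how imprecise the merged knowledge is: total ignorance yields the whole scale, while fully agreeing precise criteria yield a single value.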
Proposed System
When evaluating the reliability of data, it is natural to be interested in the reasons explaining why some particular data were assessed as (un)reliable. We show how maximal coherent subsets of criteria, i.e., groups of agreeing criteria, may provide some insight as to which reasons have led to a particular assessment. We also present an application of the method to @Web, a web-enabled data warehouse. Indeed, the framework developed in this paper was originally motivated by the need to estimate the reliability of scientific experimental results collected in open data warehouses. To lighten the burden laid upon domain experts when selecting data for a particular application, it is necessary to give them indicative reliability estimations. Formalizing reliability criteria will hopefully be a better asset for them to justify their choices and to capitalize on knowledge than the use of an ad hoc estimation. Tool development was carefully done using Semantic Web recommended languages, so that the created tools would be generic and reusable in other data warehouses. This required an advanced design step, which is important to ensure modularity and to foresee future evolutions.
Advantages
· The notion of trust only makes sense if the source can be suspected of lying in order to gain some advantage, and it is distinct from reliability.
· The approach differentiates between individual-level and system-level trust: the former concerns the trust one has in a particular agent, while the latter concerns the overall system and how it ensures that no one will be able to take advantage of it.
Modules
o Global Reliability Information
o Maximal Coherent Subsets
o Web-Enabled Data Warehouse
o Web Presentation
o Data Reliability Management
o Predictive Food Microbiology
Module Description
1. Global Reliability Information
Each criterion provides an evaluation of a particular value, yielding S different fuzzy sets as pieces of information. We propose to use evidence theory to merge this information into a global representation. This choice is motivated by the richness of the merging rules it provides and by the good compromise it represents in terms of expressiveness and tractability. Indeed, it encompasses fuzzy sets and probability distributions as particular representations.
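As a minimal sketch of this step, assuming a discrete reliability scale and hypothetical membership degrees, the code below turns one criterion's fuzzy set into a consonant mass function via its alpha-cuts, which is the standard correspondence between possibility distributions and belief functions.

```python
# Sketch: from a fuzzy set (possibility distribution) on a discrete
# reliability scale to a consonant mass function via alpha-cuts.
# The membership degrees below are hypothetical.

def fuzzy_to_mass(membership):
    """membership: dict value -> degree in [0, 1], with max degree == 1.
    Returns a mass function {focal set: mass} whose focal sets are the
    nested alpha-cuts, with m(cut_i) = alpha_i - alpha_{i+1}."""
    alphas = sorted(set(membership.values()) | {0.0}, reverse=True)
    mass = {}
    for hi, lo in zip(alphas, alphas[1:]):
        cut = frozenset(v for v, d in membership.items() if d >= hi)
        mass[cut] = mass.get(cut, 0.0) + (hi - lo)
    return mass

# Hypothetical fuzzy evaluation of one criterion on the scale {1,...,4}.
criterion = {1: 0.0, 2: 0.4, 3: 1.0, 4: 0.7}
print(fuzzy_to_mass(criterion))
# {frozenset({3}): 0.3, frozenset({3, 4}): 0.3, frozenset({2, 3, 4}): 0.4}
```

The resulting mass functions, one per criterion, are the inputs to the merging step described next.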
2. Maximal Coherent Subsets
To handle the problem of conflicting information, we propose a merging strategy based on maximal coherent subsets (MCS). This notion was introduced by Rescher and Manor as a means to infer from inconsistent logic bases, and it can be easily extended to the case of quantitative uncertainty representations. Given a set of conflicting sources, MCS merging consists in applying a conjunctive operator within each nonconflicting (maximal) subset of sources, and then using a disjunctive operator between the partial results. With such a method, as much precision as possible is gained while no source is neglected, an attractive feature in information fusion. In general, detecting maximal coherent subsets is NP-hard; however, in some particular cases this complexity may be significantly reduced.
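One such tractable case is when each source's information reduces to an interval: the maximal coherent subsets can then be found with a simple sweep over sorted endpoints. The sketch below illustrates this on hypothetical single-criterion interval evaluations.

```python
# Sketch: maximal coherent subsets (MCS) of intervals on the real line.
# For intervals, MCS can be found by a sweep over sorted endpoints
# instead of the NP-hard general procedure. The intervals below are
# hypothetical evaluations coming from three criteria.

def maximal_coherent_subsets(intervals):
    """intervals: list of (low, high) closed intervals. Returns a list of
    index sets, each a maximal group of pairwise-intersecting intervals."""
    events = []  # (coordinate, is_right_endpoint, interval index)
    for i, (lo, hi) in enumerate(intervals):
        events.append((lo, 0, i))
        events.append((hi, 1, i))
    events.sort()  # at equal coordinates, left endpoints come first
    active, result, last_was_left = set(), [], False
    for _, is_right, i in events:
        if is_right:
            if last_was_left:          # the active set is maximal here
                result.append(set(active))
            active.discard(i)
            last_was_left = False
        else:
            active.add(i)
            last_was_left = True
    return result

# Criteria 0 and 1 agree, criterion 2 conflicts with 0 but overlaps 1.
print(maximal_coherent_subsets([(0, 2), (1, 4), (3, 5)]))
# [{0, 1}, {1, 2}]
```

A conjunctive operator (e.g., intersection) is then applied within each returned subset, and the partial results are combined disjunctively (e.g., by union).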
3. Web-Enabled Data Warehouse
We present an application of the method to @Web, a web-enabled data warehouse. Indeed, the framework developed in this paper was originally motivated by the need to estimate the reliability of scientific experimental results collected in open data warehouses. To lighten the burden laid upon domain experts when selecting data for a particular application, it is necessary to give them indicative reliability estimations. Formalizing reliability criteria will hopefully be a better asset for them to justify their choices and to capitalize on knowledge than the use of an ad hoc estimation.
4. Web Presentation
@Web is a data warehouse opened on the web. Its current version is centered on the integration of heterogeneous data tables extracted from web documents. The focus has been put on web tables for two reasons: experimental data are often summarized in tables, and such data are already structured and easier to integrate in a data warehouse than, e.g., text or graphics.
5. Data Reliability Management
This module presents the @Web extension that attaches a reliability estimation to each table, in order to display the results of a user query ordered by decreasing reliability values. Even if a data table can include several items, the table level has been retained, as data from a given table are usually issued from the same experimental setup and therefore share the same reliability criteria.
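A minimal sketch of such an ordering follows, assuming each table already carries an interval-valued reliability [lower, upper] (computed as in the lower/upper expectation sketch above). The ranking rule shown, decreasing midpoint with narrower intervals first on ties, is one simple choice for comparing intervals, not necessarily the exact rule used by @Web; the table names and values are hypothetical.

```python
# Sketch: order query-result tables by decreasing interval-valued
# reliability. Table names and intervals below are hypothetical.

from dataclasses import dataclass

@dataclass
class Table:
    name: str
    rel_low: float   # lower expectation of the reliability score
    rel_high: float  # upper expectation of the reliability score

def order_tables(tables):
    """Sort by decreasing interval midpoint; among equal midpoints,
    narrower intervals (less imprecision) rank first."""
    return sorted(
        tables,
        key=lambda t: (-(t.rel_low + t.rel_high) / 2,
                       t.rel_high - t.rel_low),
    )

results = [
    Table("growth_curves_A", 2.5, 3.7),
    Table("growth_curves_B", 3.1, 3.5),
    Table("old_survey_C", 1.2, 2.8),
]
for t in order_tables(results):
    print(f"{t.name}: [{t.rel_low}, {t.rel_high}]")
# growth_curves_B, then growth_curves_A, then old_survey_C
```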
6. Predictive Food Microbiology
This part is dedicated to a use case in the field of predictive food microbiology, namely the selection of reliable parameters for simulation models. We first give the criteria suited to this field, as well as the corresponding expert opinions and fuzzy sets. We then detail the use case query and results.
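Purely as an illustration (the criterion and the numbers below are hypothetical, not the expert opinions actually elicited for Sym’Previus), such an opinion could be encoded as a trapezoidal fuzzy set on a continuous reliability scale:

```python
# Sketch: encoding a hypothetical expert opinion about one criterion
# as a trapezoidal fuzzy set on a continuous reliability scale [0, 1].

def trapezoid(a, b, c, d):
    """Membership function: 0 outside (a, d), 1 on [b, c], linear ramps
    on (a, b) and (c, d)."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return mu

# Hypothetical opinion: "a fully described protocol suggests reliability
# roughly between 0.6 and 0.8, and at the outside between 0.5 and 0.9".
protocol_described = trapezoid(0.5, 0.6, 0.8, 0.9)
print(protocol_described(0.7), protocol_described(0.55))  # 1.0 0.5
```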
Flow chart
CONCLUSION
We proposed a generic method to evaluate the reliability of data automatically retrieved from the web or from electronic documents. Even if the method is generic, we were more specifically interested in scientific experimental data. The method evaluates data reliability from a set of commonsense criteria. It relies on the use of basic probability assignments and of the induced belief functions, since they offer a good compromise between flexibility and computational tractability. To handle conflicting information while keeping a maximal amount of it, the information merging follows a maximal coherent subset approach. Finally, reliability evaluations and the ordering of data tables are achieved by using lower/upper expectations, allowing us to reflect uncertainty in the evaluation. The result displayed to end users is an ordered list of tables, from the most to the least reliable ones, together with an interval-valued evaluation. We have demonstrated the applicability of the method by its integration in the @Web system and its use on the Sym’Previus data warehouse. As future work, we see two main possible evolutions:
· complementing the current method with useful additional features: the possibility to cope with multiple experts, with criteria of nonequal importance, and with uncertainly known criteria;
· combining the current approach with other notions or sources of information: relevance, in particular, appears to be equally important to characterize experimental data. Also, we may consider adding user feedback as an additional (and parallel) source of information about reliability or relevance, as is done in web applications.