Evaluating Data Reliability: An Evidential Answer with Application to a Web-Enabled Data Warehouse

Abstract

There are many available methods to integrate information source reliability in an uncertainty representation, but only a few works focus on the problem of evaluating this reliability. However, data reliability and confidence are essential components of a data warehousing system, as they influence subsequent retrieval and analysis. In this paper, we propose a generic method to assess data reliability from a set of criteria using the theory of belief functions. The method offers customizable criteria and provides insightful decisions. The chosen illustrative example comes from real-world data from the Sym’Previus data warehouse, which is oriented towards predictive microbiology.

Existing System

During collection, data reliability is mostly ensured by measurement device calibration, by an adapted experimental design, and by statistical repetition; these data are then used in further inferences. However, full traceability is no longer ensured when the data are reused at a later time by other scientists, so their reliability has to be estimated afterwards. This estimation is especially important in areas where data are scarce and difficult to obtain, as is the case, for example, in the Life Sciences. The growth of the web and the emergence of dedicated data warehouses offer great opportunities to collect additional data, be it to build models or to make decisions. The reliability of these data depends on many different aspects and pieces of meta-information: data source, experimental protocol, etc. Developing generic tools to evaluate this reliability therefore represents a true challenge for the proper use of distributed data.

Disadvantages
·         Different criteria may provide conflicting information about data reliability, and this conflict has to be handled.
·         Interval-valued evaluations based on the notions of lower and upper expectation are needed to numerically summarize the results, because of their capacity to reflect the imprecision remaining in the final knowledge.
·         The ordering of data into groups of decreasing reliability, and subsequently the presentation of informative results to end users, also has to be addressed.

Proposed System
When evaluating data reliability, it is natural to be interested in the reasons why some particular data were assessed as (un)reliable. We show how maximal coherent subsets of criteria, i.e., groups of agreeing criteria, may provide some insight into which reasons have led to a particular assessment. We then present an application of the method to @Web, a web-enabled data warehouse.

Indeed, the framework developed in this work was originally motivated by the need to estimate the reliability of scientific experimental results collected in open data warehouses. To lighten the burden laid upon domain experts when selecting data for a particular application, it is necessary to give them indicative reliability estimations. Formalizing reliability criteria will hopefully be a better asset for them to justify their choices and to capitalize knowledge than an ad hoc estimation. Tool development was carefully done using Semantic Web recommended languages, so that the created tools are generic and reusable in other data warehouses. This required an advanced design step, which is important to ensure modularity and to foresee future evolutions.

Advantages

·         The notion of trust only makes sense if the source can be suspected of lying in order to gain some advantage; it is therefore distinct from reliability.
·         A distinction is made between individual-level and system-level trust: the former concerns the trust one has in a particular agent, while the latter concerns the overall system and how it ensures that no one can take advantage of it.

Modules

o   Global Reliability Information
o   Maximal Coherent Subsets
o   Web-Enabled Data Warehouse
o   Web Presentation
o   Data Reliability Management
o   Predictive Food Microbiology

Module Description

1.      Global Reliability Information
Each criterion yields an evaluation of the reliability of a particular value, so that the S criteria provide S different fuzzy sets as pieces of information. We propose to use evidence theory to merge these pieces of information into a global representation. This choice is motivated by the richness of the merging rules it provides and by the good compromise it represents in terms of expressiveness and tractability; indeed, it encompasses fuzzy sets and probability distributions as particular representations.
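As a rough illustration of this merging step, the sketch below (Python) converts each normalized fuzzy set on a discretized reliability scale into a consonant mass function via its alpha-cuts and then combines the resulting mass functions with an unnormalized conjunctive rule. The scale, the fuzzy sets and the function names are illustrative assumptions, not part of the actual @Web implementation.

```python
from itertools import product

SCALE = (0.0, 0.25, 0.5, 0.75, 1.0)  # illustrative discretized reliability scale

def fuzzy_to_mass(membership):
    """Convert a normalized fuzzy set (possibility distribution) over SCALE
    into a consonant mass function: the alpha-cut at each membership level
    receives the mass equal to the drop to the next lower level."""
    levels = sorted({mu for mu in membership.values() if mu > 0}, reverse=True)
    levels.append(0.0)
    mass = {}
    for alpha, alpha_next in zip(levels, levels[1:]):
        cut = frozenset(v for v, mu in membership.items() if mu >= alpha)
        mass[cut] = mass.get(cut, 0.0) + (alpha - alpha_next)
    return mass

def conjunctive(m1, m2):
    """Unnormalized conjunctive combination of two mass functions; the mass
    assigned to the empty set measures the conflict between them."""
    out = {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        out[a & b] = out.get(a & b, 0.0) + wa * wb
    return out

# Two criteria giving fuzzy opinions on the reliability scale (made-up values).
high   = {0.0: 0.0, 0.25: 0.0, 0.5: 0.3, 0.75: 1.0, 1.0: 1.0}
medium = {0.0: 0.0, 0.25: 0.5, 0.5: 1.0, 0.75: 0.5, 1.0: 0.0}

merged = conjunctive(fuzzy_to_mass(high), fuzzy_to_mass(medium))
for focal, weight in merged.items():
    print(sorted(focal), round(weight, 3))
```

More than two criteria can be merged by folding the conjunctive rule over the list of mass functions; when the resulting conflict is too high, the maximal coherent subset strategy of the next module replaces the plain conjunction.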

2.      Maximal Coherent Subsets
    
To address the problem of conflicting information, we propose a merging strategy based on maximal coherent subsets (MCS). This notion was introduced by Rescher and Manor as a means to infer from inconsistent logic bases, and it can be easily extended to the case of quantitative uncertainty representations. Given a set of conflicting sources, MCS consists in applying a conjunctive operator within each nonconflicting (maximal) subset of sources, and then using a disjunctive operator between the partial results. With such a method, as much precision as possible is gained while no source is neglected, an attractive feature in information fusion. In general, detecting maximal coherent subsets is NP-hard; however, in some particular cases this complexity can be significantly reduced.
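For a rough idea of how this works in one of the tractable cases, the following sketch (Python, with hypothetical criterion names and interval values) finds the maximal coherent subsets of interval-valued evaluations on a one-dimensional reliability scale by a single sweep over sorted endpoints, then applies the conjunction (intersection) inside each subset; the disjunction between subsets is simply the union of the returned intervals.

```python
def maximal_coherent_subsets(intervals):
    """intervals: dict name -> (low, high) on the reliability scale.
    Returns the maximal groups of intervals with a nonempty common
    intersection, found by sweeping the sorted endpoints (O(n log n))."""
    events = []
    for name, (low, high) in intervals.items():
        events.append((low, 0, name))   # 0 = interval opens
        events.append((high, 1, name))  # 1 = interval closes
    events.sort(key=lambda e: (e[0], e[1]))  # openings first on ties
    active, groups, just_opened = set(), [], False
    for _, kind, name in events:
        if kind == 0:
            active.add(name)
            just_opened = True
        else:
            if just_opened:                 # first closing after some openings:
                groups.append(set(active))  # the current active set is maximal
            active.discard(name)
            just_opened = False
    return groups

def mcs_fuse(intervals):
    """Intersection inside each maximal coherent subset; the final result is
    the union (disjunction) of the returned partial intervals."""
    return [(max(intervals[n][0] for n in group),
             min(intervals[n][1] for n in group))
            for group in maximal_coherent_subsets(intervals)]

# Hypothetical interval evaluations given by four criteria.
criteria = {"source": (0.6, 0.9), "protocol": (0.5, 0.8),
            "repetitions": (0.2, 0.55), "strain": (0.1, 0.3)}
print(maximal_coherent_subsets(criteria))  # e.g. [{'strain', 'repetitions'}, ...]
print(mcs_fuse(criteria))                  # [(0.2, 0.3), (0.5, 0.55), (0.6, 0.8)]
```

Each returned group also tells the end user which criteria agree with each other, which is precisely the kind of insight mentioned in the Proposed System section.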

3.      Web-Enabled Data Warehouse
   
We present an application of the method to @Web, a web-enabled data warehouse. Indeed, the framework developed in this work was originally motivated by the need to estimate the reliability of scientific experimental results collected in open data warehouses. To lighten the burden laid upon domain experts when selecting data for a particular application, it is necessary to give them indicative reliability estimations. Formalizing reliability criteria will hopefully be a better asset for them to justify their choices and to capitalize knowledge than an ad hoc estimation.

4.      Web Presentation
@Web is a data warehouse opened on the web. Its current version is centered on the integration of heterogeneous data tables extracted from web documents. The focus has been put on web tables for two reasons: (1) experimental data are often summarized in tables, and (2) such data are already structured and therefore easier to integrate into a data warehouse than, e.g., text or graphics.

5.      Data Reliability Management
   
This module presents the @Web extension that attaches a reliability estimation to each data table, so that the results of a user query can be displayed ordered by decreasing reliability. Even if a data table can include several items, the table level has been retained, because the data of a given table usually come from the same experimental setup and therefore share the same reliability criteria.
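The interval-valued evaluation and the ordering can be sketched as follows (Python; the table identifiers, mass functions, and the midpoint-based total order are illustrative assumptions, and the actual @Web ranking may group tables differently): the lower and upper expectations of each table's merged mass function give its reliability interval, and tables are then listed from most to least reliable.

```python
def expectation_bounds(mass):
    """mass: dict mapping frozensets of numeric reliability values to weights.
    Returns the (lower, upper) expectations of the induced belief function;
    mass on the empty set (conflict) is simply ignored here."""
    lower = sum(w * min(focal) for focal, w in mass.items() if focal)
    upper = sum(w * max(focal) for focal, w in mass.items() if focal)
    return lower, upper

def rank_tables(table_masses):
    """table_masses: dict table_id -> merged mass function.  Returns the tables
    from most to least reliable, here using the interval midpoint as one simple
    way of turning interval evaluations into a total order."""
    bounds = {t: expectation_bounds(m) for t, m in table_masses.items()}
    order = sorted(bounds, key=lambda t: sum(bounds[t]) / 2, reverse=True)
    return [(t, bounds[t]) for t in order]

# Toy example: two tables with already-merged mass functions.
tables = {
    "table_12": {frozenset({0.75, 1.0}): 0.6, frozenset({0.5, 0.75, 1.0}): 0.4},
    "table_07": {frozenset({0.25, 0.5}): 0.7, frozenset({0.0, 0.25, 0.5, 0.75}): 0.3},
}
for table_id, (low, up) in rank_tables(tables):
    print(f"{table_id}: [{low:.2f}, {up:.2f}]")   # table_12: [0.65, 1.00], then table_07
```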

6.      Predictive Food Microbiology

This part is dedicated to a use case in the field of predictive food microbiology, namely the selection of reliable parameters for simulation models. We first give the criteria suited to this field, as well as the corresponding expert opinions and fuzzy sets. We then detail the use case query and its results.
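As a purely hypothetical illustration of such encodings (the criterion names, scale, and numeric values below are made up, not those elicited from the Sym’Previus experts), the following sketch builds trapezoidal fuzzy sets on a discretized reliability scale for two criteria of a retrieved data table; these per-criterion fuzzy sets would then feed the merging step described in the Global Reliability Information module.

```python
def trapezoid(a, b, c, d):
    """Membership function of a trapezoidal fuzzy set: 0 outside [a, d],
    1 on the plateau [b, c], linear in between."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return (x - a) / (b - a)
        if c < x < d:
            return (d - x) / (d - c)
        return 0.0
    return mu

SCALE = [i / 10 for i in range(11)]  # illustrative discretized reliability scale

# Hypothetical expert encodings for one data table: a well-described protocol
# suggests rather high reliability, an unidentified bacterial strain pulls it down.
protocol_described = trapezoid(0.5, 0.7, 1.0, 1.0)
strain_identified  = trapezoid(0.0, 0.0, 0.3, 0.6)

fuzzy_sets = {
    "protocol described": {x: round(protocol_described(x), 2) for x in SCALE},
    "strain identified":  {x: round(strain_identified(x), 2) for x in SCALE},
}
print(fuzzy_sets["protocol described"])
print(fuzzy_sets["strain identified"])
```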

Flow chart

CONCLUSION

We proposed a generic method to evaluate the reliability of data automatically retrieved from the web or from electronic documents. Even if the method is generic, we were more specifically interested in scientific experimental data. The method evaluates data reliability from a set of common sense criteria. It relies on the use of basic probability assignments and of the induced belief functions, since they offer a good compromise between flexibility and computational tractability. To handle conflicting information while keeping a maximal amount of it, the information merging follows a maximal coherent subset approach. Finally, reliability evaluations and the ordering of data tables are achieved by using lower/upper expectations, allowing us to reflect uncertainty in the evaluation. The result displayed to end users is an ordered list of tables, from the most to the least reliable one, together with an interval-valued evaluation. We have demonstrated the applicability of the method by its integration in the @Web system and its use on the Sym’Previus data warehouse.

As future work, we see two main possible evolutions: (1) complementing the current method with useful additional features, such as the possibility to cope with multiple experts, with criteria of unequal importance, and with uncertainly known criteria; (2) combining the current approach with other notions or sources of information: relevance, in particular, appears to be equally important to characterize experimental data. We may also consider adding user feedback as an additional (and parallel) source of information about reliability or relevance, as is done in many web applications.
