Big Data Quality Panel : Diachron Workshop @EDBT

My contribution to the Diachron workshop at the EDBT'16 conference.


  1. Big Data Quality Panel, Diachron Workshop @EDBT
     Panta Rhei (Heraclitus, through Plato)
     Paolo Missier, Newcastle University, UK
     Bordeaux, March 2016
     (*) Painting by Johannes Moreelse
     P.Missier-2016 Diachronworkshoppanel
  2. The “curse” of Data and Information Quality
     • Quality requirements are often specific to the application that makes use of the data (“fitness for purpose”)
     • Quality Assurance actions (those required to meet the requirements) are specific to the data types
     There are a few generic quality techniques (linkage, blocking, …), but most solutions are ad hoc.
  3. V for “Veracity”?
     Q3. To what extent are traditional approaches to diagnosis, prevention and curation challenged by the Volume, Variety and Velocity characteristics of Big Data?
     High Volume
     • Issues: Scalability (what kinds of QC step can be parallelised?); human curation not feasible
     • Example: Parallel meta-blocking
     High Velocity
     • Issues: Statistics-based diagnosis, data-type specific; human curation not feasible
     • Example: Reliability of sensor readings
     High Variety
     • Issues: Heterogeneity is not a new issue!
     • Example: Data fusion for decision making
     Recent contributions on Quality & Big Data (IEEE Big Data 2015):
     • Chung-Yi Li et al., Recommending missing sensor values
     • Yang Wang and Kwan-Liu Ma, Revealing the fog-of-war: A visualization-directed, uncertainty-aware approach for exploring high-dimensional data
     • S. Bonner et al., Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain
     • V. Efthymiou, K. Stefanidis and V. Christophides, Big data entity resolution: From highly to somehow similar entity descriptions in the Web
     • V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis and T. Palpanas, Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data
  4. Can we ignore quality issues?
     Q4: How difficult is the evaluation of the threshold under which data quality can be ignored?
     • Some analytics algorithms may be tolerant to {outliers, missing values, implausible values} in the input
     • But this “meta-knowledge” is specific to each algorithm, so it is hard to derive general models
     • i.e. the importance and danger of false positives (FP) and false negatives (FN)
     A possible incremental learning approach:
     • Build a database of past analytics tasks: H = {<In, P, Out>}
     • Try to learn (In, Out) correlations over a growing collection H
  5. Data to Knowledge
     The Data-to-Knowledge pattern of the Knowledge Economy:
     [Diagram: Big Data feeds The Big Analytics Machine, which produces “Valuable Knowledge”. The machine draws on Meta-knowledge: Algorithms, Tools, Middleware, Reference datasets.]
  6. The missing element: time
     [Diagram: the same pattern (Big Data, The Big Analytics Machine, “Valuable Knowledge”) with time axes (t) on Big Data, on Meta-knowledge (Algorithms, Tools, Middleware, Reference datasets) and on the resulting knowledge, which appears as versions V1, V2, V3.]
     Change → data currency
  7. The ReComp decision support system
     Observe change
     • In big data
     • In meta-knowledge
     Assess and measure
     • Knowledge decay
     Estimate
     • Cost and benefits of refresh
     Enact
     • Reproduce (analytics) processes
     Currency of data and of meta-knowledge:
     - What knowledge should be refreshed?
     - When, how?
     - Cost / benefits
  8. ReComp: 2016-18
     [Diagram labels]
     • ReComp DSS: Observe change, Assess and measure, Estimate, Enact
     • Change Events, Diff(.,.) functions, “business rules”
     • Prioritised KAs, Cost estimates, Reproducibility assessment
     • History DB: past KAs and their metadata → provenance
     • META-K
     KA: Knowledge Assets
  9. Metadata + Analytics
     The knowledge is in the metadata!
     Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items, which describe details of past computations.
     [Diagram labels]
     • Tasks: identify recomp candidates, large-scale recomp, estimate change impact, estimate reproducibility cost/effort
     • Change Events, Change Impact Model, Cost Model, Model updates
     • Meta-K: Logs, Provenance, Dependencies
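
The "identify recomp candidates" and "estimate change impact" steps on the last slides could be sketched roughly as follows. This is only an illustration, not the ReComp implementation: the record schema, the asset and resource names, the toy impact model (fraction of dependencies that changed) and the ranking rule (impact per unit cost) are all assumptions made here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastComputation:
    """One metadata record in the history DB (hypothetical schema)."""
    asset: str             # name of the knowledge asset (KA) that was produced
    depends_on: frozenset  # data, tools and reference datasets it used
    cost: float            # assumed cost of re-running it (arbitrary units, > 0)


def estimated_impact(record, changed):
    """Toy change-impact model: fraction of the record's dependencies that changed."""
    if not record.depends_on:
        return 0.0
    return len(record.depends_on & changed) / len(record.depends_on)


def recomp_candidates(history, changed):
    """Return the assets touched by a change event, highest impact per unit cost first."""
    affected = [r for r in history if estimated_impact(r, changed) > 0]
    affected.sort(key=lambda r: estimated_impact(r, changed) / r.cost, reverse=True)
    return [r.asset for r in affected]


# Hypothetical history: three past computations and their dependencies.
history = [
    PastComputation("KA1", frozenset({"ref_db", "tool_x"}), cost=10.0),
    PastComputation("KA2", frozenset({"sensor_feed"}), cost=2.0),
    PastComputation("KA3", frozenset({"ref_db", "dataset_a", "tool_y", "tool_x"}), cost=1.0),
]

# Change event: the reference dataset "ref_db" has a new version.
print(recomp_candidates(history, changed=frozenset({"ref_db"})))
```

With these made-up numbers, KA3 is ranked before KA1 because it is far cheaper to recompute, and KA2 is not a candidate because it does not depend on the changed resource. A real change impact model and cost model would presumably be learned from the history, as the "Model updates" labels suggest.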