Data Normalization and Alignment in Heterogeneous Data Sets
DataCards
1. Data Normalization and Alignment
Tales from the Data Crypt
WWW.DATA–TACTICS.COM © 2012 Data Tactics ARCHITECT – ENGINEER – INTEGRATE – SOLUTIONS
2. Data – The Hard Part
3. Data Exploitation
Information (Data) = Access + Understanding + Normalization
Raw Data (ISOs, partitions, EnCase) → Recovery → Usable Data (Structured Tabular, Functional Databases) → Interpretation → Exploitable Information → Analysis → Knowledge
4. What Day is it?
• Problem:
  – 12/1/2002 – Dec 1st or Jan 12th?
  – 10K+ spreadsheets
  – Date column with a wide mix of formats
• Approach:
  – Define some rules for a best guess
  – Apache POI to access Excel data
  – Use Java Date routines to attempt to parse the data
  – Use statistical analysis to determine the most used formats
  – Look for nonsensical dates (e.g. months > 12 or years out of range)
  – Last-ditch heuristic: the date column appeared to be basically in date order, so look to nearby rows to determine the likely value
• Result: 95%+ population of the date field.
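The format-voting step can be sketched as follows. This is a minimal illustration in Python rather than the Java/Apache POI pipeline the slide describes. The candidate format list, the plausible-year window, and the sample column are assumptions made up for the example. The neighbouring-row heuristic is not implemented.

```python
from collections import Counter
from datetime import datetime

# Assumed candidate formats; the slides do not list the formats actually used.
CANDIDATE_FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d", "%d %b %Y"]
YEAR_RANGE = (1990, 2012)  # hypothetical "year in range" rule


def parses(value, fmt):
    """Return the parsed date if value fits fmt and the year is plausible, else None."""
    try:
        parsed = datetime.strptime(value.strip(), fmt)
    except ValueError:  # also covers nonsensical values such as month 13
        return None
    if not YEAR_RANGE[0] <= parsed.year <= YEAR_RANGE[1]:
        return None
    return parsed


def rank_formats(values):
    """Order the candidate formats by how many cells each one can parse."""
    counts = Counter()
    for value in values:
        for fmt in CANDIDATE_FORMATS:
            if parses(value, fmt) is not None:
                counts[fmt] += 1
    return [fmt for fmt, _ in counts.most_common()]


def parse_column(values):
    """Parse each cell with the most widely used format that accepts it."""
    ranking = rank_formats(values)
    result = []
    for value in values:
        parsed = None
        for fmt in ranking:
            parsed = parses(value, fmt)
            if parsed is not None:
                break
        result.append(parsed)
    return result


# Hypothetical column: rows 2 and 3 can only be month-first, which settles row 1.
column = ["12/1/2002", "12/13/2002", "12/25/2002", "1/7/2003", "not a date"]
dates = parse_column(column)
print(rank_formats(column)[0], dates[0])
```

Voting over the whole column lets the unambiguous cells decide how the ambiguous ones are read, which is the idea behind "determine the most used formats". Cells that no format accepts stay empty and are left for the row-order heuristic.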
5. Where is it?
• Problem:
  – Databases exported to spreadsheets (two data sets: 119K+ and 18K+)
  – Multiple coordinate fields (Lat/Lon, MGRS)
  – Some records with multiple locations
  – No format checking on field entries
• Examples
  – WELL AND WATER SYSTEM: 41R QQ 30961 96855 SEPTIC TANK: 41R QQ 30946 96869
  – 41R QQ 30990 90370, 41R QQ 31005 90337, 41R QQ 31017 90341, 41R QQ 30998 90378
  – GR 41R QQ 31 93
  – 41R QQ 32123 96814 41R QQ 32003 97004 41R QQ 32053 97204
  – GR 41R QQ 3238 9227 TO GR 41 R QQ 3238 9229. CULVERT GR 41R QQ 3250 9238
6. Where is it?
• Approaches
  – Use NGA's GEOTRANS software for conversions
  – Apply rules to determine canonical location:
    • If Lat/Lon is present and parsable, use that
    • If MGRS is parsable, use that
    • Use crude NLP (regex) to extract candidate MGRS coordinates
    • Use the 1st valid coordinate found
    • Check validity against Province/District bounds
  – Supplement with an intern
    • Faster, more accurate
    • Leveraged the power of Excel
  – Considered: implementing multi-point objects for rows.
• Results:
  – 119K row data set: 86.5% bad -> 82.0% bad (4.5 percentage point improvement)
  – 18K row data set: 62.4% bad -> 7.5% bad (54.9 percentage point improvement)
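The "crude NLP (regex)" rule might look like the sketch below, tried against the free-text examples from the previous slide. The pattern and the function names `extract_mgrs` and `first_mgrs` are assumptions for illustration, not the expression the authors used. The sketch only finds well-formed candidates. Conversion (GEOTRANS) and the Province/District bounds check are out of scope.

```python
import re

# Illustrative pattern, not the authors' regex. I and O are never used in MGRS letters.
MGRS_PATTERN = re.compile(
    r"\b(\d{1,2})\s?([C-HJ-NP-X])\s+"   # grid zone number plus latitude band
    r"([A-HJ-NP-Z][A-HJ-NP-V])\s+"      # 100 km square identifier
    r"(\d{1,5})\s+(\d{1,5})\b"          # easting and northing digits
)


def extract_mgrs(text):
    """Return all well-formed MGRS candidates in text, normalised as '41R QQ e n'."""
    candidates = []
    for match in MGRS_PATTERN.finditer(text):
        zone, band, square, easting, northing = match.groups()
        if not 1 <= int(zone) <= 60:
            continue  # UTM zones run from 1 to 60
        if len(easting) != len(northing):
            continue  # both halves must have the same precision
        candidates.append(f"{zone}{band} {square} {easting} {northing}")
    return candidates


def first_mgrs(text):
    """Apply the 'use the 1st valid coordinate found' rule."""
    found = extract_mgrs(text)
    return found[0] if found else None


cell = "GR 41R QQ 3238 9227 TO GR 41 R QQ 3238 9229. CULVERT GR 41R QQ 3250 9238"
print(extract_mgrs(cell))
print(first_mgrs("WELL AND WATER SYSTEM: 41R QQ 30961 96855 SEPTIC TANK: 41R QQ 30946 96869"))
```

Taking the first candidate discards the other points in multi-location rows, which is the trade-off the "multi-point objects" option on the slide would have avoided.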
7. Not all districts are equal
• Problem:
  – Multiple data sets spanning 8 yrs
    • Spreadsheets, DBMSs
  – Looking to do analysis/stats by Province/District
  – Common enumeration problems
    • Multiple spellings/transliterations
    • Punctuation
    • Strange formatting (alternate names in parens)
  – District names/boundaries changed multiple times over the data span
• Examples:
  – Eshkashim vs Ishkashiem, Dehdadi vs Dihdadi
  – Pul-i-khumri vs Puli Khumri
  – Darwaz-i-bala (nesay) vs Darwazbala
8. Not all districts are equal
• Approaches:
  – Find a current canonical set of district names and boundaries
  – Fix provinces 1st: manually, using Excel
  – If a geocoord is present, use that
  – Check names both inside and outside parens
  – Look at two phonetic encodings: Soundex and Double Metaphone
  – Look at lexical distance
• Results
  – Soundex could help resolve partial Double Metaphone matches
  – A small lexical distance is a good indicator of a match, but a large distance does not rule one out
    • Estalef vs Istalif (2)
    • Shigal Wa Sheltan vs Shaygal wa shital (7)
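A small sketch of the matching ideas above: normalisation that ignores punctuation and parenthesised alternates, American Soundex, and Levenshtein distance as the lexical distance. Double Metaphone is omitted because it is too long to reproduce here. The `CANONICAL` list and the function names are hypothetical, and treating "lexical distance" as Levenshtein distance is an assumption, although it agrees with the Estalef/Istalif figure of 2.

```python
import re

SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                           "4": "L", "5": "MN", "6": "R"}.items()
    for letter in letters
}


def soundex(name):
    """American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets the previous code
    return (code + "000")[:4]


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def normalize(name):
    """Lower-case, drop parenthesised alternates, keep letters only."""
    outside = re.sub(r"\([^)]*\)", " ", name)
    return re.sub(r"[^a-z]", "", outside.lower())


def match_district(name, canonical_names, max_distance=2):
    """Return the closest canonical name within max_distance edits, else None."""
    target = normalize(name)
    best = min(canonical_names, key=lambda c: levenshtein(target, normalize(c)))
    return best if levenshtein(target, normalize(best)) <= max_distance else None


CANONICAL = ["Eshkashim", "Dehdadi", "Puli Khumri", "Istalif"]  # hypothetical list
print(match_district("Ishkashiem", CANONICAL), soundex("Dihdadi"), levenshtein("Estalef", "Istalif"))
```

Plain Soundex keeps the first letter, so it cannot reconcile Eshkashim with Ishkashiem on its own. That is one reason to combine it with a second phonetic code and an edit-distance check, as the slide does.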
9. Not all districts are equal
• Representative Data Set Statistics
  – No geocoords available in data set.
  – 59.5K+ rows

After Test                 Correct (%)   Incorrect (%)
Canonical Lookup           47.48%        52.52%
Double Metaphone/Soundex   92.12%        7.88%
Lexical Dist <=2           94.01%        5.99%

After Test                 Unique Misses   % Unique Fixed
Canonical Lookup           234
Double Metaphone/Soundex   42              82.05%
Lexical Dist <=2           33              21.43%
10. Would a Babblefish even help?
• Problem:
  – Foreign language SMS messages
  – Short, not always well-structured sentences
  – Afghanistan is a polyglot of languages:
    • Pashto
    • Dari/Urdu
    • Farsi
    • Arabic
  – Add to the problem
    • Abbreviations
    • Slang
• Approaches
  – Automated translation services (Google & others)
11. Would a Babblefish even help?

Language   Value
Original
Arabic     [mwn in him [stEEsw] [G[ lw[ w] [mqaamaatw] [th] Warsaw[ staasw] [d] [s[ wl] [dywaal] [jw[ y] his kindnesses
Farsi      The bottom in the Olympic Games described her years national
Pashto     We will voice higher education authorities to you to the wall thanks
Urdu       If you run school.

skwal : (page 610)
• Pl. skārah.
• skwal, s.m. (2nd) Shearing, clipping, cutting off wool, hair, nap, etc. by shears. Pl. skwalūnah. skwal kawul, verb trans. To shear, to clip. See
• sʿkawul, verb trans. (caus.) To cause to drink, imbibe, drink up, to water as a horse, cattle, etc. To draw out, to unsheath. See
12. DBMS Data Recovery
• Problem:
  – The data is a database that must be reconstructed
• Primary Issues:
  – Media recovery (unusual volume or partition schemes or formats)
  – Interrogation of backups to determine platform, version, and backup or export flavor
  – Establish a database server for the correct database platform and version, accounting for database physical layout and sizing
  – Character set encoding
  – Database administration for performance tuning or version upgrades to enable advanced features
13. DBMS Interpretation
• Problems:
  – What is the data structure, and what do the components mean?
• Primary Issues:
  – Determine database schema entry points
    • SME knowledge necessary for denormalized data projections
    • Primary key and foreign key recovery
  – Examine metadata for data type distribution and possible embedded structure
    • E.g., XML nested in CLOBs
  – Data statistics and quality metrics: data size, density
    • Focus on populated data structures
  – Temporal data analysis: time hack distribution for all date/time cells
14. Scale of the structure problem
• Example:
  – Undocumented schemas with over 7,000 tables and 160,000 columns, with minimal foreign key relationship definitions (the relationships between tables are not defined) and over 1 billion data rows
  – Only approximately 70% of primary keys for tables are defined
• Primary Issues:
  – Need to reverse engineer missing primary keys and foreign keys, which represent a portion of SME knowledge of the data structures
  – Implement algorithms to extract missing foreign key relationships within each schema
    • http://liris.cnrs.fr/Documents/Liris-3034.pdf
    • http://www.comp.nus.edu.sg/~zmeihui/vldb10.pdf
    • http://www.cs.toronto.edu/dcs/theses/MSc/2002-03/Vilarem.msc.pdf
    • http://webdb09.cse.buffalo.edu/papers/Paper30/rostin_et_al_final.pdf
  – Complicating Matters
    • Artificial/pseudo keys (e.g. one-up numbers)
    • Compound keys (Column A + Column B = Column C)
15. Primary Key Discovery
Vilarem approximation for primary key discovery. Complexity is given by the equation:
Complexity(ExtractKeys) = O(nKeyCands × p log p)
where
  nKeyCands = number of key candidates,
  p = number of tuples (rows),
and the number of key candidates is dependent upon the number of columns for a given table.
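To make the cost terms concrete, here is a brute-force sketch of key discovery: enumerate column combinations (the key candidates) and test each one for uniqueness over all rows. It is not the algorithm from the cited thesis. The function name, the `max_width` cap, and the sample rows are assumptions for illustration.

```python
from itertools import combinations


def candidate_keys(rows, columns, max_width=2):
    """Return minimal column combinations that are unique and non-null over rows."""
    keys = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            if any(set(key) <= set(combo) for key in keys):
                continue  # a subset is already a key, so combo is not minimal
            values = [tuple(row[c] for c in combo) for row in rows]
            if any(v is None for value in values for v in value):
                continue  # key columns may not contain nulls
            if len(set(values)) == len(values):
                keys.append(combo)
    return keys


# Hypothetical table: row_id is a one-up number, (site, visit) is a compound key.
rows = [
    {"row_id": 1, "site": "x", "visit": "p"},
    {"row_id": 2, "site": "x", "visit": "q"},
    {"row_id": 3, "site": "y", "visit": "p"},
    {"row_id": 4, "site": "y", "visit": "q"},
]
print(candidate_keys(rows, ["row_id", "site", "visit"]))
```

Each uniqueness test here uses a hash set. A sort-based duplicate check over p rows would cost p log p per candidate, which matches the form of the bound above. The number of candidates grows quickly with the number of columns, hence the cap on combination width.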
16. Foreign Key Discovery
Foreign key discovery without pruning approximation. Complexity is given by the equation:
Complexity(ExtractUINDs) = O((nUindCands + nFKCands) × join(p))
where
  nUindCands = number of key-based unary inclusion dependency candidates,
  nKeyCands = number of key candidates,
  p = number of tuples (rows),
and the number of key candidates is dependent upon the number of columns for a given table.
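A key-based unary inclusion dependency holds when every non-null value of one column also appears in a candidate key column of some table. The sketch below tests that by set containment with no pruning, as a rough illustration only. The function name, the table layout, and the sample data are hypothetical.

```python
def unary_inclusion_dependencies(tables, key_columns):
    """Find (dependent column, key column) pairs where the dependent values are a subset.

    tables: {table_name: {column_name: [values]}}
    key_columns: list of (table_name, column_name) candidate keys
    """
    results = []
    for key_table, key_column in key_columns:
        key_values = set(tables[key_table][key_column])
        for table, columns in tables.items():
            for column, values in columns.items():
                if (table, column) == (key_table, key_column):
                    continue  # a column trivially includes itself
                non_null = {v for v in values if v is not None}
                if non_null and non_null <= key_values:
                    results.append(((table, column), (key_table, key_column)))
    return results


# Hypothetical two-table schema with an undeclared foreign key.
tables = {
    "district": {"district_id": [1, 2, 3], "name": ["a", "b", "c"]},
    "report": {"report_id": [10, 11, 12, 13], "district_id": [1, 1, 3, None]},
}
key_columns = [("district", "district_id"), ("report", "report_id")]
print(unary_inclusion_dependencies(tables, key_columns))
```

Containment is only evidence of a foreign key. One-up numeric keys in unrelated tables often overlap by accident, which is the complication with artificial keys noted on slide 14, so candidates still need ranking or SME review.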
17. Horizontal Data Integration
• Definition: Horizontal Integration
  – Multiple heterogeneous data resources become aligned in such a way that search and analysis procedures can be applied to their combined content as if they formed a single resource
• Challenges
  – Quantity and variety
    • Need to do justice to radical heterogeneity in the representation of data and semantics
  – Dynamic environments
    • Need agile support for retrieval, integration and enrichment of data
  – Emergence of new data resources
    • Need an agile, flexible, and incremental integration approach
18. Unified DataSpace + Semantic Enhancement
The Wild
• Data sources with rich data & semantic context locked in domain silos
• Data tightly coupled to data-models
• Data-models tightly coupled to storage models
Silos isolated by
• Implementation technology
• Storage structure
• Data representation
• Data modality
[Figure: three segments. Segment 3 - Model Description (Data Models, rich semantic context); Segment 2 - Data Description (Structured Data; Integration, Enrichment, Exploitation, Exploration across all sources); Segment 1 - Artifact Description (Unstructured Data, rich data context).]
19. Unified DataSpace
• Segment 0 is an artifact store (i.e., binary representation of artifacts).
• Segment 1 represents artifact semantics and includes artifact metadata and associations between the artifacts. Indexing of Segment 1 supports search on text content, geospatial, and artifact metadata.
• Segment 2 represents data and semantics of structured data elements extracted from artifacts. Indexing of Segment 2 supports search on properties of entities (e.g., Person, Location) based on their properties and relationships.
• Segment 3 represents data-models extracted from artifacts and models used for aligning, disambiguating, and enriching the elements of Segments 1 and 2.
[Figure: High-Level Conceptual Model of the DataSpace and Ingest/Extraction Flows. Labels: Segment 3 - Model Semantics (CONCEPT, CONCEPT_ASSOCIATION, PREDICATE, PREDICATE_ASSOCIATION); Segment 1 - Artifact Semantics and Segment 2 - Data Semantics (SOURCE, ARTIFACT, ARTIFACT_ASSOCIATION, TERM, STATEMENT); Segment 0 - Artifacts; Ingest and Extraction flows.]
20. Semantic Enhancement
• Requirements for Horizontal Integration
  – The ontologies must be linked together through logical definitions to form a single, nonredundant and consistently evolving integrated network
  – The ontologies must be capable of evolving in an agile fashion in response to new sorts of data and new analytical and warfighter needs
• Creating Ontology Modules
  – Incremental distributed ontology development
    • Based on Doctrine
    • Involves SMEs in label selection and definition
  – Ontology development rules and principles
    • A shared governance and change management process
    • A common ontology architecture incorporating a common, domain-neutral, upper-level ontology (BFO)
  – An ontology registry
  – A simple, repeatable process for ontology development
  – A process of intelligence data capture through 'annotation' or 'tagging' of source data artifacts
  – Feedback between ontology authors and users
• SE Architecture
  – The Upper Level Ontology (ULO) in the SE hierarchy must be maximally general (no overlap with domain ontologies)
  – The Mid-Level Ontologies (MLOs) introduce successively less general and more detailed representations of types which arise in successively narrower domains, until we reach the Lowest Level Ontologies (LLOs)
  – The LLOs are maximally specific representations of the entities in a particular one-dimensional domain
21. Modular Hierarchy
22. References
• Salmen, et al. Integration of Intelligence Data through Semantic Enhancement, STIDS 2011
  – Strategy for developing an SE suite of orthogonal reference ontology modules
• Smith, et al. Ontology for the Intelligence Analyst, CrossTalk: The Journal of Defense Software Engineering, November/December 2012, 18-25.
  – Shows how the SE approach provides immediate benefits to the intelligence analyst
• Smith, et al. Horizontal Integration of Warfighter Intelligence Data - A Shared Semantic Resource for the Intelligence Community
  – Describes a strategy that is being used for the horizontal integration of warfighter intelligence data within the framework of the US Army's Distributed Common Ground System Standard Cloud (DSC) initiative
  – Strategy rests on the development of a set of ontologies that are being incrementally applied to bring about what we call the 'semantic enhancement' of data models used within each intelligence discipline