BIG DATA
Definition


Big data is the term for a collection of data
sets so large and complex that it becomes
difficult to process using on-hand database
management tools or traditional data
processing applications.
How big?
ABC of BIG DATA


Analytics. This solution area focuses on providing efficient analytics for
extremely large datasets. Analytics is all about gaining insight, taking
advantage of the digital universe, and turning data into high-quality
information, providing deeper insights about the business to enable better
decisions.


Bandwidth. This solution area focuses on obtaining better performance
for very fast workloads. High-bandwidth applications include
high-performance computing (the ability to perform complex analyses at
extremely high speeds), high-performance video streaming for surveillance
and mission planning, and video editing and play-out in media and
entertainment.


Content. This solution area focuses on the need to provide boundless,
secure, scalable data storage. Content solutions must enable storing
virtually unlimited amounts of data, so that enterprises can store as much
data as they want, find it when they need it, and never lose it.
3 V’S of BIG DATA


Volume: Not only can each data source contain a huge volume of data,
but also the number of data sources, even for a single domain, has grown
to be in the tens of thousands.

Velocity: As a direct consequence of the rate at which data is being
collected and continuously made available, many of the data sources are
very dynamic.



Variety: Data sources (even in the same domain) are extremely
heterogeneous both at the schema level regarding how they structure their
data and at the instance level regarding how they describe the same
real-world entity, exhibiting considerable variety even for substantially
similar entities.
Examples












The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of
climate observations.
Big data analysis played a large role in Barack Obama's successful 2012
re-election campaign.
eBay.com uses two data warehouses at 7.5 petabytes and 40PB as well as a
40PB Hadoop cluster for search, consumer recommendations, and
merchandising.
Amazon.com handles millions of back-end operations every day, as well as
queries from more than half a million third-party sellers. The core technology
that keeps Amazon running is Linux-based and as of 2005 they had the world’s
three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Walmart handles more than 1 million customer transactions every hour, which
are imported into databases estimated to contain more than 2.5 petabytes (2560
terabytes) of data – the equivalent of 167 times the information contained in all
the books in the US Library of Congress.
Facebook handles 50 billion photos from its user base.
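The Walmart figures above can be checked with a little arithmetic. The sketch below assumes the binary convention (1 PB = 1024 TB), which is what the "2560 terabytes" figure implies, and assumes the hourly transaction rate holds around the clock; the function name is illustrative only.

```python
TB_PER_PB = 1024  # assumption: binary convention, 1 PB = 1024 TB


def petabytes_to_terabytes(petabytes):
    """Convert petabytes to terabytes using the binary convention."""
    return petabytes * TB_PER_PB


# Database size quoted on the slide.
walmart_tb = petabytes_to_terabytes(2.5)

# Lower bound on daily transactions, assuming the hourly rate is constant.
transactions_per_day = 1_000_000 * 24

print(walmart_tb, transactions_per_day)
```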
Big data Integration




A lot of data growth is happening around these so-called
unstructured data types. Big data integration is all about
automation of the collection, organization and analysis
of these data types.
The importance of big data integration has led to a
substantial amount of research over the past few years
on the topics of schema mapping, record linkage and
data fusion.
Structured data vs Unstructured data
Big data vs Traditional Data Integration


The number of data sources, even for a single
domain, has grown to be in the tens of thousands.

Many of the data sources are very dynamic, as a huge
amount of newly collected data is continuously made
available.

The data sources are extremely heterogeneous in their
structure, with considerable variety even for substantially
similar entities.

The data sources are of widely differing quality, with
significant differences in the coverage, accuracy and
timeliness of the data provided.
Schema Mapping
Schema mapping in a data integration system refers to
(i) creating a mediated (global) schema, and
(ii) identifying the mappings between the mediated (global)
schema and the local schemas of the data sources to
determine which (sets of) attributes contain the same
information.
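As a minimal sketch of the two steps, the snippet below hard-codes a small mediated schema and one mapping per source, then rewrites local records into the global form. The source names (`crm`, `webshop`), attribute names, and sample records are all hypothetical; real systems usually discover such mappings semi-automatically rather than writing them by hand.

```python
# Hypothetical mediated (global) schema; attribute names are illustrative only.
MEDIATED_SCHEMA = ("name", "email", "phone")

# Hypothetical mappings from each source's local attributes to the mediated schema.
MAPPINGS = {
    "crm": {"full_name": "name", "mail": "email", "tel": "phone"},
    "webshop": {"customer": "name", "email_address": "email"},
}


def to_mediated(source, record):
    """Rename a local record's attributes to the mediated schema.

    Mediated attributes the source does not provide are set to None;
    local attributes without a mapping are dropped.
    """
    mapping = MAPPINGS[source]
    result = {attribute: None for attribute in MEDIATED_SCHEMA}
    for local_attribute, value in record.items():
        global_attribute = mapping.get(local_attribute)
        if global_attribute is not None:
            result[global_attribute] = value
    return result


crm_row = {"full_name": "Ada Lovelace", "mail": "ada@example.com", "tel": "555-0100"}
shop_row = {"customer": "Ada Lovelace", "email_address": "ada@example.com", "basket": 3}

print(to_mediated("crm", crm_row))
print(to_mediated("webshop", shop_row))
```

Once both sources are expressed in the mediated schema, their records can be compared attribute by attribute, which is what the later linkage and fusion steps rely on.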

Example





Entities like people (customers, employees), companies
(the enterprise itself, competitors, partners, suppliers),
products (those owned by the enterprise and its
competitors)
Defined Relationships among these entities
Activities with one or more entities as actors and/or
subjects - Documents can represent these activities
Record Linkage




Record linkage (RL) refers to the task of
finding records in a data set that refer to the
same entity across different data sources (e.g., data
files, books, websites, databases).
Record linkage is necessary when joining data sets
based on entities that may or may not share a common
identifier (e.g., database key, URI, national identification
number), as may be the case due to differences in
record shape, storage location, and/or curator style or
preference.
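When no common identifier exists, one simple approach is to compare a descriptive field such as a name and link the records whose similarity exceeds a threshold. The sketch below uses Python's standard `difflib.SequenceMatcher` for the comparison; the threshold of 0.85, the record ids, and the sample names are assumptions chosen for illustration, not values from the source.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Return (left_id, right_id) pairs whose names are at least `threshold` similar."""
    links = []
    for left_id, left_name in left.items():
        for right_id, right_name in right.items():
            if similarity(left_name, right_name) >= threshold:
                links.append((left_id, right_id))
    return links


# Hypothetical records from two sources that share no identifier.
source_a = {"a1": "Jonathan  Smith", "a2": "Maria Garcia"}
source_b = {"b7": "jonathan smith", "b8": "Mario Rossi"}

print(link_records(source_a, source_b))
```

Note that this compares every record on the left with every record on the right, so the cost grows with the product of the two source sizes.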
Challenge in BDI






In BDI, (i) data sources tend to be heterogeneous in
their structure and many sources (e.g., tweets, blog
posts) provide unstructured data, and
(ii) data sources are dynamic and continuously evolving.
To address the volume dimension, new techniques have
been proposed to enable parallel record linkage using
MapReduce.
Adaptive blocking is another technique that has been used to
overcome this.
MapReduce






MapReduce is a programming model for processing
large data sets with a parallel, distributed algorithm on
a cluster.
The model is inspired by the map and reduce functions
commonly used in functional programming.
A MapReduce program is composed of
a Map() procedure that performs filtering and sorting
and a Reduce() procedure that performs a summary
operation.
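The Map() and Reduce() roles can be illustrated with the classic word count, simulated here in a single Python process. This is only a sketch of the programming model: the function names are illustrative, and a real framework such as Hadoop would run the map and reduce calls in parallel across a cluster and perform the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map(): emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce(): summarize all values for one key, here by summing the counts."""
    return key, sum(values)


documents = ["big data is big", "data integration"]

# Each document could be mapped on a different machine; here they run in sequence.
mapped = [pair for document in documents for pair in map_phase(document)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```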
Adaptive Blocking


Blocking methods alleviate this big data integration
problem by efficiently selecting approximately similar
object pairs for subsequent distance computations,
leaving out the remaining pairs as dissimilar.
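The idea of restricting comparisons to likely matches can be sketched with a fixed blocking key. Note that this shows plain key-based blocking, not adaptive blocking, which would learn the blocking function from data; the key (first three letters of the name) and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name):
    """Hypothetical fixed blocking key: the first three letters, lowercased."""
    return name.strip().lower()[:3]


def candidate_pairs(records):
    """Return only the record-id pairs that share a blocking key."""
    blocks = defaultdict(list)
    for record_id, name in records.items():
        blocks[blocking_key(name)].append(record_id)
    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = {
    1: "Smith, John",
    2: "Smith, Jon",
    3: "Garcia, Maria",
    4: "Smyth, John",
    5: "Garcia, M.",
}

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2

print(pairs, f"{len(pairs)} of {all_pairs} pairs compared")
```

With this key, "Smyth, John" lands in a different block from the two "Smith" records and is never compared with them. A hand-picked key trades missed matches for speed in this way, and that trade-off is the motivation for choosing the blocking function adaptively.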
Data fusion






Data fusion refers to resolving conflicts from different
sources and finding the truth that reflects the real world.
Its motivation is exactly the veracity of data: the Web has
made it easy to publish and spread false information across
multiple sources.
Data integration might be viewed as set combination
wherein the larger set is retained, whereas fusion is a
set reduction technique.
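One simple baseline for resolving conflicts is majority voting: for each attribute, keep the value that most sources agree on. The sketch below illustrates only that baseline, which is not necessarily the method the slides have in mind; the source names and flight details are made up. Voting treats all sources as equally reliable, so it fails when a false value is copied widely, which is the veracity problem described above.

```python
from collections import Counter


def fuse_by_majority(claims):
    """Pick, for each attribute, the value asserted by the most sources.

    `claims` maps a source name to its {attribute: value} record.
    Ties are broken in favour of the value encountered first.
    """
    votes = {}
    for record in claims.values():
        for attribute, value in record.items():
            votes.setdefault(attribute, Counter())[value] += 1
    return {
        attribute: counter.most_common(1)[0][0]
        for attribute, counter in votes.items()
    }


# Hypothetical conflicting claims about one flight from three sources.
claims = {
    "source_a": {"gate": "B12", "departure": "10:30"},
    "source_b": {"gate": "B12", "departure": "10:45"},
    "source_c": {"gate": "C3", "departure": "10:30"},
}

fused = fuse_by_majority(claims)
print(fused)
```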
Data fusion model







Level 0: Source Preprocessing
Level 1: Object Assessment
Level 2: Situation Assessment
Level 3: Impact Assessment 
Level 4: Process Refinement
Level 5: User Refinement 
Advantages








Real-time rerouting of transportation fleets based on
weather patterns
Customer sentiment analysis based on social postings
Targeted disease therapies based on genomic data
Allocation of disaster relief supplies based on mobile
and social messages from victims
Cars driving themselves.
Conclusion
This seminar gives a basic insight into what big data is
and reviews the state-of-the-art techniques for data
integration in addressing the new challenges raised by
Big Data, including volume and number of sources,
velocity, variety, and veracity. It also lists the
advantages of harnessing the potential of big data.
