BIG DATA
Definition


Big data is the term for a collection of data
sets so large and complex that it becomes
difficult to process using on-hand database
management tools or traditional data
processing applications.
How big?
ABC of BIG DATA


Analytics. This solution area focuses on providing efficient analytics for
extremely large datasets. Analytics is all about gaining insight, taking
advantage of the digital universe, and turning data into high-quality
information, providing deeper insights about the business to enable better
decisions.


Bandwidth. This solution area focuses on obtaining better performance
for very fast workloads. High-bandwidth applications include
high-performance computing (the ability to perform complex analyses at
extremely high speeds), high-performance video streaming for surveillance
and mission planning, and video editing and play-out in media and
entertainment.


Content. This solution area focuses on the need to provide boundless,
secure, scalable data storage. Content solutions must enable storing
virtually unlimited amounts of data, so that enterprises can store as much
data as they want, find it when they need it, and never lose it.
3 V’S of BIG DATA


Volume: Not only can each data source contain a huge volume of data,
but also the number of data sources, even for a single domain, has grown
to be in the tens of thousands.


Velocity: As a direct consequence of the rate at which data is being
collected and continuously made available, many of the data sources are
very dynamic.



Variety: Data sources (even in the same domain) are extremely
heterogeneous both at the schema level regarding how they structure their
data and at the instance level regarding how they describe the same
real-world entity, exhibiting considerable variety even for substantially
similar entities.
Examples












The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of
climate observations.
Big data analysis played a large role in Barack Obama's successful 2012
re-election campaign.
eBay.com uses two data warehouses at 7.5 petabytes and 40 PB as well as a
40 PB Hadoop cluster for search, consumer recommendations, and
merchandising (source: "Inside eBay’s 90PB data warehouse").
Amazon.com handles millions of back-end operations every day, as well as
queries from more than half a million third-party sellers. The core technology
that keeps Amazon running is Linux-based and as of 2005 they had the world’s
three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Walmart handles more than 1 million customer transactions every hour, which
are imported into databases estimated to contain more than 2.5 petabytes (2560
terabytes) of data, the equivalent of 167 times the information contained in all
the books in the US Library of Congress.
Facebook handles 50 billion photos from its user base.
Big data Integration




A lot of data growth is happening around so-called
unstructured data types. Big data integration is all about
automating the collection, organization and analysis
of these data types.
The importance of big data integration has led to a
substantial amount of research over the past few years
on the topics of schema mapping, record linkage and
data fusion.
Structured data vs Unstructured data
Big data vs Traditional Data Integration


The number of data sources, even for a single
domain, has grown to be in the tens of thousands.


Many of the data sources are very dynamic, as a huge
amount of newly collected data is continuously made
available.


The data sources are extremely heterogeneous in their
structure, with considerable variety even for substantially
similar entities.


The data sources are of widely differing qualities, with
significant differences in the coverage, accuracy and
timeliness of data provided.
Schema Mapping
Schema mapping in a data integration system refers to
(i) creating a mediated (global) schema, and
(ii) identifying the mappings between the mediated (global)
schema and the local schemas of the data sources to
determine which (sets of) attributes contain the same
information.
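
The two steps above can be illustrated with a minimal Python sketch. The mediated schema, the source names ("crm", "webshop") and every attribute name are hypothetical and chosen only for illustration; real systems usually derive the mappings semi-automatically rather than writing them by hand.

```python
# Step (i): a hypothetical mediated (global) schema.
MEDIATED_SCHEMA = ("name", "email", "city")

# Step (ii): hypothetical mappings from each local schema to the mediated one.
MAPPINGS = {
    "crm": {"full_name": "name", "mail": "email", "town": "city"},
    "webshop": {"customer": "name", "email_address": "email", "city": "city"},
}


def to_mediated(source, record):
    """Rename the attributes of a local record to the mediated schema."""
    mapping = MAPPINGS[source]
    translated = {
        mapping[attr]: value for attr, value in record.items() if attr in mapping
    }
    # Attributes the source does not provide are filled with None.
    return {attr: translated.get(attr) for attr in MEDIATED_SCHEMA}


if __name__ == "__main__":
    print(to_mediated("crm", {"full_name": "Ann Lee", "mail": "ann@example.com"}))
    print(to_mediated("webshop", {"customer": "Ann Lee", "city": "Oslo"}))
```

Once every source is translated this way, records from different sources can be compared attribute by attribute.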

Example





Entities like people (customers, employees), companies
(the enterprise itself, competitors, partners, suppliers),
products (those owned by the enterprise and its
competitors)
Defined Relationships among these entities
Activities with one or more entities as actors and/or
subjects - Documents can represent these activities
Record Linkage




Record linkage (RL) refers to the task of
finding records that refer to the
same entity across different data sources (e.g., data
files, books, websites, databases).
Record linkage is necessary when joining data sets
based on entities that may or may not share a common
identifier (e.g., database key, URI, national identification
number), as may be the case due to differences in
record shape, storage location, and/or curator style or
preference.
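
As a rough sketch of linking records that share no common identifier, the Python below compares names with the standard library's difflib.SequenceMatcher. The record layout (dictionaries with "id" and "name"), the helper names and the 0.85 threshold are assumptions made for this example, not part of any particular record linkage system.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Return (left_id, right_id) pairs whose names are similar enough.

    The threshold is an illustrative assumption; in practice it is tuned
    on labelled data.
    """
    links = []
    for l_rec in left:
        for r_rec in right:
            if similarity(l_rec["name"], r_rec["name"]) >= threshold:
                links.append((l_rec["id"], r_rec["id"]))
    return links


if __name__ == "__main__":
    source_a = [{"id": "a1", "name": "John Smith"}, {"id": "a2", "name": "Mary Jones"}]
    source_b = [{"id": "b1", "name": "jon  smith"}, {"id": "b2", "name": "Peter Brown"}]
    print(link_records(source_a, source_b))
```

Note that this compares every record on one side with every record on the other, which is the quadratic cost that becomes a problem at big data scale.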
Challenge in BDI






In BDI, (i) data sources tend to be heterogeneous in
their structure and many sources (e.g., tweets, blog
posts) provide unstructured data, and
(ii) data sources are dynamic and continuously evolving.
To address the volume dimension, new techniques have
been proposed to enable parallel record linkage using
MapReduce.
Adaptive blocking is another technique that has been used to
overcome this.
MapReduce






MapReduce is a programming model for processing
large data sets with a parallel, distributed algorithm on
a cluster.
The model is inspired by the map and reduce functions
commonly used in functional programming.
A MapReduce program is composed of
a Map() procedure that performs filtering and sorting
and a Reduce() procedure that performs a summary
operation.
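
The Map() and Reduce() roles can be sketched with a word count that runs in a single Python process. This only imitates the programming model; a real framework such as Hadoop distributes the map, shuffle and reduce steps across a cluster. The function names map_phase, shuffle and reduce_phase are chosen for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize the values of one key, here by summing them."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big deal"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can run in parallel on different machines.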
Adaptive Blocking


Blocking methods alleviate this big data integration
problem by efficiently selecting approximately similar
object pairs for subsequent distance computations,
leaving out the remaining pairs as dissimilar.
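
A minimal sketch of the blocking idea follows. It uses a fixed, hand-written blocking key (the first three letters of the name), which is standard key-based blocking rather than adaptive blocking; adaptive methods learn the blocking function from training data. The record layout and function names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first three letters of the name, spaces removed."""
    return record["name"].lower().replace(" ", "")[:3]


def candidate_pairs(records):
    """Return id pairs that share a block; all other pairs are never compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


if __name__ == "__main__":
    records = [
        {"id": 1, "name": "John Smith"},
        {"id": 2, "name": "Johnny Smith"},
        {"id": 3, "name": "Mary Jones"},
        {"id": 4, "name": "Maria Jones"},
    ]
    # Four records give six possible pairs; blocking keeps only two of them.
    print(candidate_pairs(records))
```

Only the surviving candidate pairs are passed on to the more expensive distance computation.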
Data fusion






Data fusion refers to resolving conflicts from different
sources and finding the truth that reflects the real world.
Its motivation is exactly the veracity of data: the Web has
made it easy to publish and spread false information across
multiple sources.
Data integration might be viewed as set combination
wherein the larger set is retained, whereas fusion is a
set reduction technique.
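
One simple way to resolve conflicting values is majority voting, sketched below in Python. This is only one basic fusion strategy, chosen here for illustration; it assumes sources are independent and equally reliable, which more advanced fusion methods do not. The function name fuse and the source names are hypothetical.

```python
from collections import Counter


def fuse(claims):
    """Return the value claimed by the most sources.

    `claims` maps a source name to the value that source reports.
    If several values tie, the one reported first wins.
    """
    counts = Counter(claims.values())
    value, _ = counts.most_common(1)[0]
    return value


if __name__ == "__main__":
    capital_claims = {"site_a": "Paris", "site_b": "Paris", "site_c": "Lyon"}
    print(fuse(capital_claims))
```

The three conflicting claims are reduced to a single value, which matches the view of fusion as a set reduction technique.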
Data fusion model







Level 0: Source Preprocessing
Level 1: Object Assessment
Level 2: Situation Assessment
Level 3: Impact Assessment
Level 4: Process Refinement
Level 5: User Refinement
Advantages








Real-time rerouting of transportation fleets based on
weather patterns
Customer sentiment analysis based on social postings
Targeted disease therapies based on genomic data
Allocation of disaster relief supplies based on mobile
and social messages from victims
Cars driving themselves.
Conclusion
This seminar gives a basic insight into what big data is
and reviews the state-of-the-art techniques for data
integration in addressing the new challenges raised by
Big Data, including volume and number of sources,
velocity, variety, and veracity. It also lists the
advantages of harnessing the potential of big data.
