Data Science
Data Science
• Overview of Data Science
• Definition of Data and Information
• Data Types and Representation
• Data Value Chain
• Data Acquisition
• Data Analysis
• Data Curating
• Data Storage
• Data Usage
• Basic Concepts of Big Data
Overview of Data Science
• Data science is the practice of mining large sets of raw data, both structured and unstructured, to identify patterns and extract actionable insights.
• Data science deals with vast volumes of data, using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions.
• Data science blends several fields, including probability, statistics, programming, analysis, and cloud computing.
• In short, data science is the extraction of actionable insights from raw data.
Data vs. Information
Data
• Raw facts, figures and statistics
• No contextual meaning
• Can take the form of characters, numbers, images, or words
Information
• Processed / organized data
• Has exact meaning and an organized context
• Organized and presented in context, with value added to the data
Data + Context + Processing → Information
Example of adding context: 100 → 100 Miles → "100 Miles is a far distance" → "Difficult to walk 100 Miles, but vehicle transport is okay"
Measure of Data in Files – File Size
Name        Equal To            Size (in Bytes)
Bit         1 Bit               1/8
Nibble      4 Bits              1/2 (rare)
Byte        8 Bits              1
Kilobyte    1,024 Bytes         1,024
Megabyte    1,024 Kilobytes     1,048,576
Gigabyte    1,024 Megabytes     1,073,741,824
Terabyte    1,024 Gigabytes     1,099,511,627,776
Petabyte    1,024 Terabytes     1,125,899,906,842,624
Exabyte     1,024 Petabytes     1,152,921,504,606,846,976
Zettabyte   1,024 Exabytes      1,180,591,620,717,411,303,424
Yottabyte   1,024 Zettabytes    1,208,925,819,614,629,174,706,176
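The table uses the binary convention, in which each unit is 1,024 times the previous one. The sketch below reproduces that arithmetic in Python; the names `UNITS`, `size_in_bytes`, and `human_readable` are illustrative and do not come from the slides.

```python
# Binary (1,024-based) storage units, in the same order as the table.
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def size_in_bytes(unit: str) -> int:
    """Return the number of bytes in one `unit` (hypothetical helper)."""
    return 1024 ** UNITS.index(unit)


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1,024."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}s"
        value /= 1024


if __name__ == "__main__":
    for unit in UNITS:
        print(f"{unit:<10} {size_in_bytes(unit):,}")
    print(human_readable(5 * 1024 ** 3))
```

Python integers have arbitrary precision, so even the yottabyte row is computed exactly.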
Data
Types of Data and Its Representation
Structured Data
• Predefined data models
• Stored in rows and columns
• Examples: dates, phone numbers, names
Semi-Structured Data
• Loosely organized into categories using meta tags
• Stored in formats such as HTML, XML, JSON
• Examples: server logs, tweets organized by hashtags
Unstructured Data
• No predefined data models
• Stored in various forms: image, audio, video, text
• Examples: documents, image files, emails and messages
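As a rough illustration of the three categories, the Python sketch below handles one small sample of each. The sample values (names, phone numbers, tweets) are made up for the example and are not taken from the slides.

```python
import csv
import io
import json

# Structured: fixed columns; every row follows the same predefined model.
structured_source = (
    "name,phone,joined\n"
    "Asha,555-0100,2024-01-15\n"
    "Ravi,555-0101,2024-02-03\n"
)
rows = list(csv.DictReader(io.StringIO(structured_source)))

# Semi-structured: self-describing keys/tags, but fields may vary per record.
semi_structured_source = (
    '[{"user": "asha", "text": "Loving #datascience", "tags": ["datascience"]},'
    ' {"user": "ravi", "text": "Hello"}]'
)
tweets = json.loads(semi_structured_source)

# Unstructured: free text with no data model; any structure must be inferred.
unstructured_source = "Hi team, the quarterly report is attached. Call me on 555-0100."
word_count = len(unstructured_source.split())

print(rows[0]["phone"])                       # column access by name
print([t.get("tags", []) for t in tweets])    # optional field needs a default
print(word_count)                             # only crude measures are available
```

The structured rows can be addressed by column name, the semi-structured records need a default for missing fields, and the unstructured text offers nothing beyond what the program infers.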
Data Science
• Data science enables businesses to process huge amounts of structured and unstructured Big Data to detect patterns
• Asking Alexa or Siri for a recommendation relies on data science
• Operating a self-driving car
• Search engines
• Chatbots for customer service
Data Science Pre-Requisites
• Machine Learning
• Modeling
• Statistics
• Programming
• Databases
Data Science Lifecycle
• Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction - Gathering raw structured and unstructured data
• Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture - Taking the raw data and putting it in a form that can be used
• Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization - Data scientists take the prepared data and examine its patterns, ranges, and biases to determine how useful it will be in predictive analysis
• Analyze: Exploratory/Confirmatory Analysis, Predictive Analysis, Regression, Text Mining, Qualitative Analysis - Performing the various analyses on the data
• Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making - Analysts prepare the analyses in easily readable forms such as charts, graphs, and reports
Data Science Applications
• Healthcare
Healthcare companies are using data science to build sophisticated medical instruments to detect and cure diseases.
• Gaming
Video and computer games are now being created with the help of data science, which has taken the gaming experience to the next level.
• Image Recognition
Identifying patterns in images and detecting objects in an image is one of the most popular data science applications.
• Recommendation Systems
Netflix and Amazon give movie and product recommendations based on what you like to watch, purchase, or browse on their platforms.
• Logistics
Data science is used by logistics companies to optimize routes to ensure faster delivery of products and increase operational efficiency.
• Fraud Detection
Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.
Data Value Chain
Data Value Chain - The evolution of
data from collection to analysis,
dissemination, and the final impact of
data on decision making
Data Value Chain
• Data Capture & Acquisition
Collection of raw data from both internal and external sources. The first phase of data collection involves identifying what data to collect and then establishing a process to do so (e.g., conducting a survey or retrieving automated IoT data). Decisions made here will affect the quality and usability of data throughout its life cycle.
• Data Processing & Cleansing
Cleaning data (identifying and correcting corrupt, inaccurate, or irrelevant data) as well as converting raw data into a format that is usable, integrable, and machine-readable.
• Data Curation, Integration and Enrichment
Data curation and integration refer to the collection of processes required to merge data from multiple sources into one cohesive dataset. During this process, data is also enriched, meaning that contextual metadata (the data that makes larger datasets discoverable) is added or updated.
• Data Analysis
Data is analyzed and used to uncover trends, patterns and other insights that can enhance decision making.
• Data ROI & Monetization
The application of data analytics processes to solve real-world problems and, in a business setting, increase revenue.
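The cleansing, integration, and enrichment stages described above can be sketched in a few lines of Python. The two sources (`survey`, `sensor`), the field names, and the helper functions are all hypothetical; a real pipeline would read from files, databases, or APIs.

```python
from datetime import date

# Two hypothetical sources describing the same customers.
survey = [
    {"id": "1", "age": "34"},
    {"id": "2", "age": "n/a"},   # corrupt value
    {"id": "3", "age": " 41 "},  # stray whitespace
]
sensor = {"1": {"visits": 12}, "3": {"visits": 7}, "4": {"visits": 2}}


def clean(records):
    """Processing & cleansing: keep only records whose age is a number."""
    cleaned = []
    for rec in records:
        age = rec["age"].strip()
        if age.isdigit():
            cleaned.append({"id": rec["id"], "age": int(age)})
    return cleaned


def integrate(records, other):
    """Integration: merge the second source into matching records by id."""
    return [{**rec, **other[rec["id"]]} for rec in records if rec["id"] in other]


def enrich(records, source_name):
    """Enrichment: attach contextual metadata to the merged dataset."""
    return {
        "metadata": {
            "source": source_name,
            "created": date.today().isoformat(),
            "record_count": len(records),
        },
        "records": records,
    }


dataset = enrich(integrate(clean(survey), sensor), "survey+sensor")
print(dataset["records"])
```

Each stage is a separate function so that it can be tested and replaced independently, which mirrors the way the value chain treats them as distinct steps.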
Big Data Value Chain
Data Acquisition → Data Analysis → Data Curation → Data Storage → Data Usage
Big Data Value Chain – Data Acquisition
Process of gathering, filtering, and cleaning data before it is
put in a data warehouse or any other storage solution on
which data analysis can be carried out. Data acquisition is
one of the major big data challenges in terms of
infrastructure requirements. The infrastructure required to
support the acquisition of big data must deliver low,
predictable latency in both capturing data and in executing
queries; be able to handle very high transaction volumes,
often in a distributed environment; and support flexible and
dynamic data structures.
Big Data Value Chain – Data Analysis
Concerned with making the raw data acquired amenable to
use in decision-making as well as domain-specific usage. Data
analysis involves exploring, transforming and modelling data
with the goal of highlighting relevant data, synthesizing and
extracting useful hidden information with high potential
from a business point of view.
Big Data Value Chain – Data Curation
Data curation processes can be categorized into different
activities such as content creation, selection, classification,
transformation, validation, and preservation. Data curation
is responsible for improving the accessibility and quality of
data, ensuring that data are trustworthy, discoverable,
accessible, reusable, and fit for their purpose.
Big Data Value Chain – Data Storage
Data Storage is the persistence and management of data in a
scalable way that satisfies the needs of applications that
require fast access to the data. Relational Database
Management Systems (RDBMS) are predominantly used. NoSQL
technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on
alternative data models.
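To make the contrast between the relational model and an alternative data model concrete, here is a small Python sketch. It uses the standard-library `sqlite3` module for the relational side and plain JSON documents in a dictionary as a stand-in for a document store; the `products` table and its fields are invented for the example, and no real NoSQL database is involved.

```python
import json
import sqlite3

# Relational model: a fixed schema declared up front and queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [("keyboard", 25.0), ("monitor", 140.0)],
)
cheap = conn.execute("SELECT name FROM products WHERE price < ?", (100,)).fetchall()

# Document model (one of the alternative models used by NoSQL stores):
# each record is a self-contained document, and documents need not share fields.
documents = {
    "p1": json.dumps({"name": "keyboard", "price": 25.0}),
    "p2": json.dumps({"name": "monitor", "price": 140.0, "sizes": [24, 27]}),
}
with_sizes = [key for key, doc in documents.items() if "sizes" in json.loads(doc)]

print(cheap)
print(with_sizes)
conn.close()
```

Adding the `sizes` field required no schema change on the document side, whereas the relational table would need an altered or additional table.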
Big Data Value Chain – Data Usage
Covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data
analysis within the business activity. Data usage in business
decision-making can enhance competitiveness through
reduction of costs, increased added value, or any other
parameter that can be measured against existing performance
criteria.
Project Phases: Discover / Acquisition, Prepare, Plan, Model, Operationalize, Communicate Results
Basic Concepts of Big Data
• Big Data is characterized along five dimensions: Volume, Variety, Velocity, Value, Veracity
  - Volume: amount of data being generated
  - Velocity: speed at which the data is being generated
  - Variety: diversity of data types
  - Value: worth of the data
  - Veracity: quality, accuracy, or trustworthiness of the data
Big Data – Impact of 3V’s
Volume (Amount of Data): Dealing with large scales of data within data
processing (e.g., Global Supply Chains, Global Financial Analysis, Large Hadron
Collider).
Velocity (Speed of Data): Dealing with streams of high frequency of incoming
real-time data (e.g., Sensors, Pervasive Environments, Electronic Trading,
Internet of Things).
Variety (Range of Data Types/Sources): Dealing with data using differing
syntactic formats (e.g., Spreadsheets, XML, DBMS), schemas, and meanings
(e.g., Enterprise Data Integration).
Big Data Processing
The general categories of activities involved with big data processing are:
• Ingesting data into the system
• Persisting the data in storage
• Computing and analyzing data
• Visualizing the results
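The four activities listed above can be illustrated at toy scale with the Python standard library. The event names and the temporary JSON file are assumptions made for the sketch; real systems would use streaming ingestion, distributed storage, and a charting tool.

```python
import json
import os
import tempfile
from collections import Counter

# Ingest: read raw events (hard-coded here in place of a stream or API).
events = ["login", "search", "search", "purchase", "search", "login"]

# Persist: write the events to storage (a temporary JSON file in this sketch).
path = os.path.join(tempfile.mkdtemp(), "events.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(events, handle)

# Compute and analyze: load the stored data and count each event type.
with open(path, encoding="utf-8") as handle:
    counts = Counter(json.load(handle))

# Visualize: render a plain-text bar chart, most frequent event first.
chart = [f"{name:<10}{'#' * n}" for name, n in counts.most_common()]
print("\n".join(chart))
```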
Sources of Big Data
Categories:
• from human activities
• from the physical world
• from computers
Examples:
• Internet data (emails, social media, and weblogs), network data, mobile networks or telecoms, machine-to-machine data or the IoT (sensor data), online transactions, medical records, and open data (mostly published by governments).
• This data may be unstructured (such as text, audio, video) or semi-structured (such as emails, tweets, weblogs).
Data Analytics
Step 1: Determine the criteria for grouping the data
Step 2: Collect the data
Step 3: Organize the data
Step 4: Clean the data
Step 5: Analyze the data and derive insights
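The five steps can be walked through on a tiny dataset. In the Python sketch below, the grouping criterion (`region`), the sales records, and the choice of an average as the "insight" are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Step 1: grouping criterion (hypothetical): group sales by region.
GROUP_KEY = "region"

# Step 2: collect raw records (hard-coded here in place of a real source).
raw = [
    {"region": "north", "amount": "120"},
    {"region": "South", "amount": "80"},
    {"region": "north", "amount": ""},      # missing value
    {"region": "south", "amount": "100"},
]

# Step 3: organize the records into groups, normalizing the key's case.
groups = defaultdict(list)
for record in raw:
    groups[record[GROUP_KEY].lower()].append(record["amount"])

# Step 4: clean each group by discarding blanks and converting to numbers.
cleaned = {
    key: [float(value) for value in values if value.strip()]
    for key, values in groups.items()
}

# Step 5: analyze, here with a simple average per group.
insights = {key: mean(values) for key, values in cleaned.items() if values}
print(insights)
```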
Big Data Analytics
Big data analytics helps organizations
harness their data and use it to identify
new opportunities. That, in turn, leads
to smarter business moves, more
efficient operations, higher profits and
happier customers.
Big Data Analytics - Techniques
Data Mining / Analytics
Web Mining
Text Mining / Analytics
Predictive Analytics
Visual Analytics
Machine Learning / AI / Deep Learning
Mobile Analytics
Crowdsourcing
Big Data Analytics - Tools
• Hadoop
  - For distributed storage of large datasets on computer clusters
  - Designed to process large amounts of structured and unstructured data
  - Provides large amounts of storage for all sorts of data, along with the ability to handle virtually limitless concurrent tasks
• MapReduce
  - Programming model introduced by Google for processing massive amounts of data
  - Software framework that enables developers to write programs that can process large amounts of unstructured data
  - It has two components:
    - Map, which distributes the input data across several nodes of a cluster for parallel processing
    - Reduce, which collects all sub-results to produce the final result
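The Map and Reduce components can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only imitates the model: the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative, and it does not use Hadoop or any real MapReduce API.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


chunks = ["big data big insight", "data drives insight", "big data"]
# On a real cluster each chunk would be mapped on a different node in parallel.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread across many machines without changing the logic.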
Big Data Analytics - Tools
• NoSQL
  - Used in Big Data applications in clustered environments
  - Provides high-speed access to unstructured or semi-structured data
  - Provides capabilities to query and retrieve unstructured and semi-structured data
• MongoDB
  - For managing data that is frequently changing or unstructured
  - Flexible, highly scalable database designed for web applications
  - Used to store data in mobile apps, product catalogs, and real-time applications
• Other tools include Hive, Cassandra, Spark, Tableau, Talend, and cloud computing
Big Data Analytics - Applications
• Internet of Things (IoT)
• Smart Grid
• Science
• Healthcare
• Nursing
• Business
• Industry
• Manufacturing
• Public Agencies
Big Data Analytics - Benefits
• Leads to better decisions and improved insights and predictions, which can bring greater operational efficiency and productivity along with reduced cost and risk
• Reduces the biases people have when making decisions based on limited information
• Allows data analysis to be built into processes, enabling automated decision-making
• Helps reduce product return rates and produce higher-quality products
• Improves the overall profitability of the business
• Helps social media companies and public and private agencies explore people's behavioral patterns
• Can potentially be used to drive economic growth in the developing world
Big Data Challenges
• Complexities
  - Processing, storing, and transferring data at large scale
  - Filtering out useless information without discarding useful information
• Privacy
  - Risk of data leakage
  - Privacy concerns continue to arise from users who outsource their private data to cloud storage
• Security
  - Concerns over the impact that collecting, storing, and processing large amounts of data could have on security
  - Security is a concern because of the variety and heterogeneity of Big Data
• Data Migration
  - Transferring Big Data for distributed processing and storage
• Shortage of skilled personnel (data scientists)
Appendix
Data Science
Data Science
Data Science

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
Data science
Data scienceData science
Data science
 
Data science
Data science Data science
Data science
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data science
Data scienceData science
Data science
 
Introduction to data science club
Introduction to data science clubIntroduction to data science club
Introduction to data science club
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introduction
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientist
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Data science
Data scienceData science
Data science
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analytics
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
data science
data sciencedata science
data science
 
Data science life cycle
Data science life cycleData science life cycle
Data science life cycle
 
Data science
Data scienceData science
Data science
 
Data analytics
Data analyticsData analytics
Data analytics
 

Semelhante a Data Science

Semelhante a Data Science (20)

BDA Mod1@AzDOCUMENTS.in.pdf
BDA Mod1@AzDOCUMENTS.in.pdfBDA Mod1@AzDOCUMENTS.in.pdf
BDA Mod1@AzDOCUMENTS.in.pdf
 
Big data
Big dataBig data
Big data
 
CS8091_BDA_Unit_I_Analytical_Architecture
CS8091_BDA_Unit_I_Analytical_ArchitectureCS8091_BDA_Unit_I_Analytical_Architecture
CS8091_BDA_Unit_I_Analytical_Architecture
 
data collection, data integration, data management, data modeling.pptx
data collection, data integration, data management, data modeling.pptxdata collection, data integration, data management, data modeling.pptx
data collection, data integration, data management, data modeling.pptx
 
Big data and oracle
Big data and oracleBig data and oracle
Big data and oracle
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
 
Abstract
AbstractAbstract
Abstract
 
U - 2 Emerging.pptx
U - 2 Emerging.pptxU - 2 Emerging.pptx
U - 2 Emerging.pptx
 
Big data Introduction
Big data IntroductionBig data Introduction
Big data Introduction
 
Real World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining ToolsReal World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining Tools
 
Bigdata
Bigdata Bigdata
Bigdata
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth Enhancement
 
All About Big Data
All About Big Data All About Big Data
All About Big Data
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
 
What is Big Data - Edvicon
What is Big Data - EdviconWhat is Big Data - Edvicon
What is Big Data - Edvicon
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Chapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxChapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptx
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdfACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
 
Big Data Analytics_Unit1.pptx
Big Data Analytics_Unit1.pptxBig Data Analytics_Unit1.pptx
Big Data Analytics_Unit1.pptx
 

Mais de Prakhyath Rai

Mais de Prakhyath Rai (12)

Ethics, Professionalism and Other Emerging Technologies
Ethics, Professionalism and Other Emerging TechnologiesEthics, Professionalism and Other Emerging Technologies
Ethics, Professionalism and Other Emerging Technologies
 
Internet of Things (IoT)
Internet of Things (IoT)Internet of Things (IoT)
Internet of Things (IoT)
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Emerging Exponential Technologies - History & Introduction
Emerging Exponential Technologies - History & IntroductionEmerging Exponential Technologies - History & Introduction
Emerging Exponential Technologies - History & Introduction
 
Preparation of Project
Preparation of ProjectPreparation of Project
Preparation of Project
 
Small Scale Industry
Small Scale IndustrySmall Scale Industry
Small Scale Industry
 
Entrepreneurship
EntrepreneurshipEntrepreneurship
Entrepreneurship
 
Directing and Controlling
Directing and ControllingDirecting and Controlling
Directing and Controlling
 
Planning
PlanningPlanning
Planning
 
Introduction to Management
Introduction to Management Introduction to Management
Introduction to Management
 
Text MIning
Text MIningText MIning
Text MIning
 
Text Mining Framework
Text Mining FrameworkText Mining Framework
Text Mining Framework
 

Último

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Último (20)

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 

Data Science

  • 2. Data Science  Overview of Data Science  Definition of Data and Information  Data Types and Representation  Data Value Chain  Data Acquisition  Data Analysis  Data Curating  Data Storage  Data Usage  Basic Concepts of Big Data
  • 3. Overview of Data Science  Data science is the practice of mining large data sets of raw data, both structured and unstructured, to identify patterns and extract actionable insight from them.  Data Science deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions.  Data Science is a blend of various fields like Probability, Statistics, Programming, Analysis, Cloud Computing, etc.;  Data Science is the extraction of actionable insights from raw data.
  • 4. Data Information Data  Raw facts, figures and statistics  No contextual meaning  Data can be in characters, numbers, images, words Information  Processed / Organized Data  Exact meaning and organized context  Organized and presented in context – Value added to data Context + Processing
  • 5. 100 100 Miles Difficult to walk 100 Miles but Vehicle transport is okay 100 Miles is a Far Distance
  • 6. Measure of Data in Files – File Size Name Equal To Size(In Bytes) Bit 1 Bit 1/8 Nibble 4 Bits 1/2 (rare) Byte 8 Bits 1 Kilobyte 1024 Bytes 1024 Megabyte 1, 024 Kilobytes 1, 048, 576 Gigabyte 1, 024 Megabytes 1, 073, 741, 824 Terrabyte 1, 024 Gigabytes 1, 099, 511, 627, 776 Petabyte 1, 024 Terabytes 1, 125, 899, 906, 842, 624 Exabyte 1, 024 Petabytes 1, 152, 921, 504, 606, 846, 976 Zettabyte 1, 024 Exabytes 1, 180, 591, 620, 717, 411, 303, 424 Yottabyte 1, 024 Zettabytes 1, 208, 925, 819, 614, 629, 174, 706, 176
  • 8. Types of Data and it’s Representation Structured Data Semi-Structured Data Unstructured Data  Predefined data models  Stored in Rows and Columns  Examples: Dates, Phone Number, Names  No predefined data models  Stored in various forms – image, audio, video, text  Examples: Documents, Image Files, Emails & Messages  Loosely organized into categories using meta tags  Stored in abstract and figures – HTML, XML, JSON  Examples: Server Logs, Tweets organized by Hashtags
  • 9.
  • 10.
  • 11. Data Science  Data science enables businesses to Process huge amounts of structured and unstructured Big Data to detect patterns  Alexa or Siri for a recommendation demands data science  Operating a self-driving car  Search Engine  Chatbot for customer service
  • 12. Data Science Pre-Requisites Machine Learning Modeling Statistics Programming Databases
  • 13. Data Science Lifecycle  Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction - Gathering raw structured and unstructured data  Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture - Taking the raw data and putting it in a form that can be used  Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization - Data scientists take the prepared data and examine its patterns, ranges, and biases to determine how useful it will be in predictive analysis  Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining, Qualitative Analysis - - Performing the various analyses on the data  Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making - Analysts prepare the analyses in easily readable forms such as charts, graphs, and reports
  • 14. Data Science Applications  Healthcare Healthcare companies are using data science to build sophisticated medical instruments to detect and cure diseases.  Gaming Video and computer games are now being created with the help of data science and that has taken the gaming experience to the next level.  Image Recognition Identifying patterns in images and detecting objects in an image is one of the most popular data science applications.  Recommendation Systems Netflix and Amazon give movie and product recommendations based on what you like to watch, purchase, or browse on their platforms.  Logistics Data Science is used by logistics companies to optimize routes to ensure faster delivery of products and increase operational efficiency.  Fraud Detection Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.
  • 15. Data Value Chain Data Value Chain - The evolution of data from collection to analysis, dissemination, and the final impact of data on decision making
  • 16. Data Value Chain  Data Capture & Acquisition Collection of raw data from both internal and external sources. The first phase of data collection involves identifying what data to collect and then establishing a process to do so (i.e., conducting a survey or retrieving automated IoT data). Decisions made here will affect the quality and usability of data throughout its life-cycle  Data Processing & Cleansing Cleaning data - identifying and correcting corrupt, inaccurate, or irrelevant data - as well as converting raw data into a format that is usable, integratable and machine readable.  Data Curation, Integration and Enrichment Data curation and integration refers to the collection of processes required to merge data from multiple sources into one, cohesive dataset. During this process, data is also enriched, meaning that contextual metadata (the data that makes larger datasets discoverable) is added or updated.  Data Analysis Data is analyzed and used to uncover trends, patterns and other insights that can enhance decision making.  Data ROI & Monetization The application of data analytics processes to solve real-world problems and, in a business setting, increase revenue.
  • 17. Big Data Value Chain Data Acquisition Data Analysis Data Curating Data Storage Data Usage
  • 18. Big Data Value Chain – Data Acquisition Process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out. Data acquisition is one of the major big data challenges in terms of infrastructure requirements. The infrastructure required to support the acquisition of big data must deliver low, predictable latency in both capturing data and in executing queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible and dynamic data structures.
  • 19. Big Data Value Chain – Data Analysis Concerned with making the raw data acquired amenable to use in decision-making as well as domain-specific usage. Data analysis involves exploring, transforming and modelling data with the goal of highlighting relevant data, synthesizing and extracting useful hidden information with high potential from a business point of view.
  • 20. Big Data Value Chain – Data Curation Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation. Data curation is responsible for improving the accessibility and quality of data, ensuring that data are trustworthy, discoverable, accessible, reusable, and fit their purpose.
  • 21. Big Data Value Chain – Data Storage Data Storage is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data. Relational Database Management Systems (RDBMS) are majorly used. NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
  • 22. Big Data Value Chain – Data Usage Covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity. Data usage in business decision-making can enhance competitiveness through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
  • 23. Discover / Acquisition Prepare Plan Model Operationalize Communicate Results Project Phase
  • 24. Basic Concepts of Big Data  Big Data: High Degrees of Dimensions – Volume, Variety, Velocity, Value, Veracity • Volume - Amount of the data that is been generated • Velocity - Speed at which the data is been generated • Variety - Diversity or different types of the data • Value – Worth of the data • Veracity - Quality, accuracy, or trustworthiness of the data
  • 25. Big Data – Impact of the 3 Vs Volume (Amount of Data): Dealing with large scales of data within data processing (e.g., Global Supply Chains, Global Financial Analysis, Large Hadron Collider). Velocity (Speed of Data): Dealing with high-frequency streams of incoming real-time data (e.g., Sensors, Pervasive Environments, Electronic Trading, Internet of Things). Variety (Range of Data Types/Sources): Dealing with data using differing syntactic formats (e.g., Spreadsheets, XML, DBMS), schemas, and meanings (e.g., Enterprise Data Integration).
  • 26. Big Data Processing The general categories of activities involved with big data processing are: • Ingesting data into the system • Persisting the data in storage • Computing and analyzing data • Visualizing the results
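The four activities above can be sketched end to end on a toy scale. The following is a minimal illustration only, assuming a tiny in-memory CSV feed (`RAW_FEED`, hypothetical) and an in-memory SQLite database standing in for a scalable store; the function names are illustrative and do not come from any big data framework.

```python
import csv
import io
import sqlite3

# Hypothetical sample input standing in for an incoming data feed.
RAW_FEED = """sensor,reading
a,10
b,20
a,30
b,40
"""


def ingest(text):
    """Ingest: parse raw CSV text into (sensor, reading) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["sensor"], float(row["reading"])) for row in reader]


def persist(conn, records):
    """Persist: store the records in a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, reading REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()


def compute(conn):
    """Compute/analyze: average reading per sensor."""
    cursor = conn.execute(
        "SELECT sensor, AVG(reading) FROM readings GROUP BY sensor ORDER BY sensor"
    )
    return dict(cursor.fetchall())


def visualize(averages):
    """Visualize: plain-text bar chart, one '#' per 5 units of the average."""
    return [
        f"{sensor} {'#' * int(avg // 5)} {avg:.1f}"
        for sensor, avg in averages.items()
    ]


conn = sqlite3.connect(":memory:")
persist(conn, ingest(RAW_FEED))
averages = compute(conn)
chart = visualize(averages)
print("\n".join(chart))
```

In a real deployment each stage would typically be handled by a dedicated, distributed component; the sketch only shows how the stages hand data to one another.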
  • 27. Sources of Big Data Categories: • from human activities • from the physical world • from computers Examples: • Internet data (emails, social media, and weblogs), network data, mobile networks or telecoms, machine-to-machine data or the IoT (sensor data), online transactions, medical records, and open data (mostly from governments). • Such data may be unstructured (such as text, audio, video) or semi-structured (such as emails, tweets, weblogs).
  • 28. Data Analytics Step 1: Determine the criteria for grouping the data Step 2: Collect the data Step 3: Organize the data Step 4: Clean the data Step 5: Analyze the data and derive insights
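The five steps above can be walked through on a small example. This is a sketch under stated assumptions: the sales records, the `region` grouping criterion, and the choice of the mean as the "insight" are all hypothetical and chosen only to make each step visible.

```python
from collections import defaultdict
from statistics import mean


# Step 1: determine the grouping criterion (assumed here: group by region).
def group_key(record):
    return record["region"]


# Step 2: collect the data (hypothetical hard-coded records; one has a missing amount).
collected = [
    {"region": "north", "amount": "120"},
    {"region": "south", "amount": "80"},
    {"region": "north", "amount": ""},
    {"region": "south", "amount": "100"},
    {"region": "north", "amount": "180"},
]

# Step 3: organize the data into groups according to the criterion.
groups = defaultdict(list)
for record in collected:
    groups[group_key(record)].append(record["amount"])

# Step 4: clean the data by dropping missing values and converting text to numbers.
cleaned = {
    region: [float(amount) for amount in amounts if amount.strip()]
    for region, amounts in groups.items()
}

# Step 5: analyze and derive insights (here, the average amount per region).
insights = {region: mean(values) for region, values in cleaned.items()}
print(insights)
```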
  • 29. Big Data Analytics Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers.
  • 30. Big Data Analytics - Techniques • Data Mining / Analytics • Web Mining • Text Mining / Analytics • Predictive Analytics • Visual Analytics • Machine Learning / AI / Deep Learning • Mobile Analytics • Crowdsourcing
  • 31. Big Data Analytics - Tools Hadoop • For distributed storage of large datasets on computer clusters • Designed to process large amounts of structured and unstructured data • Provides large amounts of storage for all sorts of data along with the ability to handle virtually limitless concurrent tasks MapReduce • Programming model introduced by Google for processing massive amounts of data • Software framework that enables developers to write programs that can process large amounts of unstructured data • It has two components: Map, which processes the input data in parallel across the nodes of a cluster, and Reduce, which collects all sub-results to produce the final result
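The Map and Reduce components described above can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are assumptions made for illustration, and a real framework would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the sub-results for one key into a single result."""
    return key, sum(values)


# Hypothetical input splits; on a cluster each would be mapped on a separate node.
documents = ["big data big insight", "data drives insight"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The shuffle step is shown explicitly because it is what lets each reduce call see every value for its key, regardless of which map call emitted it.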
  • 32. Big Data Analytics - Tools NoSQL • Used in Big Data applications in clustered environments • Provides high-speed access to unstructured or semi-structured data • Provides capabilities to query and retrieve unstructured and semi-structured data MongoDB • For managing data that are frequently changing or unstructured • Flexible, highly scalable database designed for web applications • Used to store data in mobile apps, product catalogs, and real-time applications Other tools include Hive, Cassandra, Spark, Tableau, Talend, and cloud computing
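The flexible, schema-less data model that document stores offer can be sketched with plain Python dictionaries. This is an illustration of the idea only and does not use MongoDB or any NoSQL client library: the `catalog` records and the helpers `find` and `with_field` are hypothetical.

```python
import json

# Hypothetical product catalog: documents in one collection need not share fields.
catalog = [
    {"name": "laptop", "price": 900, "specs": {"ram_gb": 16}},
    {"name": "ebook", "price": 12, "format": "epub"},
    {"name": "phone", "price": 600, "specs": {"ram_gb": 8}, "colors": ["black", "blue"]},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


def with_field(collection, field):
    """Return documents that contain the given top-level field at all."""
    return [doc for doc in collection if field in doc]


epubs = find(catalog, format="epub")
with_specs = [doc["name"] for doc in with_field(catalog, "specs")]

print(json.dumps(epubs))
print(with_specs)
```

A relational table would need a fixed set of columns for these records; here a new field such as `colors` can appear on one document without changing the others.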
  • 33. Big Data Analytics - Applications • Internet of Things (IoT) • Smart Grid • Science • Healthcare • Nursing • Business • Industry • Manufacturing • Public Agencies
  • 34. Big Data Analytics - Benefits • Leads to better decisions and improves insights and predictions, which can bring greater operational efficiency and productivity and reduce cost and risk • Reduces the biases people have when making decisions based on limited information • Allows data analysis to be built into processes, enabling automated decision-making • Helps reduce rates of product returns and produce high-quality products • Improves the overall profitability of a business • Helps social media companies and public and private agencies explore the behavioral patterns of people • Can potentially be used to drive economic growth in the developing world
  • 35. Big Data Challenges Complexities • Processing, storage, and transfer of data at large scale • Filtering out useless information without discarding useful information Privacy • Risk of data leakage • Privacy concerns continue to arise from users who outsource their private data to cloud storage Security • Concerns over the impact that collecting, storing, and processing large amounts of data could have on security • Security is a concern because of the variety and heterogeneity of Big Data Data Migration • Transferring Big Data for distributed processing and storage Shortage of human resources (data scientists)
  • 36. References • http://www.dataeconomy.eu/data-value-chain/#page-content • https://rd-springer-com.eu1.proxy.openathens.net/chapter/10.1007/978-3-319-21569-3_4 • https://rd-springer-com.eu1.proxy.openathens.net/chapter/10.1007/978-3-319-21569-3_5 • https://rd-springer-com.eu1.proxy.openathens.net/chapter/10.1007/978-3-319-21569-3_6 • https://rd-springer-com.eu1.proxy.openathens.net/chapter/10.1007/978-3-319-21569-3_7 • https://rd-springer-com.eu1.proxy.openathens.net/chapter/10.1007/978-3-319-21569-3_8 • https://www.ibm.com/cloud/blog/structured-vs-unstructured-data