The document provides an overview of key concepts in data science, including data types, the data value chain, and big data. It defines data science as extracting insights from large, diverse datasets using tools such as machine learning. The data value chain covers acquiring, analyzing, curating, storing, and using data. Big data is characterized by its volume, velocity, variety, value, and veracity. Common techniques for big data analytics include data mining, machine learning, and visualization.
2. Data Science
Overview of Data Science
Definition of Data and Information
Data Types and Representation
Data Value Chain
Data Acquisition
Data Analysis
Data Curating
Data Storage
Data Usage
Basic Concepts of Big Data
3. Overview of Data Science
Data science is the practice of mining large sets of raw data, both structured and unstructured, to identify patterns and extract actionable insights from them.
Data science deals with vast volumes of data, using modern tools and techniques to find unseen patterns, derive meaningful information, and support business decisions.
Data science is a blend of various fields such as probability, statistics, programming, analysis, and cloud computing.
In short: data science is the extraction of actionable insights from raw data.
4. Data vs. Information
Data
• Raw facts, figures, and statistics
• No contextual meaning
• Can take the form of characters, numbers, images, or words
Information
• Processed / organized data
• Exact meaning in an organized context
• Organized and presented in context – value added to data
• Produced by adding context and processing to data
8. Types of Data and Their Representation
Structured Data
• Predefined data models
• Stored in rows and columns
• Examples: dates, phone numbers, names
Semi-Structured Data
• Loosely organized into categories using meta tags
• Stored in self-describing formats such as HTML, XML, and JSON
• Examples: server logs, tweets organized by hashtags
Unstructured Data
• No predefined data models
• Stored in various forms – image, audio, video, text
• Examples: documents, image files, emails, and messages
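The three representations above can be illustrated in a few lines of Python. This is a minimal sketch using only the standard library; the sample records (names, phone numbers, the tweet, the email text) are invented for illustration:

```python
import csv, io, json

# Structured: predefined model, stored in rows and columns (a tiny CSV table)
table = io.StringIO("name,phone,date\nAlice,555-0100,2023-01-15\n")
rows = list(csv.DictReader(table))
print(rows[0]["name"])            # every record has the same fixed fields

# Semi-structured: self-describing keys act as meta tags (a JSON record)
tweet = json.loads('{"user": "alice", "text": "hi", "hashtags": ["bigdata"]}')
print(tweet["hashtags"])          # fields may vary from record to record

# Unstructured: no data model; meaning must be mined from the raw content
email_body = "Hi team, attached are the figures we discussed last week."
print(len(email_body.split()))    # analysis starts from raw text, pixels, audio
```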
11. Data Science
Data science enables businesses to process huge amounts of structured and unstructured Big Data to detect patterns. Everyday examples that depend on data science:
• Asking Alexa or Siri for a recommendation
• Operating a self-driving car
• Search engines
• Chatbots for customer service
13. Data Science Lifecycle
Capture (Data Acquisition, Data Entry, Signal Reception, Data Extraction): gathering raw structured and unstructured data.
Maintain (Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture): putting the raw data into a form that can be used.
Process (Data Mining, Clustering/Classification, Data Modeling, Data Summarization): data scientists examine the prepared data's patterns, ranges, and biases to determine how useful it will be for predictive analysis.
Analyze (Exploratory/Confirmatory Analysis, Predictive Analysis, Regression, Text Mining, Qualitative Analysis): performing the various analyses on the data.
Communicate (Data Reporting, Data Visualization, Business Intelligence, Decision Making): analysts present the analyses in easily readable forms such as charts, graphs, and reports.
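As a concrete illustration, here is a minimal end-to-end sketch of the lifecycle in Python. It assumes the pandas library is available; the tiny inline dataset, its column names, and the region comparison are invented for illustration:

```python
import pandas as pd

# Capture: acquire raw data (an inline stand-in for a survey or sensor feed)
raw = pd.DataFrame({
    "region": ["north", "south", "south", None, "north"],
    "sales":  [120.0, 95.5, None, 80.0, 132.5],
})

# Maintain: cleanse and stage the raw data into a usable form
clean = raw.dropna().copy()          # drop records with missing fields
clean["sales"] = clean["sales"].astype(float)

# Process: summarize to see the data's patterns, ranges, and potential biases
summary = clean.groupby("region")["sales"].agg(["count", "mean", "min", "max"])

# Analyze: a simple confirmatory check across groups
gap = summary.loc["north", "mean"] - summary.loc["south", "mean"]

# Communicate: report the results in a readable form
print(summary)
print(f"north outsells south by {gap:.1f} on average")
```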
14. Data Science Applications
Healthcare
Healthcare companies are using data science to build sophisticated medical instruments to detect and cure diseases.
Gaming
Video and computer games are now being created with the help of data science and that has taken the gaming experience to the
next level.
Image Recognition
Identifying patterns in images and detecting objects in an image is one of the most popular data science applications.
Recommendation Systems
Netflix and Amazon give movie and product recommendations based on what you like to watch, purchase, or browse on their
platforms.
Logistics
Data Science is used by logistics companies to optimize routes to ensure faster delivery of products and increase operational
efficiency.
Fraud Detection
Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.
15. Data Value Chain
The data value chain: the evolution of data from collection to analysis and dissemination, and the final impact of data on decision making.
16. Data Value Chain
Data Capture & Acquisition
Collection of raw data from both internal and external sources. The first phase involves identifying what data to collect and then establishing a process for doing so (e.g., conducting a survey or retrieving automated IoT data). Decisions made here affect the quality and usability of the data throughout its life cycle.
Data Processing & Cleansing
Cleaning data – identifying and correcting corrupt, inaccurate, or irrelevant records – as well as converting raw data into a format that is usable, easy to integrate, and machine readable (see the sketch below).
Data Curation, Integration and Enrichment
Data curation and integration refer to the collection of processes required to merge data from multiple sources into one cohesive dataset. During this process the data is also enriched, meaning that contextual metadata (the data that makes larger datasets discoverable) is added or updated.
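A minimal sketch of the cleansing, integration, and enrichment steps, assuming pandas is available. The two toy sources and their fields are invented, and DataFrame.attrs is used here merely as one lightweight way to attach contextual metadata:

```python
import pandas as pd

# Two raw sources to be merged into one cohesive dataset
orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [50.0, None, 75.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["KE", "ET"]})

# Cleansing: identify and drop corrupt or missing values
orders = orders.dropna(subset=["amount"])

# Integration: merge the sources on a shared key into one dataset
merged = pd.merge(orders, customers, on="customer_id", how="left")

# Enrichment: attach contextual metadata that makes the dataset discoverable
merged.attrs["source"] = "orders joined with customers"
merged.attrs["collected"] = "2024-01-01"
print(merged, merged.attrs)
```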
Data Analysis
Data is analyzed and used to uncover trends, patterns and other insights that can enhance decision making.
Data ROI & Monetization
The application of data analytics processes to solve real-world problems and, in a business setting, increase revenue.
17. Big Data Value Chain
Data Acquisition → Data Analysis → Data Curation → Data Storage → Data Usage
18. Big Data Value Chain – Data Acquisition
Process of gathering, filtering, and cleaning data before it is
put in a data warehouse or any other storage solution on
which data analysis can be carried out. Data acquisition is
one of the major big data challenges in terms of
infrastructure requirements. The infrastructure required to
support the acquisition of big data must deliver low,
predictable latency in both capturing data and in executing
queries; be able to handle very high transaction volumes,
often in a distributed environment; and support flexible and
dynamic data structures.
19. Big Data Value Chain – Data Analysis
Concerned with making the raw data acquired amenable to
use in decision-making as well as domain-specific usage. Data
analysis involves exploring, transforming and modelling data
with the goal of highlighting relevant data, synthesizing and
extracting useful hidden information with high potential
from a business point of view.
20. Big Data Value Chain – Data Curation
Data curation processes can be categorized into different
activities such as content creation, selection, classification,
transformation, validation, and preservation. Data curation
is responsible for improving the accessibility and quality of
data, ensuring that data are trustworthy, discoverable,
accessible, reusable, and fit their purpose.
21. Big Data Value Chain – Data Storage
Data storage is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data. Relational Database Management Systems (RDBMS) have long been the dominant solution, while NoSQL technologies are designed with scalability in mind and offer a wide range of solutions based on alternative data models.
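For contrast, a minimal relational example using Python's built-in sqlite3 module (the table and sample sensor readings are invented). NoSQL stores relax this fixed row-and-column schema in order to scale horizontally:

```python
import sqlite3

# A relational store: fixed schema, rows and columns, SQL access
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [("s1", 20.5), ("s1", 21.0), ("s2", 19.8)])

# Fast, structured access via a declarative query
for sensor, avg in con.execute(
        "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"):
    print(sensor, avg)
```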
22. Big Data Value Chain – Data Usage
Covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data
analysis within the business activity. Data usage in business
decision-making can enhance competitiveness through
reduction of costs, increased added value, or any other
parameter that can be measured against existing performance
criteria.
23. Project Phases
Discover / Acquisition → Prepare → Plan → Model → Operationalize → Communicate Results
24. Basic Concepts of Big Data
Big Data is characterized along five dimensions (the 5 V's) – Volume, Variety, Velocity, Value, Veracity:
• Volume – the amount of data being generated
• Velocity – the speed at which the data is generated
• Variety – the diversity of data types
• Value – the worth of the data
• Veracity – the quality, accuracy, or trustworthiness of the data
25. Big Data – Impact of the 3 V's
Volume (amount of data): dealing with large scales of data within data processing (e.g., global supply chains, global financial analysis, the Large Hadron Collider).
Velocity (speed of data): dealing with high-frequency streams of incoming real-time data (e.g., sensors, pervasive environments, electronic trading, the Internet of Things).
Variety (range of data types and sources): dealing with data in differing syntactic formats (e.g., spreadsheets, XML, DBMS), schemas, and meanings (e.g., enterprise data integration).
26. Big Data Processing
The general categories of activities involved with big data processing are:
Ingesting data into the system
Persisting the data in storage
Computing and Analyzing data
Visualizing the results
27. Sources of Big Data
Categories:
• from human activities
• from the physical world
• from computers
Examples: Internet data (emails, social media, and weblogs), network data, mobile networks or telecoms, machine-to-machine data or the IoT (sensor data), online transactions, medical records, and open data (mostly published by governments). Such data may be unstructured (text, audio, video) or semi-structured (emails, tweets, weblogs).
28. Data Analytics
Step 1: Determine the criteria for grouping the data
Step 2: Collect the data
Step 3: Organize the data
Step 4: Clean the data
Step 5: Analyze the data and derive insights
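The five steps can be traced in a short standard-library Python sketch. The transaction records and the "category" grouping criterion are invented for illustration:

```python
from collections import defaultdict

# Step 1: grouping criterion – transactions grouped by product category
records = [
    {"category": "books", "price": 12.0},
    {"category": "games", "price": 40.0},
    {"category": "books", "price": "9.5"},  # dirty value: price stored as text
    {"category": "books", "price": None},   # dirty value: missing price
]

# Steps 2-3: collect and organize the records under each group key
groups = defaultdict(list)
for rec in records:
    groups[rec["category"]].append(rec["price"])

# Step 4: clean – drop missing values, coerce text numbers to floats
cleaned = {cat: [float(p) for p in prices if p is not None]
           for cat, prices in groups.items()}

# Step 5: analyze – average spend per category
for cat, prices in cleaned.items():
    print(cat, sum(prices) / len(prices))
```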
29. Big Data Analytics
Big data analytics helps organizations
harness their data and use it to identify
new opportunities. That, in turn, leads
to smarter business moves, more
efficient operations, higher profits and
happier customers.
30. Big Data Analytics - Techniques
Data Mining / Analytics
Web Mining
Text Mining / Analytics
Predictive Analytics
Visual Analytics
Machine Learning / AI / Deep Learning
Mobile Analytics
Crowdsourcing
31. Big Data Analytics - Tools
Hadoop
• For distributed storage of large datasets on computer clusters
• Designed to process large amounts of structured and unstructured data
• Provides large amounts of storage for all sorts of data, along with the ability to handle virtually limitless concurrent tasks
MapReduce
• Google-originated technology for processing massive amounts of data
• A software framework that enables developers to write programs that process large amounts of unstructured data
• It has two components:
Map, which distributes the input data across cluster nodes for parallel processing
Reduce, which collects all the sub-results to produce the final result
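The two-component model can be sketched in a few lines of single-process Python. This is a sketch of the programming model only, not of Hadoop itself: a hypothetical word count stands in for a real distributed job, and the shuffle step the framework would normally perform is done by hand:

```python
from collections import defaultdict
from itertools import chain

documents = ["big data needs big tools", "map reduce splits big jobs"]

# Map: turn each input chunk into intermediate (key, value) pairs;
# in a real cluster this runs on many nodes in parallel
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = chain.from_iterable(map_phase(d) for d in documents)

# Shuffle: group intermediate pairs by key (handled by the framework)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: collect all sub-results per key into the final result
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)  # e.g. {'big': 3, 'data': 1, ...}
```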
32. Big Data Analytics - Tools
NoSQL
• Used in Big Data applications in clustered environments
• Provides high-speed access to unstructured or semi-structured data
• Provides capabilities to query and retrieve unstructured and semi-structured data
MongoDB
• For managing data that is frequently changing or unstructured
• A flexible, highly scalable database designed for web applications
• Used to store data in mobile apps, product catalogs, and real-time applications
Other tools include Hive, Cassandra, Spark, Tableau, Talend, and cloud computing platforms.
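A minimal usage sketch with the pymongo driver, assuming a MongoDB server is running locally on the default port; the database, collection, and document contents are invented:

```python
from pymongo import MongoClient  # assumes the pymongo driver is installed

# Connect to a local MongoDB server (default port assumed)
client = MongoClient("mongodb://localhost:27017")
catalog = client["shop"]["products"]

# Documents need no predefined schema: fields can vary per record
catalog.insert_one({"name": "laptop", "price": 900, "tags": ["electronics"]})
catalog.insert_one({"name": "novel", "price": 15, "pages": 320})

# Query semi-structured data by any field
for doc in catalog.find({"price": {"$lt": 100}}):
    print(doc["name"])
```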
33. Big Data Analytics - Applications
Internet of Things (IoT), Smart Grid, Science, Healthcare, Nursing, Business, Industry, Manufacturing, Public Agencies
34. Big Data Analytics - Benefits
• Leads to better decisions and improved insights and predictions, which in turn bring greater operational efficiency, higher productivity, and reduced cost and risk
• Eliminates the biases people have when making decisions based on limited information
• Allows data analysis to be built into processes, enabling automated decision-making
• Helps reduce product return rates and produce higher-quality products
• Improves the overall profitability of the business
• Helps social media platforms and public and private agencies explore behavioral patterns of people
• Can potentially drive economic growth in the developing world
35. Big Data Challenges
Complexities
• Processing, storing, and transferring data at very large scale
• Filtering out useless information without discarding useful information
Privacy
• Risk of data leakage
• Privacy concerns continually arise from users who outsource their private data to cloud storage
Security
• Concerns over the impact that collecting, storing, and processing large amounts of data can have on security
• Security is a concern because of the variety and heterogeneity of Big Data
Data Migration
• Transferring Big Data for distributed processing and storage
Shortage of Human Resources
• Too few skilled data scientists to meet demand