The document provides an overview of key concepts in data science, including data types, the data value chain, and big data. It defines data science as extracting insights from large, diverse datasets using tools such as machine learning. The data value chain covers acquiring, analyzing, curating, storing, and using data. Big data is characterized by its volume, velocity, variety, value, and veracity. Common techniques for big data analytics include data mining, machine learning, and visualization.
2. Data Science
Overview of Data Science
Definition of Data and Information
Data Types and Representation
Data Value Chain
Data Acquisition
Data Analysis
Data Curating
Data Storage
Data Usage
Basic Concepts of Big Data
3. Overview of Data Science
Data science is the practice of mining large sets of raw data, both structured and unstructured, to identify patterns and extract actionable insights from them.
Data science deals with vast volumes of data, using modern tools and techniques to find unseen patterns, derive meaningful information, and support business decisions.
Data science is a blend of various fields such as probability, statistics, programming, analysis, and cloud computing.
In short: data science is the extraction of actionable insights from raw data.
4. Data vs. Information
Data
• Raw facts, figures, and statistics
• No contextual meaning
• Can take the form of characters, numbers, images, or words
Information
• Processed / organized data
• Exact meaning in an organized context
• Organized and presented in context – value added to data
• Produced by adding context and processing to data
8. Types of Data and Their Representation
Structured Data
• Predefined data models
• Stored in rows and columns
• Examples: dates, phone numbers, names
Semi-Structured Data
• Loosely organized into categories using meta tags
• Stored in self-describing formats such as HTML, XML, and JSON
• Examples: server logs, tweets organized by hashtags
Unstructured Data
• No predefined data models
• Stored in various forms – image, audio, video, text
• Examples: documents, image files, emails, and messages
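The three representations above can be illustrated in a few lines of Python. This is a minimal sketch using only the standard library; the sample records (names, phone numbers, the tweet, the email text) are invented for illustration:

```python
import csv, io, json

# Structured: predefined model, stored in rows and columns (a tiny CSV table)
table = io.StringIO("name,phone,date\nAlice,555-0100,2023-01-15\n")
rows = list(csv.DictReader(table))
print(rows[0]["name"])            # every record has the same fixed fields

# Semi-structured: self-describing keys act as meta tags (a JSON record)
tweet = json.loads('{"user": "alice", "text": "hi", "hashtags": ["bigdata"]}')
print(tweet["hashtags"])          # fields may vary from record to record

# Unstructured: no data model; meaning must be mined from the raw content
email_body = "Hi team, attached are the figures we discussed last week."
print(len(email_body.split()))    # analysis starts from raw text, pixels, audio
```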
11. Data Science
Data science enables businesses to process huge amounts of structured and unstructured Big Data to detect patterns. Everyday examples that depend on data science:
• Asking Alexa or Siri for a recommendation
• Operating a self-driving car
• Search engines
• Chatbots for customer service
13. Data Science Lifecycle
Capture (Data Acquisition, Data Entry, Signal Reception, Data Extraction): gathering raw structured and unstructured data.
Maintain (Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture): putting the raw data into a form that can be used.
Process (Data Mining, Clustering/Classification, Data Modeling, Data Summarization): data scientists examine the prepared data's patterns, ranges, and biases to determine how useful it will be for predictive analysis.
Analyze (Exploratory/Confirmatory Analysis, Predictive Analysis, Regression, Text Mining, Qualitative Analysis): performing the various analyses on the data.
Communicate (Data Reporting, Data Visualization, Business Intelligence, Decision Making): analysts present the analyses in easily readable forms such as charts, graphs, and reports.
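As a concrete illustration, here is a minimal end-to-end sketch of the lifecycle in Python. It assumes the pandas library is available; the tiny inline dataset, its column names, and the region comparison are invented for illustration:

```python
import pandas as pd

# Capture: acquire raw data (an inline stand-in for a survey or sensor feed)
raw = pd.DataFrame({
    "region": ["north", "south", "south", None, "north"],
    "sales":  [120.0, 95.5, None, 80.0, 132.5],
})

# Maintain: cleanse and stage the raw data into a usable form
clean = raw.dropna().copy()          # drop records with missing fields
clean["sales"] = clean["sales"].astype(float)

# Process: summarize to see the data's patterns, ranges, and potential biases
summary = clean.groupby("region")["sales"].agg(["count", "mean", "min", "max"])

# Analyze: a simple confirmatory check across groups
gap = summary.loc["north", "mean"] - summary.loc["south", "mean"]

# Communicate: report the results in a readable form
print(summary)
print(f"north outsells south by {gap:.1f} on average")
```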
14. Data Science Applications
Healthcare
Healthcare companies are using data science to build sophisticated medical instruments to detect and cure diseases.
Gaming
Video and computer games are now being created with the help of data science and that has taken the gaming experience to the
next level.
Image Recognition
Identifying patterns in images and detecting objects in an image is one of the most popular data science applications.
Recommendation Systems
Netflix and Amazon give movie and product recommendations based on what you like to watch, purchase, or browse on their
platforms.
Logistics
Data Science is used by logistics companies to optimize routes to ensure faster delivery of products and increase operational
efficiency.
Fraud Detection
Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.
15. Data Value Chain
The data value chain: the evolution of data from collection to analysis and dissemination, and the final impact of data on decision making.
16. Data Value Chain
Data Capture & Acquisition
Collection of raw data from both internal and external sources. The first phase involves identifying what data to collect and then establishing a process for doing so (e.g., conducting a survey or retrieving automated IoT data). Decisions made here affect the quality and usability of the data throughout its life cycle.
Data Processing & Cleansing
Cleaning data – identifying and correcting corrupt, inaccurate, or irrelevant records – as well as converting raw data into a format that is usable, easy to integrate, and machine readable (see the sketch below).
Data Curation, Integration and Enrichment
Data curation and integration refer to the collection of processes required to merge data from multiple sources into one cohesive dataset. During this process the data is also enriched, meaning that contextual metadata (the data that makes larger datasets discoverable) is added or updated.
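A minimal sketch of the cleansing, integration, and enrichment steps, assuming pandas is available. The two toy sources and their fields are invented, and DataFrame.attrs is used here merely as one lightweight way to attach contextual metadata:

```python
import pandas as pd

# Two raw sources to be merged into one cohesive dataset
orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [50.0, None, 75.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["KE", "ET"]})

# Cleansing: identify and drop corrupt or missing values
orders = orders.dropna(subset=["amount"])

# Integration: merge the sources on a shared key into one dataset
merged = pd.merge(orders, customers, on="customer_id", how="left")

# Enrichment: attach contextual metadata that makes the dataset discoverable
merged.attrs["source"] = "orders joined with customers"
merged.attrs["collected"] = "2024-01-01"
print(merged, merged.attrs)
```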
Data Analysis
Data is analyzed and used to uncover trends, patterns and other insights that can enhance decision making.
Data ROI & Monetization
The application of data analytics processes to solve real-world problems and, in a business setting, increase revenue.
17. Big Data Value Chain
Data Acquisition → Data Analysis → Data Curation → Data Storage → Data Usage
18. Big Data Value Chain – Data Acquisition
Process of gathering, filtering, and cleaning data before it is
put in a data warehouse or any other storage solution on
which data analysis can be carried out. Data acquisition is
one of the major big data challenges in terms of
infrastructure requirements. The infrastructure required to
support the acquisition of big data must deliver low,
predictable latency in both capturing data and in executing
queries; be able to handle very high transaction volumes,
often in a distributed environment; and support flexible and
dynamic data structures.
19. Big Data Value Chain – Data Analysis
Concerned with making the raw data acquired amenable to
use in decision-making as well as domain-specific usage. Data
analysis involves exploring, transforming and modelling data
with the goal of highlighting relevant data, synthesizing and
extracting useful hidden information with high potential
from a business point of view.
20. Big Data Value Chain – Data Curation
Data curation processes can be categorized into different
activities such as content creation, selection, classification,
transformation, validation, and preservation. Data curation
is responsible for improving the accessibility and quality of
data, ensuring that data are trustworthy, discoverable,
accessible, reusable, and fit their purpose.
21. Big Data Value Chain – Data Storage
Data storage is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data. Relational Database Management Systems (RDBMS) have long been the dominant solution, while NoSQL technologies are designed with scalability in mind and offer a wide range of solutions based on alternative data models.
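For contrast, a minimal relational example using Python's built-in sqlite3 module (the table and sample sensor readings are invented). NoSQL stores relax this fixed row-and-column schema in order to scale horizontally:

```python
import sqlite3

# A relational store: fixed schema, rows and columns, SQL access
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [("s1", 20.5), ("s1", 21.0), ("s2", 19.8)])

# Fast, structured access via a declarative query
for sensor, avg in con.execute(
        "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"):
    print(sensor, avg)
```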
22. Big Data Value Chain – Data Usage
Covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data
analysis within the business activity. Data usage in business
decision-making can enhance competitiveness through
reduction of costs, increased added value, or any other
parameter that can be measured against existing performance
criteria.
23. Project Phases
Discover / Acquisition → Prepare → Plan → Model → Operationalize → Communicate Results
24. Basic Concepts of Big Data
Big Data is characterized along five dimensions (the 5 V's) – Volume, Variety, Velocity, Value, Veracity:
• Volume – the amount of data being generated
• Velocity – the speed at which the data is generated
• Variety – the diversity of data types
• Value – the worth of the data
• Veracity – the quality, accuracy, or trustworthiness of the data
25. Big Data – Impact of the 3 V's
Volume (amount of data): dealing with large scales of data within data processing (e.g., global supply chains, global financial analysis, the Large Hadron Collider).
Velocity (speed of data): dealing with high-frequency streams of incoming real-time data (e.g., sensors, pervasive environments, electronic trading, the Internet of Things).
Variety (range of data types and sources): dealing with data in differing syntactic formats (e.g., spreadsheets, XML, DBMS), schemas, and meanings (e.g., enterprise data integration).
26. Big Data Processing
The general categories of activities involved with big data processing are:
Ingesting data into the system
Persisting the data in storage
Computing and Analyzing data
Visualizing the results
27. Sources of Big Data
Categories:
• from human activities
• from the physical world
• from computers
Examples: Internet data (emails, social media, and weblogs), network data, mobile networks or telecoms, machine-to-machine data or the IoT (sensor data), online transactions, medical records, and open data (mostly published by governments). Such data may be unstructured (text, audio, video) or semi-structured (emails, tweets, weblogs).
28. Data Analytics
Step 1: Determine the criteria for grouping the data
Step 2: Collect the data
Step 3: Organize the data
Step 4: Clean the data
Step 5: Analyze the data and derive insights
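The five steps can be traced in a short standard-library Python sketch. The transaction records and the "category" grouping criterion are invented for illustration:

```python
from collections import defaultdict

# Step 1: grouping criterion – transactions grouped by product category
records = [
    {"category": "books", "price": 12.0},
    {"category": "games", "price": 40.0},
    {"category": "books", "price": "9.5"},  # dirty value: price stored as text
    {"category": "books", "price": None},   # dirty value: missing price
]

# Steps 2-3: collect and organize the records under each group key
groups = defaultdict(list)
for rec in records:
    groups[rec["category"]].append(rec["price"])

# Step 4: clean – drop missing values, coerce text numbers to floats
cleaned = {cat: [float(p) for p in prices if p is not None]
           for cat, prices in groups.items()}

# Step 5: analyze – average spend per category
for cat, prices in cleaned.items():
    print(cat, sum(prices) / len(prices))
```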
29. Big Data Analytics
Big data analytics helps organizations
harness their data and use it to identify
new opportunities. That, in turn, leads
to smarter business moves, more
efficient operations, higher profits and
happier customers.
30. Big Data Analytics - Techniques
Data Mining / Analytics
Web Mining
Text Mining / Analytics
Predictive Analytics
Visual Analytics
Machine Learning / AI / Deep Learning
Mobile Analytics
Crowdsourcing
31. Big Data Analytics - Tools
Hadoop
• For distributed storage of large datasets on computer clusters
• Designed to process large amounts of structured and unstructured data
• Provides large amounts of storage for all sorts of data, along with the ability to handle virtually limitless concurrent tasks
MapReduce
• Google-originated technology for processing massive amounts of data
• A software framework that enables developers to write programs that process large amounts of unstructured data
• It has two components:
Map, which distributes the input data across cluster nodes for parallel processing
Reduce, which collects all the sub-results to produce the final result
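The two-component model can be sketched in a few lines of single-process Python. This is a sketch of the programming model only, not of Hadoop itself: a hypothetical word count stands in for a real distributed job, and the shuffle step the framework would normally perform is done by hand:

```python
from collections import defaultdict
from itertools import chain

documents = ["big data needs big tools", "map reduce splits big jobs"]

# Map: turn each input chunk into intermediate (key, value) pairs;
# in a real cluster this runs on many nodes in parallel
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = chain.from_iterable(map_phase(d) for d in documents)

# Shuffle: group intermediate pairs by key (handled by the framework)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: collect all sub-results per key into the final result
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)  # e.g. {'big': 3, 'data': 1, ...}
```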
32. Big Data Analytics - Tools
NoSQL
• Used in Big Data applications in clustered environments
• Provides high-speed access to unstructured or semi-structured data
• Provides capabilities to query and retrieve unstructured and semi-structured data
MongoDB
• For managing data that is frequently changing or unstructured
• A flexible, highly scalable database designed for web applications
• Used to store data in mobile apps, product catalogs, and real-time applications
Other tools include Hive, Cassandra, Spark, Tableau, Talend, and cloud computing platforms.
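A minimal usage sketch with the pymongo driver, assuming a MongoDB server is running locally on the default port; the database, collection, and document contents are invented:

```python
from pymongo import MongoClient  # assumes the pymongo driver is installed

# Connect to a local MongoDB server (default port assumed)
client = MongoClient("mongodb://localhost:27017")
catalog = client["shop"]["products"]

# Documents need no predefined schema: fields can vary per record
catalog.insert_one({"name": "laptop", "price": 900, "tags": ["electronics"]})
catalog.insert_one({"name": "novel", "price": 15, "pages": 320})

# Query semi-structured data by any field
for doc in catalog.find({"price": {"$lt": 100}}):
    print(doc["name"])
```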
33. Big Data Analytics - Applications
Internet of Things (IoT), Smart Grid, Science, Healthcare, Nursing, Business, Industry, Manufacturing, Public Agencies
34. Big Data Analytics - Benefits
• Leads to better decisions and improved insights and predictions, which in turn bring greater operational efficiency, higher productivity, and reduced cost and risk
• Eliminates the biases people have when making decisions based on limited information
• Allows data analysis to be built into processes, enabling automated decision-making
• Helps reduce product return rates and produce higher-quality products
• Improves the overall profitability of the business
• Helps social media platforms and public and private agencies explore behavioral patterns of people
• Can potentially drive economic growth in the developing world
35. Big Data Challenges
Complexities
• Processing, storing, and transferring data at very large scale
• Filtering out useless information without discarding useful information
Privacy
• Risk of data leakage
• Privacy concerns continually arise from users who outsource their private data to cloud storage
Security
• Concerns over the impact that collecting, storing, and processing large amounts of data can have on security
• Security is a concern because of the variety and heterogeneity of Big Data
Data Migration
• Transferring Big Data for distributed processing and storage
Shortage of Human Resources
• Too few skilled data scientists to meet demand