Introduction to Data Science.pptx

CS3352 - Foundations of Data Science
Introduction to Data Science
Dr.V.Anusuya
Associate Professor/IT
Ramco Institute of Technology
Rajapalayam
Introduction
• Data Science is the area of study which involves extracting
insights from vast amounts of data using various scientific
methods, algorithms and processes.
• Data Science is useful to discover hidden patterns from the
voluminous raw data.
• Data Science emerged from the evolution of mathematical
statistics, data analysis, and big data.
Contd.,
• Ingredients = Data
• Consider a chef who is starting out with a new dish.
• After the chef has determined what type of dish he would like to
make, he goes to the fridge to gather ingredients. If he doesn’t
have the necessary ingredients, he takes a trip to the store to
collect them.
• As data scientists, our data points are our ingredients. We may
have some on hand, but we still might have to collect more
through web scraping, SQL queries, etc.
Big Data
• Data Science focuses on processing the huge
volume of heterogeneous data known as big
data.
• Big data refers to the data sets that are large
and complex in nature and difficult to process
using traditional data processing application
software.
Contd.,
• The characteristics of big data are often referred
to as the four Vs:
• Volume: How much data is there?
• Variety: How diverse are the different types of
data?
• Velocity: At what speed is new data generated?
• Veracity: How accurate is the data?
• These four properties make big data different
from the data found in traditional data
management tools.
Contd.,
• The four Vs make capturing, cleaning,
preprocessing, storing, searching, sharing,
transferring, and visualizing data a complex task.
• Specialized techniques are required to extract
insights from this huge volume of data.
• The main things that set a data scientist apart
from a statistician are the ability to work with
big data and experience in machine learning,
computing, and algorithm building.
Big data
• Tools: Hadoop, Pig, Spark, R, Python, and Java.
• Python is a great language for data science
because it has many data science libraries
available, and it’s widely supported by
specialized software.
Contd.,
• Data Science is an interdisciplinary field that
extracts knowledge from structured or
unstructured data.
• It translates business problems and research
projects into practical solutions.
Benefits and uses of data science
• Data science and big data are used almost everywhere
in both commercial and noncommercial settings.
• Commercial companies in almost every industry use
data science and big data to gain insights into their
customers, processes, staff, competition, and products.
• Many companies use data science to offer customers a
better user experience, as well as to cross-sell, up-sell,
and personalize their offerings.
Benefits and uses of data science
• Governmental organizations are also aware of data’s value. Many
governmental organizations not only rely on internal data scientists to
discover valuable information, but also share their data with the
public.
• Nongovernmental organizations (NGOs) use it to raise money and
defend their causes.
• Universities use data science not only in their research but also to
enhance the study experience of their students. The rise of massive
open online courses (MOOCs) produces a lot of data, which allows
universities to study how this type of learning can complement
traditional classes.
Facets of data
• In data science and big data we’ll come across
many different types of data, and each of them
tends to require different tools and techniques. The
main categories of data are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Structured Data
• Structured data is data that depends on a
data model and resides in a fixed field within
a record.
• As a result, it is often easy to store structured
data in tables within databases or Excel files.
• SQL, or Structured Query Language, is the
preferred way to manage and query data
that resides in databases.
• Some structured data, such as hierarchical data
like a family tree, is harder to store in a
traditional relational database.
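A minimal sketch of querying structured data with SQL, using Python's built-in sqlite3 module. The `customers` table, its columns, and the rows are hypothetical and invented for illustration only.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Chennai"), ("Ravi", "Madurai"), ("Meena", "Chennai")],
)

# Every record has the same fixed fields, so SQL can group and count directly.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
print(rows)
conn.close()
```

Because each record follows the same data model, a single declarative query is enough to summarize the table.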
Example
Unstructured Data
• Unstructured data is data that isn’t easy to fit
into a data model because the content is
context-specific or varying.
• One example of unstructured data is your regular
email.
• Although email contains structured elements
such as the sender, title, and body text, the
message content itself is free text that is hard
to fit into a fixed data model.
• The thousands of different languages and
dialects out there further complicate this.
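The split between the structured and unstructured parts of an email can be sketched with Python's standard `email` package. The message below is hypothetical and invented for illustration only.

```python
from email import message_from_string

# Hypothetical message, invented for illustration only.
raw = (
    "From: alice@example.com\n"
    "Subject: Delivery problem\n"
    "\n"
    "Hi, my order still has not arrived. Could someone look into it?\n"
)

msg = message_from_string(raw)

# Structured elements: fixed, named fields that are easy to query.
sender = msg["From"]
subject = msg["Subject"]

# Unstructured element: free text whose meaning depends on context.
body = msg.get_payload()

print(sender, "|", subject)
print(body)
```

The headers can be read by name, whereas deciding what the body is actually about needs further text analysis.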
Example
Natural language
• Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques
and linguistics.
• The natural language processing community has had success in entity
recognition, speech recognition, summarization, text completion, and
sentiment analysis, but models trained in one domain don’t generalize well to
other domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of every
piece of text.
• Example: personal voice assistants such as Siri and Alexa recognize patterns
in speech and provide a useful response.
• Example: email filters that sort messages into Primary, Social, and Promotions.
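As a rough illustration of the email filter example, the sketch below sorts messages using fixed keyword lists. This is a deliberately naive assumption: the keyword sets and the `categorize` function are hypothetical, and real mail services rely on trained models rather than rules like these.

```python
import re

# Hypothetical keyword lists; real filters use trained models, not fixed rules.
KEYWORDS = {
    "Promotions": {"sale", "discount", "offer"},
    "Social": {"friend", "follower", "liked"},
}

def categorize(text):
    """Return the first category whose keywords appear in the text, else 'Primary'."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    for category, keywords in KEYWORDS.items():
        if words & keywords:
            return category
    return "Primary"

print(categorize("Big sale today: 50% discount on shoes"))
print(categorize("Ravi liked your photo"))
print(categorize("Minutes of the project meeting"))
```

A message such as "Sorry, no discount is available" would still be labelled Promotions, which hints at why natural language is hard to process with simple rules.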
Machine-generated data
• Machine-generated data is information that’s
automatically created by a computer, process,
application, or other machine without human
intervention.
• Machine-generated data is becoming a major
data resource and will continue to be one.
• The analysis of machine data relies on highly
scalable tools, due to its high volume and
speed. Examples of machine data are web
server logs, call detail records, network event
logs, and telemetry.
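A web server log is a typical example. The sketch below parses one line, assuming the widely used Common Log Format; the log line itself is hypothetical and invented for illustration.

```python
import re

# Hypothetical log line, assumed to follow the Common Log Format.
line = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Named groups pull out the fields a later analysis would use.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

match = pattern.match(line)
record = match.groupdict() if match else {}
print(record)
```

In practice the same parsing step runs over millions of such lines, which is why scalable tools are needed.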
Graph-based or Networked Data
• A graph is a mathematical structure to model
pair-wise relationships between objects.
• Graph or network data is, in short, data that
focuses on the relationship or adjacency of
objects.
• Graph structures use nodes, edges, and
properties to represent and store graph data.
Contd.,
• Graph-based data is a natural way to represent
social networks, and its structure allows you to
calculate specific metrics such as the influence
of a person and the shortest path between two
people.
• Examples:
• LinkedIn: business colleagues; Twitter:
follower lists.
• Facebook: edges connect people who are
friends.
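The two metrics mentioned above can be sketched on a small friendship graph. The people and connections are hypothetical, and using the number of direct connections (degree) as a measure of influence is a simplifying assumption; other influence measures exist.

```python
from collections import deque

# Hypothetical friendship graph: each person maps to a set of friends.
friends = {
    "Anu": {"Bala", "Chitra"},
    "Bala": {"Anu", "Deepa"},
    "Chitra": {"Anu", "Deepa"},
    "Deepa": {"Bala", "Chitra", "Esha"},
    "Esha": {"Deepa"},
}

def shortest_path_length(graph, start, goal):
    """Breadth-first search: number of edges on a shortest path, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Degree (number of direct connections) as a crude stand-in for influence.
degree = {person: len(links) for person, links in friends.items()}
most_connected = max(degree, key=degree.get)

print(shortest_path_length(friends, "Anu", "Esha"))
print(most_connected)
```

Here nodes are people and edges are friendships, so both questions reduce to simple traversals of the adjacency structure.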
Graph-based structure
• Graph databases store graph-based data and are queried with
specialized query languages such as SPARQL.
Audio, image, and video
• Audio, image, and video are data types that
pose specific challenges to a data scientist.
• Tasks that are trivial for humans, such as
recognizing objects in pictures, turn out to be
challenging for computers.
• MLBAM (Major League Baseball Advanced
Media) announced in 2014 that it would increase
video capture to approximately 7 TB per game
for the purpose of live, in-game analytics.
Contd.,
• Recently a company called DeepMind
succeeded at creating an algorithm that’s
capable of learning how to play video games.
• This algorithm takes the video screen as input
and learns to interpret everything via a
complex process of deep learning.
Streaming Data
• The data flows into the system when an event
happens instead of being loaded into a data
store in a batch.
• Examples are the “What’s trending” on
Twitter, live sporting or music events, and the
stock market.
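The difference from batch loading can be sketched with a generator that hands over one event at a time. The `event_stream` function and its hashtags are hypothetical stand-ins for a live feed.

```python
from collections import Counter

def event_stream():
    """Hypothetical source that yields hashtags one at a time, as events occur."""
    for tag in ["#cricket", "#music", "#cricket", "#stocks", "#cricket", "#music"]:
        yield tag

counts = Counter()
for tag in event_stream():
    counts[tag] += 1  # update the running state as each event arrives
    trending = counts.most_common(1)[0][0]  # current leader, available at any moment

print(trending, counts[trending])
```

The running count is updated per event, so a "trending" answer is available at every step rather than only after the whole data set has been loaded.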