DATA SCIENCE
NAME OF STAFF : S.JAMUNA
NAME OF THE STUDENT: J.YASMIN
STUDENT REG NO :CB17S250453
CLASS :III BCA ‘B’
BATCH :2017-2020
YEAR :2020
SUBJECT CODE :19SDS21/30
DATA SCIENCE
UNIT -1
S.VAISHNAVI
CHAPTER-1
DATA SCIENCE IN A BIG DATA WORLD
• Big data is a blanket term for any collection of data sets so large or complex that it becomes
difficult to process them using traditional data management techniques such as, for
example, the RDBMS (relational database management systems).
• Data science involves using methods to analyze massive amounts of data and extract the
knowledge it contains.
• Data science and big data evolved from statistics and traditional data management but are
now considered to be distinct disciplines.
• The characteristics of big data are often referred to as the three Vs:
■ Volume: How much data is there?
■ Variety: How diverse are different types of data?
■ Velocity: At what speed is new data generated?
• Often these characteristics are complemented with a fourth V, veracity:
How accurate is the data? These four properties make big data different
from the data found in traditional data management tools.
• Consequently, the challenges they bring can be felt in almost every aspect:
data capture, curation, storage, search, sharing, transfer, and visualization.
In addition, big data calls for specialized techniques to extract the insights.
• Data science is an evolutionary extension of statistics capable of dealing
with the massive amounts of data produced today.
BENEFITS AND USES OF DATA SCIENCE AND BIG DATA
• Data science and big data are used almost everywhere in both commercial and
noncommercial settings.
• The number of use cases is vast, and the examples we’ll provide throughout this book
only scratch the surface of the possibilities.
• Commercial companies in almost every industry use data science and big data to gain
insights into their customers, processes, staff, competition, and products.
• Many companies use data science to offer customers a better user experience, as well
as to cross-sell, up-sell, and personalize their offerings.
• Governmental organizations are also aware of data’s value, and many share their data
with the public. You can use this data to gain insights or build data-driven applications.
• Data.gov is but one example; it’s the home of the US Government’s open data.
• A data scientist in a governmental organization gets to work on diverse projects such as
detecting fraud and other criminal activity or optimizing project funding.
• A well-known example was provided by Edward Snowden, who leaked internal
documents of the American National Security Agency and the British Government
Communications Headquarters that show clearly how they used data science
and big data to monitor millions of individuals.
• These agencies then applied data science techniques to distill information from the
data they had collected.
FACETS OF DATA
• In data science and big data you’ll come across many different types of data, and each of
them tends to require different tools and techniques.
The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
STRUCTURED DATA
• Structured data is data that depends on a data model and resides in a fixed field within a
record.
• As such, it’s often easy to store structured data in tables within databases or Excel files.
• SQL, or Structured Query Language, is the preferred way to manage and query data that
resides in databases.
• You may also come across structured data that is hard to store in a
traditional relational database.
• Hierarchical data such as a family tree is one such example.
• The world isn’t made up of structured data, though; structure is imposed upon it by
humans and machines.
• More often, data comes unstructured.
An Excel table is an example of structured data.
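To make the idea of querying structured data with SQL concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and rows are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import sqlite3

# In-memory database; the "people" table and its rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Lucy", 41, "US"), ("Jack", 34, "US"), ("Maria", 29, "ES")],
)

# Every record has the same fixed fields, so a declarative query is enough.
rows = conn.execute(
    "SELECT name, age FROM people WHERE country = ? ORDER BY age", ("US",)
).fetchall()
conn.close()

print(rows)
```

Because each record has the same fixed fields, filtering and sorting need no custom parsing logic.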
UNSTRUCTURED DATA
• Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying.
• One example of unstructured data is your regular email. Although email contains
structured elements such as the sender, title, and body text, it’s a challenge to find the
number of people who have written an email complaint about a specific employee
because so many ways exist to refer to a person, for example. The thousands of different
languages and dialects out there further complicate this.
MACHINE-GENERATED DATA
• Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will continue to do so.
Wikibon has forecast that the market value of the industrial Internet (a term coined by
Frost & Sullivan to refer to the integration of complex physical machinery with
networked sensors and software) will be approximately $540 billion in 2020. IDC
(International Data Corporation) has estimated there will be 26 times more connected
things than people in 2020. This network is commonly referred to as the internet of things.
Example of machine-generated data
GRAPH-BASED OR NETWORK DATA:
• “Graph data” can be a confusing term because any data can be shown in a
graph.
• “Graph” in this case points to mathematical graph theory. In graph theory, a
graph is a mathematical structure to model pair-wise relationships between
objects.
• Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and store
graphical data.
Friends in a social network are an example of graph-based data. (Figure: people such as Myriam, Elizabeth, Jack, Lucy, Liam, Florin, Barack, Kim, Guy, William, John, and Maria shown as nodes of a network.)
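As a rough sketch of how nodes and edges can be represented in code, the example below stores a small friendship graph as an adjacency structure and looks up friends of friends. The specific friendships listed are hypothetical, since the edges of the original figure are not recoverable, and the helper names are illustrative.

```python
from collections import defaultdict

# Hypothetical friendships (edges); each pair connects two people (nodes).
edges = [("Myriam", "Jack"), ("Jack", "Lucy"), ("Lucy", "Liam"), ("Liam", "Florin")]

def build_graph(edge_list):
    """Build an undirected adjacency map: person -> set of direct friends."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def friends_of_friends(graph, person):
    """People exactly two steps away: friends of friends who are not direct friends."""
    direct = graph.get(person, set())
    reachable = set()
    for friend in direct:
        reachable |= graph.get(friend, set())
    return reachable - direct - {person}

graph = build_graph(edges)
fof = friends_of_friends(graph, "Jack")
print(fof)
```

The question asked here ("who is two steps away?") is about relationships between records rather than the fields of any single record, which is what sets graph data apart.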
AUDIO, IMAGE, AND VIDEO:
• Audio, image, and video are data types that pose specific challenges to a data
scientist.
• Tasks that are trivial for humans, such as recognizing objects in pictures, turn
out to be challenging for computers.
• MLBAM (Major League Baseball Advanced Media) announced in 2014 that
it would increase video capture to approximately 7 TB per game for the
purpose of live, in-game analytics.
• High-speed cameras at stadiums capture ball and athlete movements to
calculate in real time, for example, the path taken by a defender relative to
two baselines.
STREAMING DATA
• While streaming data can take almost any of the previous forms, it has an
extra property. The data flows into the system when an event happens instead
of being loaded into a data store in a batch.
• Although this isn’t really a different type of data, we treat it here as such
because you need to adapt your process to deal with this type of information.
• Examples are the “What’s trending” on Twitter, live sporting or music
events, and the stock market.
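The difference between batch and streaming processing can be sketched with a Python generator that updates a running mean as each event arrives, rather than waiting for the full data set. The price values are made up for illustration.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one update per incoming event."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

# Hypothetical stream of stock prices arriving one at a time.
prices = iter([10.0, 12.0, 11.0, 13.0])
means = list(running_mean(prices))
print(means)
```

Only a count and a running total are kept in memory, so the same code works whether the stream has four events or four billion.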
THE DATA SCIENCE PROCESS
The data science process typically consists of six steps, as you
can see in the mind map:
Data science process
1: Setting the research goal
2: Retrieving data
3: Data preparation
4: Data exploration
5: Data modeling
6: Presentation and automation
SETTING THE RESEARCH GOAL
• Data science is mostly applied in the context of an organization.
• When the business asks you to perform a data science project, you’ll first
prepare a project charter.
DATA PREPARATION
• Data collection is an error-prone process; in this phase you enhance the
quality of the data and prepare it for use in subsequent steps. This phase
consists of three subphases:
• Data cleansing removes false values from a data source and inconsistencies
across data sources.
• Data integration enriches data sources by combining information from
multiple data sources.
• Data transformation puts the data into a form that’s directly usable in
your models.
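The data preparation subphases can be sketched in plain Python as follows. The customer records, the spend lookup, the alias table, and the age limits are all hypothetical assumptions made for illustration; real projects would typically use a data-wrangling library.

```python
# Hypothetical raw records from one source, and spend totals from a second source.
customers = [
    {"id": 1, "name": "Ann", "age": 34, "country": "US"},
    {"id": 2, "name": "Bob", "age": -5, "country": "usa"},   # impossible age
    {"id": 3, "name": "Cy", "age": 51, "country": "U.S."},   # inconsistent spelling
]
orders = {1: 120.0, 3: 80.0}  # total spend keyed by customer id

COUNTRY_ALIASES = {"us": "US", "usa": "US", "u.s.": "US"}

def cleanse(rows):
    """Drop rows with impossible ages and normalize country spellings."""
    cleaned = []
    for row in rows:
        if not 0 <= row["age"] <= 120:
            continue
        fixed = dict(row)
        fixed["country"] = COUNTRY_ALIASES.get(row["country"].lower(), row["country"])
        cleaned.append(fixed)
    return cleaned

def integrate(rows, spend_by_id):
    """Enrich each row with information from the second data source."""
    return [dict(row, spend=spend_by_id.get(row["id"], 0.0)) for row in rows]

def transform(rows):
    """Turn records into numeric feature vectors a model could consume."""
    return [[row["age"], row["spend"]] for row in rows]

cleaned = cleanse(customers)
prepared = transform(integrate(cleaned, orders))
print(prepared)
```

Dropping the whole row for an impossible age is only one possible policy; replacing the value or flagging it for review may be more appropriate depending on the project.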
DATA EXPLORATION
• Data exploration is concerned with building a deeper understanding of
your data.
• You try to understand how variables interact with each other, the
distribution of the data, and whether there are outliers.
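As a small illustration of exploring a distribution and checking for outliers, the sketch below uses Python's statistics module together with the common 1.5 × IQR rule. The sample values and the choice of rule are assumptions for the example, not something prescribed by the slides.

```python
import statistics

# Hypothetical measurements; one value is suspiciously large.
values = [10, 12, 11, 13, 12, 11, 95]

median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the sample
iqr = q3 - q1

# Flag points more than 1.5 * IQR outside the quartiles (a rule of thumb).
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]

print(median, (q1, q3), outliers)
```

In practice you would usually pair numbers like these with plots, since graphical techniques make distributions and interactions easier to grasp.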
DATA MODELING OR MODEL BUILDING
• In this phase you use models, domain knowledge, and insights about the
data you found in the previous steps to answer the research question.
Finally, you present the results to your business.
• These results can take many forms, ranging from presentations to research
reports. Sometimes you’ll need to automate the execution of the process
because the business will want to use the insights you gained in another
project or enable an operational process to use the outcome from your
model.
CHAPTER-2
THE DATA SCIENCE PROCESS
OVERVIEW OF THE DATA SCIENCE PROCESS
• Following a structured approach to data science helps you to maximize your
chances of success in a data science project at the lowest cost. It also makes it
possible to take up a project as a team, with each team member focusing on what
they do best.
The six steps of the data science process
1. The first step of this process is setting a research goal. The
main purpose here is making sure all the stakeholders
understand the what, how, and why of the project.
• In every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data
available for analysis, so this step includes finding suitable
data and getting access to the data from the data owner.
• The result is data in its raw form, which probably needs
polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it.
This includes transforming the data from a raw form into data
that’s directly usable in your models.
4. The fourth step is data exploration. The goal of this step
is to gain a deep understanding of the data.
5. Finally, we get to the sexiest part: model building (often
referred to as “data modeling” throughout this book).
6. The last step of the data science process is presenting your
results and automating the analysis, if needed. One goal of
a project is to change a process and/or make better decisions.
STEP 1: DEFINING RESEARCH GOALS AND CREATING A PROJECT CHARTER
STEP 2: RETRIEVING DATA
• The next step in data science is to retrieve the
required data (figure 2.3). Sometimes you need to go
into the field and design a data collection process
yourself, but most of the time you won’t be involved
in this step.
STEP 3: CLEANSING, INTEGRATING, AND TRANSFORMING DATA
• The data received from the data retrieval phase is
likely to be “a diamond in the rough.” Your task now
is to sanitize and prepare it for use in the modeling and
reporting phase.
STEP 4: EXPLORATORY DATA ANALYSIS
• During exploratory data analysis you take a deep dive into
the data (see figure 2.14).
• Information becomes much easier to grasp when shown in
a picture; therefore, you mainly use graphical techniques to
gain an understanding of your data and the interactions
between variables.
STEP 5: BUILD THE MODELS
• With clean data in place and a good understanding of the
content, you’re ready to build models with the goal of
making better predictions, classifying objects, or gaining
an understanding of the system that you’re modeling.
STEP 6: PRESENTING FINDINGS AND BUILDING APPLICATIONS ON TOP OF THEM
• After you’ve successfully analyzed the data and built a
well-performing model, you’re ready to present your
findings to the world (figure 2.28).
• This is an exciting part; all your hours of hard work
have paid off and you can explain what you found to the
stakeholders.
CHAPTER-3
MACHINE LEARNING
WHAT IS MACHINE LEARNING AND WHY SHOULD YOU CARE ABOUT IT?
“Machine learning is a field of study that gives computers the ability to
learn without being explicitly programmed.”
- Arthur Samuel, 1959
The definition of machine learning coined by Arthur Samuel is often
quoted and is genius in its broadness, but it leaves you with the question
of how the computer learns. To achieve machine learning, experts develop
general-purpose algorithms that can be used on large classes of learning
problems.
• When machine learning is seen as a process, the following
definition is insightful:
• “Machine learning is the process by which a computer can
work more accurately as it collects and learns from the data
it is given.”
- Mike Roberts
APPLICATIONS FOR MACHINE LEARNING IN DATA SCIENCE
• Regression and classification are of primary importance to a
data scientist. To achieve these goals, one of the main tools a
data scientist uses is machine learning. The uses for regression
and automatic classification are wide ranging, such as the
following:
■ Finding oil fields, gold mines, or archeological sites based on
existing sites (classification and regression)
■ Finding place names or persons in text (classification)
■ Identifying people based on pictures or voice recordings
(classification)
WHERE MACHINE LEARNING IS USED IN THE DATA
SCIENCE PROCESS
Although machine learning is mainly linked to the data-modeling
step of the data science process, it can be used at almost every step.
To refresh your memory from previous chapters, recall the six steps
of the data science process.
The data modeling phase can’t start until you have qualitative
raw data you can understand.
But prior to that, the data preparation phase can benefit from
the use of machine learning.
An example would be cleansing a list of text strings; machine
learning can group similar strings together so it becomes easier
to correct spelling errors.
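The string-grouping idea can be approximated with a very simple sketch. Note that this is not a learned model: it swaps in a plain similarity ratio from Python's difflib module and a greedy grouping pass. The city names, the 0.8 threshold, and the function name are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def group_similar(strings, threshold=0.8):
    """Greedily place each string in the first group whose representative
    (its first member) is similar enough; otherwise start a new group."""
    groups = []
    for s in strings:
        for group in groups:
            ratio = SequenceMatcher(None, s.lower(), group[0].lower()).ratio()
            if ratio >= threshold:
                group.append(s)
                break
        else:
            groups.append([s])
    return groups

# Hypothetical free-text entries containing spelling variants.
cities = ["Amsterdam", "Amsterdm", "Brussels", "Brusels", "amsterdam"]
groups = group_similar(cities)
print(groups)
```

Once variants sit in the same group, a person (or a rule such as "keep the most frequent spelling") can correct them all at once.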
PYTHON TOOLS USED IN MACHINE LEARNING
Python has an overwhelming number of packages that can be
used in a machine learning setting. The Python machine learning
ecosystem can be divided into three main types of packages.
PACKAGES FOR WORKING WITH DATA IN MEMORY
When prototyping, the following packages can get you
started by providing advanced functionalities with a few lines of code:
■ SciPy is the name of both a library and a wider ecosystem of
fundamental packages often used in scientific computing, such as
NumPy, matplotlib, Pandas, and SymPy.
■ NumPy gives you access to powerful array functions and
linear algebra functions.
■ Matplotlib is a popular 2D plotting package with some 3D
functionality.
■ Pandas is a high-performance, but easy-to-use, data-wrangling package.
It introduces data frames to Python, a type of in-memory data
table. It’s a concept that should sound familiar to regular users of R.
THE MODELING PROCESS
The modeling phase consists of four steps:
1 Feature engineering and model selection
2 Training the model
3 Model validation and selection
4 Applying the trained model to unseen data
Before you find a good model, you’ll probably iterate among the first three steps.
The last step isn’t always present because sometimes the goal
isn’t prediction but explanation (root cause analysis). For
instance, you might want to find out the causes of species’
extinctions but not necessarily predict which one is next in line to
leave our planet.
ENGINEERING FEATURES AND SELECTING A MODEL
With feature engineering, you must come up with and create possible
predictors for the model. This is one of the most important steps in the
process because a model recombines these features to achieve its
predictions. Often you may need to consult an expert or the appropriate
literature to come up with meaningful features.
Certain features are the variables you get from a data set, as is the case
with the provided data sets in our exercises and in most school exercises.
In practice you’ll need to find the features yourself, which may be
scattered among different data sets.
TRAINING YOUR MODEL
With the right predictors in place and a modeling
technique in mind, you can progress to model training. In
this phase you present to your model data from which it
can learn.
VALIDATING A MODEL
Data science has many modeling techniques, and the
question is which one is the right one to use. A good
model has two properties: it has good predictive power
and it generalizes well to data it hasn’t seen.
PREDICTING NEW OBSERVATIONS
If you’ve implemented the first three steps successfully,
you now have a performant model that generalizes to
unseen data. The process of applying your model to new
data is called model scoring.
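The four modeling steps can be sketched end to end with a tiny least-squares line fit in plain Python. The house records, the derived "area" feature, and the choice of a straight-line model are all hypothetical; the point is only to show where each step sits.

```python
def fit_line(xs, ys):
    """Train: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

def mean_squared_error(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical raw records: (width, length, price).
train = [(2, 3, 13), (3, 3, 19), (4, 3, 25), (5, 3, 31)]
holdout = [(2, 5, 21), (4, 4, 33)]  # kept aside, never used for training

# 1. Feature engineering: derive one predictor (area) from two raw variables.
def area(record):
    return record[0] * record[1]

# 2. Training on the training set only.
model = fit_line([area(r) for r in train], [r[2] for r in train])

# 3. Validation: measure the error on data the model has not seen.
holdout_mse = mean_squared_error(
    model, [area(r) for r in holdout], [r[2] for r in holdout]
)

# 4. Scoring: apply the trained model to a new observation.
new_price = predict(model, 20)
print(model, holdout_mse, new_price)
```

The hold-out error is what indicates whether the model generalizes; an error measured on the training data alone would say nothing about unseen observations.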
TYPES OF MACHINE LEARNING
Broadly speaking, we can divide the different approaches to
machine learning by the amount of human effort that’s required to
coordinate them and how they use labeled data, that is, data with a
category or a real-value number assigned to it that represents the
outcome of previous observations.
■ Supervised learning techniques attempt to discern results and
learn by trying to find patterns in a labeled data set. Human
interaction is required to label the data.
■ Unsupervised learning techniques don’t rely on labeled data
and attempt to find patterns in a data set without human
interaction.
■ Semi-supervised learning techniques need labeled data, and
therefore human interaction, to find patterns in the data set, but
they can still progress toward a result and learn even if passed
unlabeled data as well.
• Blurry grayscale representation of the number 0 with its corresponding matrix.
• A simple Captcha control can be used to prevent automated spam being sent
through an online web form.
UNSUPERVISED LEARNING
It’s generally true that most large data sets don’t have labels
on their data, so unless you sort through it all and give it labels,
the supervised learning approach to data won’t work. Instead, we must
take an approach that works with unlabeled data, because:
■ We can study the distribution of the data and infer truths
about the data in different parts of the distribution.
■ We can study the structure and values in the data and infer
new, more meaningful data and structure from it.
• k-means is a good general-purpose algorithm with which to get
started. However, like many clustering algorithms, it requires you to
specify the number of desired clusters in advance, which necessarily
results in a process of trial and error before reaching a
decent conclusion.
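A minimal sketch of k-means on one-dimensional data shows why the number of clusters k has to be supplied up front. The data points and the naive initialization (using the first k points as starting centroids) are assumptions for illustration; library implementations use more careful initialization.

```python
def kmeans_1d(points, k, max_iter=100):
    """Lloyd's algorithm on 1-D data: alternate assignment and centroid update."""
    centroids = list(points[:k])  # naive initialization (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical unlabeled measurements with two visible groups.
data = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(data, k=2)
print(centroids, clusters)
```

Running the same function with a different k gives a different partition of the same data, which is where the trial and error comes from.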
SEMI-SUPERVISED LEARNING
• It shouldn’t surprise you to learn that while we’d like all our data to be
labeled so we can use the more powerful supervised machine learning
techniques, in reality we often start with only minimally labeled data, if
it’s labeled at all.
• We can use our unsupervised machine learning techniques to analyze
what we have and perhaps add labels to the data set, but it will be
prohibitively costly to label it all. Our goal then is to train our predictor
models with as little labeled data as possible.
• A common semi-supervised learning technique is label
propagation. In this technique, you start with a labeled
data set and give the same label to similar data points.
• This is similar to running a clustering algorithm over the
data set and labeling each cluster based on the labels it
contains.
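The idea behind label propagation can be sketched as follows: a few seed labels spread to nearby unlabeled points until nothing changes. This is a simplified nearest-neighbor variant written for illustration, not the algorithm of any particular library; the points, the seed labels, and the radius are hypothetical.

```python
def propagate_labels(points, seed_labels, radius):
    """Repeatedly give each unlabeled point the label of its nearest
    already-labeled point, provided that point lies within `radius`."""
    labels = dict(seed_labels)  # index -> label
    changed = True
    while changed:
        changed = False
        for i, x in enumerate(points):
            if i in labels:
                continue
            near = [
                (abs(x - points[j]), j)
                for j in labels
                if abs(x - points[j]) <= radius
            ]
            if near:
                _, j = min(near)
                labels[i] = labels[j]
                changed = True
    return [labels.get(i) for i in range(len(points))]

# Hypothetical 1-D data: only two of the six points start out labeled.
points = [1.0, 1.2, 1.4, 5.0, 5.2, 5.4]
seeds = {0: "a", 3: "b"}
result = propagate_labels(points, seeds, radius=0.25)
print(result)
```

Two labeled points are enough to label all six here because each group is tightly packed; points that are not within the radius of any labeled point would stay unlabeled (None).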
Thank you

General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 

Data science unit1

  • 1. DATA SCIENCE NAME OF STAFF : S.JAMUNA NAME OF THE STUDENT: J.YASMIN STUDENT REG NO :CB17S250453 CLASS :III BCA ‘B’ BATCH :2017-2020 YEAR :2020 SUBJECT CODE :19SDS21/30
  • 3. CHAPTER-1 DATA SCIENCE IN A BIG DATA WORLD • Big data is a blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as, for example, the RDBMS (relational database management systems). • Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains. • Data science and big data evolved from statistics and traditional data management but are now considered to be distinct disciplines. • The characteristics of big data are often referred to as the three Vs: • Volume: How much data is there? • Variety: How diverse are different types of data? • Velocity: At what speed is new data generated?
  • 4. • Often these characteristics are complemented with a fourth V, veracity: How accurate is the data? These four properties make big data different from the data found in traditional data management tools. • Consequently, the challenges they bring can be felt in almost every aspect: data capture, curation, storage, search, sharing, transfer, and visualization. In addition, big data calls for specialized techniques to extract the insights. • Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data produced today.
  • 5. BENEFITS AND USES OF DATA SCIENCE AND BIG DATA • Data science and big data are used almost everywhere in both commercial and noncommercial settings. • The number of use cases is vast, and the examples we’ll provide throughout this book only scratch the surface of the possibilities. • Commercial companies in almost every industry use data science and big data to gain insights into their customers, processes, staff, competition, and products. • Many companies use data science to offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their offerings.
  • 6. • You can use this data to gain insights or build data-driven applications. • Data.gov is but one example; it’s the home of the US Government’s open data. • A data scientist in a governmental organization gets to work on diverse projects such as detecting fraud and other criminal activity or optimizing project funding. • A well-known example was provided by Edward Snowden, who leaked internal documents of the American National Security Agency and the British Government Communications Headquarters that show clearly how they used data science and big data to monitor millions of individuals. • Then they applied data science techniques to distill information.
  • 7. FACETS OF DATA • In data science and big data you’ll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these: ■ Structured ■ Unstructured ■ Natural language ■ Machine-generated ■ Graph-based ■ Audio, video, and images ■ Streaming
  • 8. STRUCTURED DATA • Structured data is data that depends on a data model and resides in a fixed field within a record. • As such, it’s often easy to store structured data in tables within databases or Excel files. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases. • You may also come across structured data that is hard to store in a traditional relational database. • Hierarchical data such as a family tree is one such example. • The world isn’t made up of structured data, though; structure is imposed upon it by humans and machines. • More often, data comes unstructured.
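As a small illustration of structured data being queried with SQL, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name `people`, its columns, and the rows are made up for this example and do not come from the slides.

```python
import sqlite3

# In-memory database; the table, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Asha", 31, "Chennai"), ("Ben", 45, "Leeds"), ("Chen", 28, "Chennai")],
)

# Every record has the same fixed fields, so a declarative query is enough
# to filter and sort the data.
rows = conn.execute(
    "SELECT name, age FROM people WHERE city = ? ORDER BY age", ("Chennai",)
).fetchall()
print(rows)
conn.close()
```

Because each record follows the same data model, the query can refer to fields by name; this is the property that unstructured data lacks.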
  • 9. An Excel table is an example of structured data.
  • 10. UNSTRUCTURED DATA • Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying. • One example of unstructured data is your regular email. Although email contains structured elements such as the sender, title, and body text, it’s a challenge to find the number of people who have written an email complaint about a specific employee because so many ways exist to refer to a person, for example. The thousands of different languages and dialects out there further complicate this.
  • 11. MACHINE-GENERATED DATA • Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention. • Machine-generated data is becoming a major data resource and will continue to be one. Wikibon has forecast that the market value of the industrial Internet (a term coined by Frost & Sullivan to refer to the integration of complex physical machinery with networked sensors and software) will be approximately $540 billion in 2020. IDC (International Data Corporation) has estimated there will be 26 times more connected things than people in 2020. This network is commonly referred to as the internet of things.
  • 13. GRAPH-BASED OR NETWORK DATA: • “Graph data” can be a confusing term because any data can be shown in a graph. • “Graph” in this case points to mathematical graph theory. In graph theory, a graph is a mathematical structure to model pair-wise relationships between objects. • Graph or network data is, in short, data that focuses on the relationship or adjacency of objects. • The graph structures use nodes, edges, and properties to represent and store graphical data.
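The nodes, edges, and properties mentioned above can be sketched with plain Python dictionaries. This is only an illustration under assumed data: the names and the `since` property are invented, and a real project would more likely use a graph database or a graph library.

```python
from collections import defaultdict

# Hypothetical relationships: each pair is an edge between two nodes.
edges = [("ana", "bo"), ("bo", "cy"), ("ana", "cy"), ("cy", "di")]

# Adjacency list: node -> set of neighbouring nodes (undirected graph).
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

# Edge properties can be stored alongside, keyed by the node pair.
edge_properties = {("ana", "bo"): {"since": 2019}}


def degree(node):
    """Number of direct connections of a node."""
    return len(adjacency[node])


# The data focuses on relationships, so questions are about adjacency,
# for example: which node has the most direct connections?
most_connected = max(adjacency, key=degree)
print(most_connected, degree(most_connected))
```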
  • 15. AUDIO, IMAGE, AND VIDEO: • Audio, image, and video are data types that pose specific challenges to a data scientist. • Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers. • MLBAM (Major League Baseball Advanced Media) announced in 2014 that they’ll increase video capture to approximately 7 TB per game for the purpose of live, in-game analytics. • High-speed cameras at stadiums will capture ball and athlete movements to calculate in real time, for example, the path taken by a defender relative to two baselines.
  • 16. STREAMING DATA • While streaming data can take almost any of the previous forms, it has an extra property. The data flows into the system when an event happens instead of being loaded into a data store in a batch. Although this isn’t really a different type of data, we treat it here as such because you need to adapt your process to deal with this type of information. • Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
  • 17. THE DATA SCIENCE PROCESS The data science process typically consists of six steps, as you can see in the mind map. Data science process: 1: Setting the research goal 2: Retrieving data 3: Data preparation 4: Data exploration 5: Data modeling 6: Presentation and automation
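To show what adapting your process to streaming data can look like, here is a minimal sketch that updates a statistic one event at a time instead of loading a full batch first. The price values are invented; in practice the events would arrive from a socket, message queue, or similar source.

```python
def running_mean(events):
    """Yield the mean of all values seen so far, one event at a time."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        # Only two numbers are kept in memory, never the whole history.
        yield total / count


# A hypothetical stream of stock prices, simulated with an iterator.
prices = iter([10.0, 12.0, 11.0, 15.0])
means = list(running_mean(prices))
print(means)
```

The generator never needs the complete data set, which is the key difference from batch processing.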
  • 18. SETTING THE RESEARCH GOAL • Data science is mostly applied in the context of an organization. • When the business asks you to perform a data science project, you’ll first prepare a project charter. DATA PREPARATION • Data collection is an error-prone process; in this phase you enhance the quality of the data and prepare it for use in subsequent steps. This phase consists of three subphases: • Data cleansing removes false values from a data source and inconsistencies across data sources, data integration enriches data sources by combining information
  • 19. DATA EXPLORATION • Data exploration is concerned with building a deeper understanding of your data. • You try to understand how variables interact with each other, the distribution of the data, and whether there are outliers. DATA MODELING OR MODEL BUILDING • In this phase you use models, domain knowledge, and insights about the data you found in the previous steps to answer the research question. Finally, you present the results to your business. • These results can take many forms, ranging from presentations to research reports. Sometimes you’ll need to automate the execution of the process because the business will want to use the insights you gained in another project or enable an operational process to use the outcome from your model.
  • 20. CHAPTER-2 THE DATA SCIENCE PROCESS OVERVIEW OF THE DATA SCIENCE PROCESS • Following a structured approach to data science helps you to maximize your chances of success in a data science project at the lowest cost. It also makes it possible to take up a project as a team, with each team member focusing on what they do best.
  • 21. The six steps of the data science process
  • 22. 1. The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project. • In every serious project this will result in a project charter. 2. The second phase is data retrieval. You want to have data available for analysis, so this step includes finding suitable data and getting access to the data from the data owner. • The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.
  • 23. 3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw form into data that’s directly usable in your models. 4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data. 5. Finally, we get to the sexiest part: model building (often referred to as “data modeling” throughout this book). 6. The last step of the data science model is presenting your results and automating the analysis, if needed. One goal of a project is to change a process and/or make better decisions.
  • 24. STEP 1: DEFINING RESEARCH GOALS AND CREATING A PROJECT CHARTER
  • 25. STEP 2: RETRIEVING DATA • The next step in data science is to retrieve the required data (figure 2.3). Sometimes you need to go into the field and design a data collection process yourself, but most of the time you won’t be involved in this step.
  • 26. STEP 3: CLEANSING, INTEGRATING, AND TRANSFORMING DATA • The data received from the data retrieval phase is likely to be “a diamond in the rough.” Your task now is to sanitize and prepare it for use in the modeling and reporting phase.
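The three activities named in this step can be sketched on a toy data set. Everything here is an assumption made for illustration: the records, the field names, and the rule that a plausible age lies between 0 and 120.

```python
# Hypothetical raw records from two sources.
customers = [
    {"id": 1, "name": " Alice ", "age": "34"},
    {"id": 2, "name": "bob", "age": "-5"},     # impossible age
    {"id": 3, "name": "Carol", "age": "n/a"},  # missing value
]
orders = {1: 250.0, 3: 80.0}  # customer id -> total order value


def clean_age(raw):
    """Return the age as an int, or None if it is missing or implausible."""
    try:
        age = int(raw)
    except ValueError:
        return None
    return age if 0 <= age <= 120 else None


prepared = []
for row in customers:
    prepared.append({
        "id": row["id"],
        "name": row["name"].strip().title(),        # cleansing
        "age": clean_age(row["age"]),               # cleansing
        "order_total": orders.get(row["id"], 0.0),  # integration
    })

# Transformation: derive a model-ready flag from the integrated data.
for row in prepared:
    row["has_ordered"] = row["order_total"] > 0

print(prepared)
```

Marking bad values as `None` rather than silently dropping rows is one possible choice; whether to drop, impute, or flag depends on the project.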
  • 27. STEP 4: EXPLORATORY DATA ANALYSIS • During exploratory data analysis you take a deep dive into the data (see figure 2.14). • Information becomes much easier to grasp when shown in a picture; therefore you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
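A first exploratory pass can be sketched with Python's standard `statistics` module. The measurements are invented, the 1.5 × IQR rule is one common convention for flagging outliers rather than something prescribed by the slides, and the text bars are only a stand-in for a real plot.

```python
import statistics

# Hypothetical measurements; 95 looks suspicious.
values = [12, 14, 15, 15, 16, 17, 18, 95]

mean = statistics.mean(values)
median = statistics.median(values)

# Quartiles and the interquartile range describe the spread of the data.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Flag points far outside the middle half of the data (1.5 * IQR rule).
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

# A crude text "bar chart" as a substitute for a graphical technique.
for v in values:
    print(f"{v:>3} " + "#" * (v // 5))

print("mean:", mean, "median:", median, "outliers:", outliers)
```

The gap between the mean and the median is itself a hint that an extreme value is pulling the mean upward.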
  • 28. STEP 5: BUILD THE MODELS • With clean data in place and a good understanding of the content, you’re ready to build models with the goal of making better predictions, classifying objects, or gaining an understanding of the system that you’re modeling.
  • 29. STEP 6: PRESENTING FINDINGS AND BUILDING APPLICATIONS ON TOP OF THEM • After you’ve successfully analyzed the data and built a well-performing model, you’re ready to present your findings to the world (figure 2.28). • This is an exciting part; all your hours of hard work have paid off and you can explain what you found to the stakeholders.
  • 30. Chapter-3 Machine learning WHAT IS MACHINE LEARNING AND WHY SHOULD YOU CARE ABOUT IT? “Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959) The definition of machine learning coined by Arthur Samuel is often quoted and is genius in its broadness, but it leaves you with the question of how the computer learns. To achieve machine learning, experts develop general-purpose algorithms that can be used on large classes of learning problems.
  • 31. • When machine learning is seen as a process, the following definition is insightful: • “Machine learning is the process by which a computer can work more accurately as it collects and learns from the data it is given.” (Mike Roberts) APPLICATIONS FOR MACHINE LEARNING IN DATA SCIENCE • Regression and classification are of primary importance to a data scientist. To achieve these goals, one of the main tools a data scientist uses is machine learning. The uses for regression and automatic classification are wide ranging, such as the following: ■ Finding oil fields, gold mines, or archeological sites based on existing sites (classification and regression) ■ Finding place names or persons in text (classification) ■ Identifying people based on pictures or voice recordings (classification)
  • 32. WHERE MACHINE LEARNING IS USED IN THE DATA SCIENCE PROCESS Although machine learning is mainly linked to the data-modeling step of the data science process, it can be used at almost every step. To refresh your memory from previous chapters, the data science process
  • 33. The data modeling phase can’t start until you have qualitative raw data you can understand. But prior to that, the data preparation phase can benefit from the use of machine learning. An example would be cleansing a list of text strings; machine learning can group similar strings together so it becomes easier to correct spelling errors. PYTHON TOOLS USED IN MACHINE LEARNING Python has an overwhelming number of packages that can be used in a machine learning setting. The Python machine learning ecosystem can be divided into three main types of packages.
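The idea of grouping similar strings so that spelling errors become easier to correct can be sketched as follows. Note that this uses a plain string-similarity heuristic from the standard `difflib` module as a stand-in for a learned clustering model; the city names and the 0.8 threshold are arbitrary assumptions.

```python
import difflib

# Hypothetical list of city names containing spelling variants.
raw = ["Chennai", "Chenai", "Mumbai", "Mumbay", "Delhi", "chennai"]

groups = []  # each group is a list of strings judged to be similar
for name in raw:
    for group in groups:
        # Compare against the group's first member (its representative).
        ratio = difflib.SequenceMatcher(
            None, name.lower(), group[0].lower()
        ).ratio()
        if ratio >= 0.8:  # threshold chosen arbitrarily for this sketch
            group.append(name)
            break
    else:
        # No existing group was close enough, so start a new one.
        groups.append([name])

print(groups)
```

Once variants sit in the same group, a reviewer (or a simple rule such as "most frequent spelling wins") can pick the canonical form for each group.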
  • 34.
  • 35. PACKAGES FOR WORKING WITH DATA IN MEMORY When prototyping, the following packages can get you started by providing advanced functionalities with a few lines of code: ■ SciPy is a library that integrates fundamental packages often used in scientific computing such as NumPy, matplotlib, Pandas, and SymPy. ■ NumPy gives you access to powerful array functions and linear algebra functions. ■ Matplotlib is a popular 2D plotting package with some 3D functionality. ■ Pandas is a high-performance, but easy-to-use, data-wrangling package. It introduces data frames to Python, a type of in-memory data table. It’s a concept that should sound familiar to regular users of R.
  • 36. THE MODELING PROCESS The modeling phase consists of four steps: 1 Feature engineering and model selection 2 Training the model 3 Model validation and selection 4 Applying the trained model to unseen data Before you find a good model, you’ll probably iterate among the first three steps. The last step isn’t always present because sometimes the goal isn’t prediction but explanation (root cause analysis). For instance, you might want to find out the causes of species’ extinctions but not necessarily predict which one is next in line to leave our planet.
  • 37. ENGINEERING FEATURES AND SELECTING A MODEL With engineering features, you must come up with and create possible predictors for the model. This is one of the most important steps in the process because a model recombines these features to achieve its predictions. Often you may need to consult an expert or the appropriate literature to come up with meaningful features. Certain features are the variables you get from a data set, as is the case with the provided data sets in our exercises and in most school exercises. In practice you’ll need to find the features yourself, which may be scattered among different data sets.
  • 38. TRAINING YOUR MODEL With the right predictors in place and a modeling technique in mind, you can progress to model training. In this phase you present to your model data from which it can learn. VALIDATING A MODEL Data science has many modeling techniques, and the question is which one is the right one to use. A good model has two properties: it has good predictive power and it generalizes well to data it hasn’t seen. PREDICTING NEW OBSERVATIONS If you’ve implemented the first three steps successfully, you now have a performant model that generalizes to unseen data. The process of applying your model to new data is called model scoring.
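The training, validation, and scoring steps can be sketched end to end with a one-variable least-squares line written in plain Python. The data set (hours studied versus exam score), the hold-out rule, and the choice of mean absolute error as the validation measure are all assumptions made for this illustration.

```python
# Hypothetical data: x = hours studied, y = exam score.
data = [(1, 55), (2, 61), (3, 64), (4, 70), (5, 76), (6, 79), (7, 85), (8, 91)]

# Hold out every fourth observation so validation uses unseen data.
train = [p for i, p in enumerate(data) if i % 4 != 3]
holdout = [p for i, p in enumerate(data) if i % 4 == 3]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


model = fit_line(train)                                      # training
errors = [abs(predict(model, x) - y) for x, y in holdout]
mean_abs_error = sum(errors) / len(errors)                   # validation
new_score = predict(model, 9)                                # model scoring

print(model, mean_abs_error, new_score)
```

A low error on the held-out points suggests the model generalizes; a low error on the training points alone would not show that.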
  • 39. TYPES OF MACHINE LEARNING Broadly speaking, we can divide the different approaches to machine learning by the amount of human effort that’s required to coordinate them and how they use labeled data, that is, data with a category or a real-value number assigned to it that represents the outcome of previous observations. ■ Supervised learning techniques attempt to discern results and learn by trying to find patterns in a labeled data set. Human interaction is required to label the data. ■ Unsupervised learning techniques don’t rely on labeled data and attempt to find patterns in a data set without human interaction. ■ Semi-supervised learning techniques need labeled data, and therefore human interaction, to find patterns in the data set, but they can still progress toward a result and learn even if passed unlabeled data as well.
  • 41. • Blurry grayscale representation of the number 0 with its corresponding matrix. • A simple Captcha control can be used to prevent automated spam being sent through an online web form.
  • 42. UNSUPERVISED LEARNING It’s generally true that most large data sets don’t have labels on their data, so unless you sort through it all and give it labels, the supervised learning approach to data won’t work. Instead, we must take the approach that will work with this data because ■ We can study the distribution of the data and infer truths about the data in different parts of the distribution. ■ We can study the structure and values in the data and infer new, more meaningful data and structure from it.
  • 43. • k-means is a good general-purpose algorithm with which to get started. However, like many clustering algorithms, it requires you to specify the number of desired clusters in advance, which necessarily results in a process of trial and error before reaching a decent conclusion.
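A minimal one-dimensional k-means (Lloyd's algorithm) shows both the method and the trial and error over the number of clusters. The data, the deterministic way of choosing starting centroids, and the helper name `inertia` are assumptions of this sketch; the function also assumes at least `k` points and at least one iteration.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns (centroids, clusters)."""
    ordered = sorted(points)
    # Deterministic start: pick evenly spaced points from the sorted data.
    step = max(1, len(ordered) // k)
    centroids = [float(c) for c in ordered[::step][:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


def inertia(centroids, clusters):
    """Within-cluster sum of squared distances; lower means tighter."""
    return sum((p - c) ** 2
               for c, cluster in zip(centroids, clusters) for p in cluster)


data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
for k in (2, 3):  # trial and error over the number of clusters
    centroids, clusters = kmeans_1d(data, k)
    print(k, centroids, inertia(centroids, clusters))
```

Comparing the inertia for several values of k is one common way to carry out the trial and error; the value drops sharply once k matches the natural grouping of the data.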
  • 44. SEMI-SUPERVISED LEARNING • It shouldn’t surprise you to learn that while we’d like all our data to be labeled so we can use the more powerful supervised machine learning techniques, in reality we often start with only minimally labeled data, if it’s labeled at all. • We can use our unsupervised machine learning techniques to analyze what we have and perhaps add labels to the data set, but it will be prohibitively costly to label it all. Our goal then is to train our predictor models with as little labeled data as possible.
  • 45. • A common semi-supervised learning technique is label propagation. In this technique, you start with a labeled data set and give the same label to similar data points. • This is similar to running a clustering algorithm over the data set and labeling each cluster based on the labels it contains.
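The idea of giving the same label to similar data points can be sketched with a simplified nearest-neighbour variant. This is not the graph-based label propagation algorithm found in machine learning libraries; the points, the two starting labels, and the `max_distance` threshold are all assumptions made for illustration.

```python
# Hypothetical 1-D points; only two of them carry a label.
points = [1.0, 1.5, 2.0, 2.6, 8.0, 8.4, 9.0]
labels = {1.0: "low", 9.0: "high"}  # point -> known label


def propagate(points, labels, max_distance=1.0):
    """Copy labels to unlabeled points lying close to a labeled point."""
    labels = dict(labels)  # work on a copy, leave the input untouched
    changed = True
    while changed:
        changed = False
        for p in points:
            if p in labels:
                continue
            # Labeled points that are similar enough to p.
            near = [q for q in labels if abs(p - q) <= max_distance]
            if near:
                closest = min(near, key=lambda q: abs(p - q))
                labels[p] = labels[closest]
                changed = True
    return labels


result = propagate(points, labels)
print([result[p] for p in points])
```

Newly labeled points can pass their label on in turn, so two labels are enough here to cover both groups of points.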