1. DS101:
Introduction to AI and DS
Lecture 1: Introduction to Data Science
Dr. Sudheer
hsudheer@ifheindia.org
2. 2
Course Code   Course Title                                               L P U
DS101         Introduction to Data Science and Artificial Intelligence   3 0 3
Team of Instructors: 1. Ms. Sathya AR 2. Ms. P. Rohini 3. Dr. H. Sudheer 4. Dr. P. Sirisha
Course Objectives:
1. To expose students to the fundamental concepts of data science and their
implementation using Python programming.
2. To introduce the mathematical foundations required for data science.
3. To explore various data pre-processing techniques.
4. To summarize the aspects of exploratory data analysis (EDA): uses of EDA, the role of metadata in EDA, and data
transformations identified through EDA.
5. To understand AI approaches in data science.
3. 3
Textbook(s)
T1 Cathy O’Neil and Rachel Schutt, “Doing Data Science: Straight Talk from the Frontline”, O’Reilly, 2014.
T2 Stuart Russell and Peter Norvig, “Artificial Intelligence: A Modern Approach”, 3rd Edition, Pearson Education, 2010, ISBN-13: 978-0-13-604259-4.
Reference Book(s)
R1 Jake VanderPlas, “Python Data Science Handbook: Essential Tools for Working with Data”, O’Reilly, 2017.
R2 Joel Grus, “Data Science from Scratch: First Principles with Python”, O’Reilly, 2019.
R3 Field Cady, “The Data Science Handbook”, Wiley, 2017.
R4 Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques”, Third Edition, 2011, ISBN 0123814790.
Online Resources
R5 https://onlinecourses.nptel.ac.in/noc22_cs72/preview
R6 https://www.udemy.com/course/complete-python-bootcamp/
R7 https://lms.simplilearn.com/courses/4227/Introduction-to-Data-Science/syllabus
4. 4
“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has
to be changed into gas, plastic, chemicals, etc to create a valuable entity that
drives profitable activity; so must data be broken down, analyzed for it to have
value.” — Clive Humby, 2006
7. Data science is the science that uses computer science, statistics,
machine learning, visualization, and human-computer interaction
to collect, clean, integrate, analyze, visualize, and interact
with data to create data products.
Data science is an interdisciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from noisy, structured and unstructured data, and to apply knowledge from
data across a broad range of application domains. Data science is related
to data mining, machine learning and big data.
Source: Wikipedia
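As a rough illustration of the collect, clean, integrate, and analyze steps named in the definitions above, the following Python sketch runs them on a tiny made-up data set. The records, field names, and helper functions (clean, integrate, analyze) are hypothetical and chosen only for illustration; a real project would typically use a library such as pandas.

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from two separate sources.
raw_sales = [
    {"customer": " alice ", "amount": "120.50"},
    {"customer": "BOB", "amount": "n/a"},       # unusable amount
    {"customer": "Alice", "amount": "79.50"},
    {"customer": "carol", "amount": "200"},
]
raw_regions = {"alice": "North", "bob": "South", "carol": "North"}


def clean(records):
    """Normalize customer names and drop rows whose amount is not a number."""
    cleaned = []
    for row in records:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append({"customer": row["customer"].strip().lower(), "amount": amount})
    return cleaned


def integrate(records, regions):
    """Attach a region to each record by joining on the customer name."""
    return [{**row, "region": regions.get(row["customer"], "unknown")} for row in records]


def analyze(records):
    """Compute the mean amount per region."""
    by_region = {}
    for row in records:
        by_region.setdefault(row["region"], []).append(row["amount"])
    return {region: mean(values) for region, values in by_region.items()}


summary = analyze(integrate(clean(raw_sales), raw_regions))
print(summary)
```

The row with the amount "n/a" is dropped during cleaning, so only the remaining three rows contribute to the per-region mean.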
9. “Big Data” Sources
Every:
• Click
• Ad impression
• Billing event
• Fast forward, pause, …
• Server request
• Transaction
• Network message
• Fault
• …
User Generated (Web & Mobile)
Internet of Things / M2M
Health/Scientific Computing
It’s All Happening Online
12. The Current Landscape (with a Little History)
Data science is a broad field that refers to the collective
processes, theories, concepts, tools and technologies that
enable the review, analysis and extraction of valuable
knowledge and information from raw data.
Source: Techopedia
Drew Conway’s Venn diagram of data science
13. Rise of the Data Scientist
Skills of data geeks:
Statistics – traditional analysis you’re used to thinking about
Data Munging – parsing, scraping, and formatting data
Visualization – graphs, tools, etc.
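Data munging, as listed above, is mostly parsing and reformatting raw input. A minimal sketch follows, assuming a made-up log line format; the format, the regular expression, and the field names are hypothetical and not taken from any real system.

```python
import re
from datetime import datetime

# Hypothetical log format: "<ISO timestamp> <user> <event> <duration>ms"
LINE_PATTERN = re.compile(r"^(\S+)\s+(\w+)\s+(\w+)\s+(\d+)ms$")


def parse_line(line):
    """Turn one raw log line into a typed record, or None if it is malformed."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    timestamp, user, event, duration = match.groups()
    return {
        "timestamp": datetime.fromisoformat(timestamp),
        "user": user,
        "event": event,
        "duration_ms": int(duration),
    }


raw_lines = [
    "2015-08-10T09:15:00 alice click 35ms",
    "garbled line without structure",
    "2015-08-10T09:15:02 bob pause 120ms",
]

# Keep only the lines that could be parsed.
records = [rec for rec in map(parse_line, raw_lines) if rec is not None]
for rec in records:
    print(rec["user"], rec["event"], rec["duration_ms"])
```

Returning None for malformed lines, rather than raising, keeps one bad line from stopping the whole run; whether to skip or fail loudly is a design choice that depends on the data.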
Harvard Business Review declared data scientist to be the “Sexiest Job of the
21st Century”.
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
14. The Role of the Social Scientist in Data Science
Both LinkedIn and Facebook are social network companies.
Oftentimes a description or definition of data scientist includes hybrid
statistician, software engineer, and social scientist.
Whether a social scientist is needed depends on the problems being solved. If
they are social science-y problems like friend recommendations, people you
know, or user segmentation, then by all means, bring on the social
scientist! Social scientists also tend to be good question askers and have
other good investigative qualities, so a social scientist who also has the
quantitative and programming chops makes a great data scientist.
15. Data Science Jobs
Most job descriptions ask data scientists to be experts in computer science,
statistics, communication, and data visualization, and to have
extensive domain expertise.
Nobody is an expert in everything, which is why it makes more sense to create
teams of people who have different profiles and different expertise. Together,
as a team, they can specialize in all those things.
A Data Science Profile:
• Computer science
• Math
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization
16. Rachel’s data science profile, which she created to illustrate trying to visualize oneself as a data
scientist. She wanted students and guest lecturers to “riff” on this: to add buckets or remove
skills, use a different scale or visualization method, and think about the drawbacks of
self-reporting.
17. Data science team profiles can be
constructed from data scientist
profiles; there should be alignment
between the data science team
profile and the profile of the data
problems they try to solve
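One way to make this idea concrete is to combine individual self-reported skill scores into a team profile and compare it with what a data problem needs. The sketch below is only an illustration: the member names, the 0 to 10 scale, the choice of taking the maximum score per skill, and the problem requirements are all assumptions, not something the source specifies.

```python
SKILLS = [
    "computer science", "math", "statistics", "machine learning",
    "domain expertise", "communication", "data visualization",
]

# Hypothetical self-reported scores on a 0-10 scale.
profiles = {
    "analyst": {"statistics": 8, "math": 7, "communication": 6, "data visualization": 7},
    "engineer": {"computer science": 9, "machine learning": 6, "math": 5},
    "domain lead": {"domain expertise": 9, "communication": 8},
}


def team_profile(members):
    """Best available score per skill across the team (missing skills count as 0)."""
    return {
        skill: max(person.get(skill, 0) for person in members.values())
        for skill in SKILLS
    }


def skill_gaps(team, required):
    """Skills where the team falls short of what the problem requires."""
    return {
        skill: need - team.get(skill, 0)
        for skill, need in required.items()
        if team.get(skill, 0) < need
    }


# Hypothetical requirements of a data problem, on the same scale.
problem = {"machine learning": 8, "statistics": 7, "domain expertise": 6}

team = team_profile(profiles)
print(team)
print(skill_gaps(team, problem))
```

Taking the maximum reflects the idea that a team covers a skill if at least one member is strong in it; an average or a sum would answer a different question.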
18. Data science workflow
Section 2
https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext
21. What is hard about Data Science
• Overcoming assumptions
• Making ad-hoc explanations of data patterns
• Overgeneralizing
• Communication
• Not checking enough (validate models, data pipeline
integrity, etc.)
• Using statistical tests correctly
• Prototype → Production transitions
• Data pipeline complexity (who do you ask?)
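Some of the difficulties above, in particular not checking enough, can be reduced with cheap automated sanity checks between pipeline stages. The following is a minimal sketch under stated assumptions: the function name check_stage, the record fields, and the 10% row-loss threshold are hypothetical and would need to be adapted to a real pipeline.

```python
def check_stage(rows_in, rows_out, required_fields, max_drop_ratio=0.1):
    """Return a list of human-readable problems found after a pipeline stage."""
    problems = []

    # An unexpectedly large loss of rows often signals a join or filter bug.
    if rows_in:
        dropped = (len(rows_in) - len(rows_out)) / len(rows_in)
        if dropped > max_drop_ratio:
            problems.append(f"stage dropped {dropped:.0%} of rows")

    # Every output row should carry the fields that later stages rely on.
    for index, row in enumerate(rows_out):
        missing = [field for field in required_fields if row.get(field) is None]
        if missing:
            problems.append(f"row {index} is missing {missing}")

    return problems


# Hypothetical data before and after a filtering stage.
before = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0},
          {"id": 3, "amount": None}, {"id": 4, "amount": 7.5}]
after = [{"id": 1, "amount": 10.0}, {"id": 3, "amount": None}]

for problem in check_stage(before, after, required_fields=["id", "amount"]):
    print(problem)
```

Returning a list of problems instead of raising on the first one makes it easier to log everything that looks wrong in a single run.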
23. What are Data Scientists really doing?
https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
Editor's Notes
Ronny Kohavi keynote at KDD 2015
People are incredibly clever at explaining “very surprising results”. Unfortunately, most very surprising results are caused by data pipeline errors.
Beware “HiPPOs” (Highest Paid Person’s Opinion).
Quote from paper: “I’d rather the data go away than be wrong and not know.”
Assumptions not communicated; transformations not documented.