Presentation on Demystifying Data Science, which I gave at a panel discussion organised by Christ University on March 1, 2019. It offers aspiring Data Scientists a realistic perspective of Data Science, drawn from my own experience as a Data Scientist.
4. WHAT IS DATA SCIENCE?
“Data science is the field of study that combines domain expertise, programming
skills, and knowledge of math and statistics to extract meaningful insights from data.”
“Data science is the discipline of making data useful.”
“Data Science as a multi-disciplinary subject encompasses the use of mathematics, statistics, and
computer science to study and evaluate data. The key objective of Data Science is to extract valuable
information for use in strategic decision making, product development, trend analysis and forecasting.”
“Data science is a ‘concept to unify statistics, data analysis, machine learning and their
related methods’ in order to ‘understand and analyze actual phenomena’ with data.”
R Venkat Raman
5. WHAT IS DATA SCIENCE?
THE VENN DIAGRAMS
6. WHO IS A DATA SCIENTIST?
“An ideal data scientist is someone who has both the engineering skills to acquire and manage
large data sets, and also has the statistician’s skills to extract value from the large data sets and
present that data to a large audience”
“A data scientist is someone who blends math, algorithms, and an understanding of human
behaviour with the ability to hack systems together to get answers to interesting human questions
from data”
“A Data Scientist is a person who does Data Science”
“Person who is better at statistics than any software engineer and
better at software engineering than any statistician.”
17. INCREASED STORAGE AND COMPUTING POWER
THE STATISTICS – MACHINE LEARNING DIVERGENCE
• In the 20th century, computing and storage power were limited. Statisticians therefore had to infer a great deal from a sample, so inferential statistics was heavily used and relied upon.
• Fast forward to today: computing and storage power have increased substantially, which has enabled machine learning and deep learning to blossom. In machine/deep learning, more data is better, because predictions improve with more quality training data. This thinking diverges from 20th-century statistical thinking.
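The sample-based reasoning in the first bullet can be illustrated with a small simulation. The sketch below is not part of the original slides: it assumes a synthetic, normally distributed population (the mean of 50 and standard deviation of 10 are made up) and shows how the standard error of a sample mean, the quantity inferential statistics uses to reason from a sample, shrinks as the sample grows. It is only a simple proxy for the "more data is better" point and does not train any machine learning model.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population parameters, chosen only for illustration.
TRUE_MEAN = 50.0
TRUE_SD = 10.0


def standard_error(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


results = []
for n in (100, 1_000, 10_000):
    # Draw a fresh sample of size n from the synthetic population.
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    results.append((n, statistics.mean(sample), standard_error(sample)))

for n, mean, se in results:
    print(f"n={n:>6}  sample mean={mean:7.3f}  standard error={se:.3f}")
```

Because the standard error scales as 1/sqrt(n), each tenfold increase in sample size narrows it by roughly a factor of 3.2. A small sample still supports inference, only with wider uncertainty.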
18. EXPLOSION OF DATA
• 2.5 quintillion bytes of data are created each day [1]
• 90% of the data in the world today has been created in the last two years alone [1]
• More than 3.7 billion humans use the internet [1]
• Every minute, Snapchat users share 527,760 photos, users watch 4,146,600 YouTube videos, 456,000 tweets are sent on Twitter, and Instagram users post 46,740 photos
• Close to 3 billion smartphone users in the world
[1] Report as of 2018
There is tremendous scope to extract insights from this data, hence the demand for Data Scientists.
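To put the figures above on a common scale, the short calculation below converts them into daily totals. It is a back-of-the-envelope sketch that uses only the numbers quoted on the slide. It assumes the per-minute rates hold constant across a full day and uses decimal (SI) units, where one exabyte is 10^18 bytes.

```python
# Figures quoted on the slide (report as of 2018).
BYTES_PER_DAY = 2.5e18  # 2.5 quintillion bytes
PER_MINUTE = {
    "Snapchat photos shared": 527_760,
    "YouTube videos watched": 4_146_600,
    "Tweets sent": 456_000,
    "Instagram photos posted": 46_740,
}

MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60

# Decimal (SI) units: 1 EB = 1e18 bytes, 1 TB = 1e12 bytes.
exabytes_per_day = BYTES_PER_DAY / 1e18
terabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 1e12

# Assumption: each per-minute rate is constant over the whole day.
per_day = {name: rate * MINUTES_PER_DAY for name, rate in PER_MINUTE.items()}

print(f"{exabytes_per_day:.1f} EB per day, about {terabytes_per_second:.1f} TB per second")
for name, total in per_day.items():
    print(f"{name}: {total:,} per day")
```

Under these assumptions, 456,000 tweets per minute works out to about 657 million tweets per day, and 2.5 quintillion bytes per day is roughly 29 terabytes every second.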
21. DATA SCIENCE – A TEAM EFFORT
Data Engineers
What They Do
• Create data pipelines
• Evaluate databases
• Design schemas
• Perform ETL
Skill Set
• Knowledge of databases
• Scripting skills (Linux commands)
• Knowledge of cloud technologies
• SQL commands

Data Scientists
What They Do
• Apply statistical/machine learning techniques to solve business problems
• Perform R&D
• Innovate new solutions
• Develop data science products
Skill Set
• Knowledge of statistical and mathematical concepts
• Knowledge of various statistical/ML algorithms
• Scripting skills (R/Python)
• SQL commands

Software Engineers
What They Do
• Help design UI (front-end coding)
• Do backend coding
• Help deploy data science solutions in production
• Automate the entire process
Skill Set
• Knowledge of programming concepts
• Programming languages
• Knowledge of databases
• Knowledge of RESTful APIs
• Scripting skills (Linux commands)

Data Storytellers/Translators
What They Do
• Communicate data science solutions in business-friendly/non-technical terms
• Understand business requirements and translate them to data science problems
• Design persuasive data visualizations
Skill Set
• High-level understanding of statistics and ML concepts
• Business acumen
• Good soft skills
• Creativity
• Persuasion and articulation

Tools Used
[Row present on the slide; its contents are not captured in this text]
23. THE DATA SCIENTIST TALENT STACK
IDEA INSPIRED BY SCOTT ADAMS’ TALENT STACK THEORY
Knowledge of Inner Workings of Algorithms
Statistics/Maths Skills
Coding/ Technical Skills
Persuasion /Storytelling
24. THE PATH TO BECOME A DATA SCIENTIST
• Can anyone become a Data Scientist?
Yes
• Can a person become a Data Scientist just by taking some MOOCs/short courses over 3-6 months?
No
25. HOW GOOD ARE THE MOOCS AND KAGGLE COMPETITIONS?
TOO MUCH SIGNALING
MOOCs
• There are thousands of courses available online now.
• While the courses may be useful for building knowledge or as a repository for revising concepts, the course certificates by themselves do not guarantee a person a Data Science job.
• Millions of people take the same courses, and the solutions to the questions in these MOOCs are easily hackable or available.
Kaggle Competitions
• Kaggle competitions are more about showcasing processing speed or ensemble techniques than intellectual rigor.
• Real-life data is never as clean as the data given in Kaggle competitions.
• But Kaggle kernels are useful.
26. GETTING HIRED AS A DATA SCIENTIST
HOW TO IMPROVE VISIBILITY AND BECOME EMPLOYABLE
• Focus on a specific area such as NLP, Computer Vision, Marketing Analytics, or Classical Statistical applications. Try to be a specialist rather than a generalist.
• This strategy will work to gain entry into the field of Data Science. But as one gains more experience, it becomes harder to stay a specialist unless one is in an academic framework.
• Write technical and non-technical blogs
• Try the Feynman technique of learning things
• Do pet projects, develop small products, and put the code on GitHub
• Learn niche and complementary skills, such as putting code into production or dockerizing code
• Network with Data Scientists in industry and academia
• Follow Data Scientists on Twitter or LinkedIn
• As an institution or individual, start Data Science podcasts