Welcome to CMPSC-310!
Introduction to Data Science
What Is Data Science?
Extraction of knowledge from data (also known as
knowledge discovery and data mining, KDD).
Data science :=
Computer science (for data structures,
algorithms, visualization, big data support, general
programming) +
Statistics (for regressions and inference) +
Domain knowledge (for asking questions and
interpreting results).
Data, Information, Knowledge, etc.
(by David Somerville @smrvl)
Data Science and Other Disciplines: BI
Business Intelligence (BI) engineers traditionally build tools that others use to
analyze data; they do not analyze the data themselves. Data scientists both build
the tools and use them to analyze data. If you are a software engineer, you need to
learn statistical modeling and how to communicate results. You will need to work
with the datasets yourself and use them to make decisions.
Data Science and Other Disciplines: STATS
Statisticians are traditionally content with the assumption that all their data
will fit in main memory at the same time. They traditionally used existing math,
or created new math, to squeeze as much information as possible from small
numbers of observations or features. Data scientists recognize the need to use
and create math to handle analyses in data-poor environments, but they also use
and create software engineering tools to handle very large datasets, and they
recognize that some of the models are the same in both cases. To be a data
scientist, you need to learn to deal with data that does not fit in memory,
because it is no longer safe to assume that it will.
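One common way to work with data that does not fit in memory is to process it in a single pass, keeping only a few running totals. The sketch below is an illustration, not something taken from the slides: it uses Welford's online algorithm, and the `numbers()` generator is a hypothetical stand-in for a large data source such as a file read line by line.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) in one pass over the data."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


def numbers():
    # Hypothetical data source: values are produced one at a time,
    # so the full dataset never has to be held in memory.
    for i in range(1, 100_001):
        yield float(i)


count, mean, variance = streaming_mean_variance(numbers())
print(count, mean, variance)
```

Only three numbers are kept in memory regardless of how many values pass through, which is the property that matters when the dataset is larger than RAM.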
Data Science and Other Disciplines: DB
Database programmers and administrators bring useful skills to data science,
but they are traditionally focused on one data model: relational. Handling
graphs (nodes and edges, e.g., PageRank), images, video, and text, as well as SQL
when appropriate, is more like data science. You need to deal with unstructured
data to be a data scientist.
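To make the graph example concrete, here is a minimal sketch of PageRank computed by power iteration on a tiny graph stored as a dictionary of adjacency lists. The graph, the damping factor of 0.85 (a conventional choice), and the fixed iteration count are all assumptions made for illustration; the function expects every link target to also appear as a key.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each node to the list of nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this node's rank evenly among its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: an edge a -> b means "a links to b".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

A dictionary of lists is a natural fit here because the data is about relationships between items, not rows with fixed columns.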
Data Science and Other Disciplines: Visualization
Visualization experts and business analysts bring useful skills but are
traditionally not concerned with massive scale, such as hundreds or thousands of
machines. If you are a business analyst, you need to learn about algorithms and
tradeoffs at large scale. With cloud computing and with algorithms, you may get an
answer, but it may cost more or less than it did 5 years ago. It is no longer safe
to throw your trust over the wall to some algorithm, or to your staff running some
algorithm. You will need to internalize the tradeoffs of choosing one model or
another yourself.
Data Science and Other Disciplines: ML
Machine learning is similar to data science, but it is only a small fraction of it.
Getting data, cleaning it, exploring it, and building interactive visualizations
and data products for yourself and for others to use (e.g., data-driven language
translators and spellcheckers), in addition to doing ML: these are more like data
science.
Topics
● Numeric data analysis
● Signal processing
● Text data analysis (information/document/text retrieval, natural language
processing)
● Statistical inference
● Databases (information integration)
● Complex network analysis
● Data visualization
Define the Question of Study
● Descriptive: Describe a set of data.
● Exploratory: Find new relationships.
● Inferential: Use a small data sample to describe a bigger population. Based
on statistics.
● Predictive: Use data on some objects to predict values for another object.
● Causal: Does one variable affect another variable? Based on statistics.
Correlation != Causation.
● Mechanistic: Exactly how does one variable affect another variable? Based
on deep domain knowledge.
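The first few question types can be told apart with a small example. The sketch below uses made-up numbers (hours studied and exam scores for eight hypothetical students) and only the Python standard library. The confidence interval uses a normal approximation, which is a simplification; for a sample this small a t-based interval would be wider.

```python
import math
import statistics

# Hypothetical sample: hours studied and exam scores for eight students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 71.0, 78.0, 85.0]

# Descriptive: summarize the data we actually have.
mean_score = statistics.mean(scores)
sd_score = statistics.stdev(scores)

# Inferential: use the sample to estimate the population mean
# (rough 95% interval, normal approximation).
half_width = 1.96 * sd_score / math.sqrt(len(scores))
ci = (mean_score - half_width, mean_score + half_width)


# Exploratory: look for a relationship between two variables.
def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


r = pearson(hours, scores)
print(mean_score, ci, r)
```

A strong correlation here would not by itself answer the causal question: students who study more may differ in other ways that also affect their scores.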
Get and Clean Data
1. Define the ideal data set
Determine what data you can access
2. Obtain the data
Raw data vs. processed data: always start from the raw data, but process it only
once, and record all processing steps
3. Clean the data
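The advice to keep the raw data untouched and to record every processing step can be sketched as a cleaning function that returns both the cleaned rows and a log of what it did. The column names, the sample records, and the specific cleaning rules below are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical raw data; in practice this would be read from a file.
RAW = """name,age
Alice,34
Bob,
 carol ,29
Alice,34
"""


def clean(raw_text):
    """Return (clean_rows, steps). The raw text itself is never modified."""
    steps = []
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    steps.append(f"read {len(rows)} raw rows")

    for row in rows:
        row["name"] = row["name"].strip().title()
    steps.append("stripped whitespace and normalized capitalization of names")

    before = len(rows)
    rows = [row for row in rows if row["age"].strip()]
    steps.append(f"dropped {before - len(rows)} rows with missing age")

    before = len(rows)
    seen = set()
    unique = []
    for row in rows:
        key = (row["name"], row["age"])
        if key not in seen:
            seen.add(key)
            unique.append({"name": row["name"], "age": int(row["age"])})
    steps.append(f"dropped {before - len(unique)} duplicate rows")
    return unique, steps


clean_rows, processing_log = clean(RAW)
for step in processing_log:
    print(step)
```

Because the log is produced by the same code that does the cleaning, it cannot drift out of date the way hand-written notes can.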
Explore Data
● Exploratory data analysis
● Model data and predict
● Interpret results
● Challenge results
● Present results to the data sponsor
Create Reproducible Code
● Don't do things by hand: teach the computer! Anything done by hand must be
precisely documented
● Don't use interactive GUI tools (no history!)
● Use version control software (Git/GitHub)
● Avoid intermediate files, unless they are hard to build (in which case cache
them)
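The last point, caching an intermediate result only when it is expensive to rebuild, can be sketched as a small "load or build" helper. The function name, the JSON file format, and the `expensive_step` placeholder are assumptions for this example; a temporary directory is used so the sketch leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(cache_file, build):
    """Return the cached result if present; otherwise build and cache it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result


calls = []


def expensive_step():
    # Placeholder for something slow, such as a large download or a long
    # computation; it records each call so we can see when it really runs.
    calls.append(1)
    return {"total": sum(range(1000))}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "cache" / "totals.json"
    first = load_or_build(cache_path, expensive_step)   # builds and caches
    second = load_or_build(cache_path, expensive_step)  # reads the cache

print(first, second, len(calls))
```

Deleting the cache file forces a rebuild from scratch, so the cached copy never becomes the only source of the result.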
Report Structure
● Project report
○ Abstract: A brief description of the project.
○ Introduction.
○ Methods.
○ Results.
○ Conclusion.
● Code
○ Well-commented scripts that can be executed without any command-line parameters or
interaction.
Suggested Directory Structure
● data – for the input data, if needed
● cache – for the previously downloaded data
● results – for numerical results
● code – for the Python script(s)
● doc – for the report and figures
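A project skeleton with these five directories can be created by a short script. This is a minimal sketch: the `create_layout` helper is a hypothetical name, and a temporary directory stands in for the real project root.

```python
import tempfile
from pathlib import Path

# Directory names taken from the suggested structure above.
PROJECT_DIRS = ("data", "cache", "results", "code", "doc")


def create_layout(root):
    """Create the suggested project directories under root."""
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to run repeatedly
        created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    paths = create_layout(tmp)
    names = sorted(p.name for p in paths if p.is_dir())

print(names)
```

Using `exist_ok=True` means the script can be rerun on an existing project without errors.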
Data Acquisition Pipeline
