Data Science Project Lifecycle and Skill Set

  1. Data Science Project Lifecycle and Data Scientist Skill Set
     Jason Geng @Data Application Lab
     Miya Du @Data Science Association
  2. Business Requirement → Data Acquisition → Data Preparation → Hypothesis & Modeling → Evaluation & Interpretation → Deployment → Operations → Optimization
  3. Business Requirements
     • Data scientists need to work with business people and domain experts to understand both the data and the business
     • Specify the business requirements
     • For instance, the healthcare data
  4. Understand the Business and the Data
     • Understand the business: Goal: Predict Readmission Rate
     • Understand the data: Database: Healthcare: Readmissions Database, e.g. ‘DISCWT’: ‘This is the discharge-level weight on the HCUP nationwide data to produce national estimates’
     • Modeling
  5. Data Collection
     • Data from product line
     • Purchase third party data
     • Social media (Facebook, LinkedIn)
     • Web crawling
     • Open source data (Opendata, U.S. Census Data)
     Challenge: Data Storage, Data Management
  6. [Diagram]
     • Legacy data, OLTP, Web Log, Web Crawler, Open Source, Third Party Data, Social Media Data
     • XML, CSV, LOG, SQL, …
     • Product Line, Business Intelligence, Data Science App
  7. Data preparation (data wrangling)
     • Cleaning data (semantic errors, missing entries, or inconsistent formatting)
     • Challenge: data integration
     • Takes about 80% of the time in the project workflow
     [Diagram] Data Source A, Data Source B, Data Source B → ETL → Data Warehouse
  8. Feature engineering
     • Select or create features
     • Research feature relevance
     • Experiment and validate
     • Change the feature set
     • Go back to the feature selection step
  9. Modeling
     Reference source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
  10. Deploy to product line
  11. Data Science Skill Tree
     • Machine Learning
     • Data Collection
     • Communication & Storytelling
     • Data Wrangling
     • Product Development & Feedback Analysis
     • Data Visualization
     • Statistics
     • Domain Knowledge & Business Mindset
  12. Required Knowledge (Skillsets: Knowledge)
     • Domain Knowledge and Business Mindset
         • Programming: R, Python, NLP, Java, Distributed System
         • Industry: Various Concentrations (Finance, E-Commerce, Geo, Biology, Medicine)
     • Data Collection & Wrangling
         • Database: Database Systems and Management
         • Big Data: Big Data Processing and Analytics
     • Statistics: Modeling, Inference and Optimization
     • Machine learning: Data Mining and Machine Learning
     • Data Visualization: Data Visualization and Exploratory Analytics
     • Communication and Storytelling: Professional Speaking and Writing
  13. Program Comparison
     • Universities: Northwestern, CMU, Johns Hopkins, Columbia University, Stanford, Berkeley, UW, USC
     • Domain Knowledge & Business Mindset
         • Programming: ✓ for 7 of the 8 programs
         • Industry: ✓ for 5 of the 8 programs
     • Data Collection & Wrangling
         • Database: ✓ for 4 of the 8 programs
         • Big Data: ✓ for 6 of the 8 programs
     • Statistics: ✓ for all 8 programs
     • Machine learning: ✓ for 7 of the 8 programs
     • Data Visualization: ✓ for 4 of the 8 programs
     • Communication and Storytelling: ✓ for 3 of the 8 programs
  14. Thank you!
     https://www.DataAppLab.com
     Feb 2017
     PPT: Xiaolu Zhao @ Feb 16, 2017
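
The data preparation step on slide 7 (cleaning missing entries and inconsistent formatting, then integrating several sources into one warehouse) can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not the presenters' pipeline: the two sources, the field names (`patient_id`, `admit_date`, `age`, `state`), and every value are hypothetical.

```python
from datetime import datetime

# Hypothetical records from two sources; field names and values are
# illustrative assumptions, not taken from the slides.
SOURCE_A = [
    {"patient_id": "001", "admit_date": "2017-02-16", "age": "54"},
    {"patient_id": "002", "admit_date": "02/17/2017", "age": ""},
]
SOURCE_B = [
    {"patient_id": "2", "state": "ca"},
    {"patient_id": "1", "state": " CA "},
]

# Date layouts we assume may appear in source A.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format; return an ISO date string, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_a(row):
    """Normalize the key, unify the date format, and mark a missing age as None."""
    age = row["age"].strip()
    return {
        "patient_id": int(row["patient_id"]),
        "admit_date": parse_date(row["admit_date"]),
        "age": int(age) if age else None,
    }


def clean_b(row):
    """Normalize the key and the inconsistently formatted state code."""
    return {
        "patient_id": int(row["patient_id"]),
        "state": row["state"].strip().upper(),
    }


def integrate(rows_a, rows_b):
    """Left-join cleaned source A with cleaned source B on patient_id."""
    by_id = {r["patient_id"]: r for r in map(clean_b, rows_b)}
    merged = []
    for row in map(clean_a, rows_a):
        extra = by_id.get(row["patient_id"], {})
        merged.append({**row, "state": extra.get("state")})
    return merged


# Stand-in for the "Data Warehouse" box on the slide.
WAREHOUSE = integrate(SOURCE_A, SOURCE_B)
```

Most of the code handles mismatched keys, date layouts, and blank fields rather than the join itself, which is consistent with the slide's point that integration is the main challenge.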

Editor's Notes

  • Add health care
    Readmission Niu ying
  • A coefficient computed so that all records can be compared side by side against nationwide healthcare data
  • Data source + add picture => bring challenge
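
The discharge-level weight `DISCWT` from slide 4 is described as a way to produce national estimates from sampled data. A minimal sketch of how such a weight might be applied to a readmission rate follows. Only the name `DISCWT` comes from the slides; the `readmitted` flag, the sample rows, and the weighting formula (a plain weighted mean) are assumptions for illustration.

```python
# Hypothetical sample of discharges. DISCWT is the weight named on slide 4;
# the "readmitted" flag and all values are made up for illustration.
DISCHARGES = [
    {"DISCWT": 1.0, "readmitted": 1},
    {"DISCWT": 1.0, "readmitted": 0},
    {"DISCWT": 4.0, "readmitted": 0},
    {"DISCWT": 2.0, "readmitted": 1},
]


def unweighted_rate(rows):
    """Share of sampled discharges that were readmitted."""
    return sum(r["readmitted"] for r in rows) / len(rows)


def national_estimates(rows):
    """Scale each sampled discharge by its weight before aggregating."""
    total = sum(r["DISCWT"] for r in rows)
    readmits = sum(r["DISCWT"] * r["readmitted"] for r in rows)
    return {
        "discharges": total,
        "readmissions": readmits,
        "readmission_rate": readmits / total,
    }


ESTIMATES = national_estimates(DISCHARGES)
SAMPLE_RATE = unweighted_rate(DISCHARGES)
```

With these made-up rows the sample rate and the weighted rate differ, because the heavily weighted discharge was not a readmission. That gap is the reason to understand the weight column before modeling.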
