O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.

RAMP: data challenges with modularization and code submission

1.392 visualizações

Publicada em

http://www.ramp.studio
1) A short history of RAMPs: Rapid Analytics and Model Prototyping
a) motivations
b) design principles
c) the current tool
2) Three data challenges
a) anomaly detection in the LHC ATLAS detector
b) classifying and regressing on molecular spectra
c) time series forecasting of El Niño
3) What have we learned?
a) Course RAMPs beat single day hackatons significantly
i) larger number of students?
ii) longer RAMPs?
iii) M.Sc. students are better than data science researchers?
iv) stronger incentives?
v) closed phase preceding an open phase (vs pure open RAMP) helps to create diversity?
b) Open phase helps novice participants to catch up: the goal of teaching!
c) Sometimes also makes the best and blended score better
d) Human blending often beats machine blending
e) Human feature engineering easily beats deep learning on some data

Publicada em: Ciências
  • Seja o primeiro a comentar

RAMP: data challenges with modularization and code submission

  1. 1. Center for Data Science Paris-Saclay B. Kégl (CNRS) 1 Université Paris-Saclay / CNRS BALÁZS KÉGL RAMP DATA CHALLENGES WITH MODULARIZATION AND CODE SUBMISSION LESSONS LEARNED
  2. 2. Center for Data Science Paris-Saclay B. Kégl (CNRS) • A short history of RAMPs • motivations, design principles, and the current tool • Three data challenges • anomaly detection in the LHC ATLAS detector • classifying and quantifying drug preparations for cancer therapy • time series forecasting of El Niño • What have we learned? • number of participants, incentives? • open vs closed? • blending vs human ingenuity 2 OUTLINE
  3. 3. Center for Data Science Paris-Saclay UNIVERSITÉ PARIS-SACLAY 3 + horizontal multi-disciplinary and multi-partner initiatives to create cohesion
  4. 4. Center for Data Science Paris-Saclay B. Kégl (CNRS) Biology & bioinformatics IBISC/UEvry LRI/UPSud Hepatinov CESP/UPSud-UVSQ-Inserm IGM-I2BC/UPSud MIA/Agro MIAj-MIG/INRA LMAS/Centrale Chemistry EA4041/UPSud Earth sciences LATMOS/UVSQ GEOPS/UPSud IPSL/UVSQ LSCE/UVSQ LMD/Polytechnique Economy LM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA CosmoStat/CEA IAS/UPSud AIM/CEA LAL/UPSud 250researchers in 35laboratories Machine learning LRI/UPSud LTCI/Telecom CMLA/Cachan LS/ENSAE LIX/Polytechnique MIA/Agro CMA/Polytechnique LSS/Supélec CVN/Centrale LMAS/Centrale DTIM/ONERA IBISC/UEvry Visualization INRIA LIMSI Signal processing LTCI/Telecom CMA/Polytechnique CVN/Centrale LSS/Supélec CMLA/Cachan LIMSI DTIM/ONERA Statistics LMO/UPSud LS/ENSAE LSS/Supélec CMA/Polytechnique LMAS/Centrale MIA/AgroParisTech machine learning information retrieval signal processing data visualization databases Domain science human society life brain earth universe Tool building software engineering clouds/grids high-performance computing optimization Domain scientistSoftware engineer datascience-paris-saclay.fr LIST/CEA 4 Center for Data Science Paris-Saclay A multi-disciplinary initiative, building interfaces, matching people, helping them launching projects 345 affiliated researchers, 50 laboratories
  5. 5. Center for Data Science Paris-Saclay B. Kégl (CNRS) 5 Data scientist Data trainer Applied scientist Domain expertSoftware engineer Data engineer Tool building Data domains Data science statistics
 machine learning information retrieval signal processing data visualization databases • interdisciplinary projects • data challenges • ultrawalls and interactive visualization • coding sprints • Open Software Initiative • code consolidator and engineering projects software engineering
 clouds/grids high-performance
 computing optimization energy and physical sciences health and life sciences Earth and environment economy and society brain ! • data science RAMPs and TSs • IT platform for linked data THE DATA SCIENCE ECOSYSTEM https://medium.com/@balazskegl/the-data-science-ecosystem-678459ba6013
  6. 6. Center for Data Science Paris-Saclay B. Kégl (CNRS) • We have realized how much data scientists are ill-equipped to manage the data science development process • collaborating with domain scientists or business units on data-driven problem/product formulation • bench for logging experiments, guiding (human) model search and tuning • managing data science teams and collaborating with each other • collaborating with AI (out-of-the-box AutoML does not work: the search space is too big) • managing the productionalization of a prototype model 6 LACK OF TOOLS
  7. 7. 7 RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP) http://www.ramp.studio
  8. 8. • Organizers have no direct access to solutions • Emphasize competition: participants cannot build on each other’s solutions • No modularization: ideas go unnoticed unless packaged into a top submission 8 LIMITATIONS OF DATA CHALLENGES
  9. 9. Center for Data Science Paris-Saclay B. Kégl (CNRS) • Challenge with code submission • Following Nielsen’s three crowdsourcing principles: • modularity: workflows are sliced into workflow element modules that can be tackled independently • encourage small contributions: e.g., copy another submission, add features, change the hyperparameters, resubmit • rich and well structured information commons: open and download each other’s code, discuss on slack 9 RAMP
  10. 10. Center for Data Science Paris-Saclay B. Kégl (CNRS) • Roughly two formats • single day hackatons with 20-50 participants, open leaderboard, 15 minute timeout • 1-3 week course challenges up 150 students (but no limit really): closed phase with 1-3 submissions per day followed by an open phase with 15 minute timeout • 600+ users, 5000+ models 10 RAMP RAPID ANALYTICS AND MODEL PROTOTYPING
  11. 11. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP RAPID ANALYTICS AND MODEL PROTOTYPING 11 frontend DB backend users submissions score problems workflow starting kit crossval data workflow train+test+blend
  12. 12. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 12
  13. 13. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 13
  14. 14. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 14
  15. 15. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 15
  16. 16. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 16
  17. 17. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 17
  18. 18. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 18
  19. 19. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 19
  20. 20. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 20
  21. 21. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 21
  22. 22. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 22
  23. 23. 23 Three recent RAMPs
  24. 24. Center for Data Science Paris-Saclay B. Kégl (CNRS) ANOMALY DETECTION IN THE LHC ATLAS DETECTOR 24 reconstruction +simulated anomalies classifier anomaly (isSkewed = 1) correct
 (isSkewed = 0) ?
  25. 25. Center for Data Science Paris-Saclay B. Kégl (CNRS) FORECASTING EL NINO SIX MONTHS AHEAD 25
  26. 26. Center for Data Science Paris-Saclay B. Kégl (CNRS) FORECASTING EL NINO SIX MONTHS AHEAD 26 … 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50… … feature extractor x (a fixed length feature vector) regressor • We give the full series to the feature extractor • It could look ahead in the future (even inadvertently) • Checking lookahead by a randomized test
  27. 27. Center for Data Science Paris-Saclay B. Kégl (CNRS) CLASSIFYING AND REGRESSING ON MOLECULAR SPECTRA 27 chemotherapy drug in elastic pocket laser spectrometer molecular spectra feature extractor 1 feature extractor 2 regressor concentration classifier drug type
  28. 28. 28 Analyzing the analysis
  29. 29. Center for Data Science Paris-Saclay B. Kégl (CNRS) OPEN PHASE LETS PARTICIPANTS CATCH UP 29
  30. 30. Center for Data Science Paris-Saclay B. Kégl (CNRS) 30 T-SNE ON TEST PREDICTIONS starting kit the crowd early influencers inventors
  31. 31. Center for Data Science Paris-Saclay B. Kégl (CNRS) 31
  32. 32. Center for Data Science Paris-Saclay B. Kégl (CNRS) 32
  33. 33. Center for Data Science Paris-Saclay B. Kégl (CNRS) 33
  34. 34. Center for Data Science Paris-Saclay B. Kégl (CNRS) 34
  35. 35. Center for Data Science Paris-Saclay B. Kégl (CNRS) 35 the single day hackaton ceiling what you achieved with a well tuned deep net the diversity gap the human blender gap
  36. 36. Center for Data Science Paris-Saclay B. Kégl (CNRS) 36 the single day hackaton floor
  37. 37. Center for Data Science Paris-Saclay B. Kégl (CNRS) 37 the single day hackaton floor
  38. 38. Center for Data Science Paris-Saclay B. Kégl (CNRS) • Open phase helps novice participants to catch up: the goal of teaching! • Sometimes also makes the best and blended score better • Human blending often beats machine blending • Human feature engineering easily beats deep learning on some data • Course RAMPs beat single day hackatons significantly • larger number of students? • longer RAMPs? • novice and master-level students are better than data science researchers? • stronger incentives? • closed phase preceding an open phase (vs pure open RAMP) helps to create diversity? 38 WHAT WE LEARNED
  39. 39. Center for Data Science Paris-Saclay B. Kégl (CNRS) 39 closed phase followed by open the second diversity gap
  40. 40. Center for Data Science Paris-Saclay B. Kégl (CNRS) 40
  41. 41. Center for Data Science Paris-Saclay B. Kégl (CNRS) 41
  42. 42. Center for Data Science Paris-Saclay B. Kégl (CNRS) • Fast development of analytics solutions • Teaching support • Networking • Support for collaborative team work 42 THE RAMP TOOL A prototyping tool for collaborative development of data science workflows
  43. 43. Center for Data Science Paris-Saclay B. Kégl (CNRS) • More RAMPs, sign up at http://www.ramp.studio
 if interested • Tools to manage the crowd, redirect group attention, etc. • Human - AI interaction: “hyperopt this module for me, would you?” • Streamlining deployment 43 WHAT’S NEXT

×