SlideShare uma empresa Scribd logo
1 de 29
Big Data:
the weakest link
Vivek Nair, Tim Menzies
{vivekaxl,tim.menzies}@gmail.com
HPCC Eng. Summit - Sept 29, 2015
Where is the weakest link?
2
Where is the weakest link?
3
Where is the weakest link?
4
Where is the weakest link?
5
Where is the weakest link?
6
Premise of Big Data
Analysis is a “systems” task?
• Better conclusions =
same algorithms + more
data + more cpu
• If so, then …
– No role for human error
– All insight is auto-generated
from CPUs.
Analysis is a “human” task?
• Current results on “software
analytics”
– A human-intensive process
7
Q: Is Big Data a “Systems” or “Human”-task?
A: Yes
8
Code used in my
last paper
(1100 LOC of Python
calling scikitlearn)
9
Use a Higher-Level languages?
• ECL solves this problem?
• But if you can write it quick,
– you can write it wrong, quick.
10
Is this really a problem?
• Q: What would we expect
to see if…
– Top experts, publishing in top
journals
– Many of the same data sets
– 8 years of trying
• A:
– Perhaps some upward
progress
– Perhaps a little less variance
11
So, what do
we see?
• Software analytics
– Defect prediction
– Many of the same learners,
– Many of the same data sets
• 42 papers,
top journals,
• 23 author groups
• 2002 to 2010
• Y-axis measures
mean performance
12
Researcher Bias: The Use of Machine Learning in Software Defect Prediction, Martin Shepperd,
David Bowes, and Tracy Hall, IEEE TRANS on Soft. Eng. , 40(6), JUNE 2014
13
http://fivethirt
yeight.com/fe
atures/science
-isnt-broken/
A little theory
• James D. Herbsleb, CMU
• Socio-Technical Coordination
• A predictor for higher defects:
– Groups of programmers working
on similar functions then,
– but do not sharing that expertise
14
Q: How to find expertise groups
within the HPCC community?
A: using data mining
15
Static features and commit history
can act as a cue for expertise
● Our motivation
o “relation between embodiment and language
acquisition by locating the ‘minimal set of
necessary features’ that enable language of any
kind to be learned” - The Philosophy of Expertise
16
Software analytics results:
learn predictors for expertise
● “...counts of the cumulative number of different
developers changing a file over its lifetime can help
to improve defect predictions…”[1]
● “Quantify person's experience with a part of code
using change history of the code”[2]
● “RevFinder, a file location-based code-reviewer
recommendation approach” [3]
● “30% of its code entities has more than 0.3 of
similarity with at least one developer vocabulary”
[4]
17
[1] Ostrand, Thomas J., Elaine J. Weyuker, and Robert M. Bell.
"Programmer-based fault prediction." Proceedings of the 6th
International Conference on Predictive Models in Software Engineering.
ACM, 2010.
[2] Mockus, Audris, and James D. Herbsleb. "Expertise browser: a
quantitative approach to identifying expertise." Proceedings of the
24th international conference on software engineering. ACM, 2002.
[3] Thongtanunam, Patanamon, et al. "Who should review my code? A
file location-based code-reviewer recommendation approach for
Modern Code Review."Software Analysis, Evolution and Reengineering
(SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015.
[4] Santos, Katyusco de F., Dalton DS Guerrero, and Jorge CA de
Figueiredo. "Using Developers Contributions on Software Vocabularies
to Identify Experts."Information Technology-New Generations (ITNG),
2015 12th International Conference on. IEEE, 2015.
Q: And what data mining suite will we
use to mine data about programmers?
• A: need you ask?
18
Source Code
19
But what are we clustering?
Developer products
• Lightweight parsing of source code
• Developers profiles, accessed
via social media sites
Languages Used
Skill Set (self promotion)
Data processing
1. Github repos (for code) ➔ Social media(for years of work)
2. Static code analysis: frequency counts of AST features
(e.g. count loops, returns, var comparisons, map, etc )
3. Bayes classifier
Early
career
Later career
Classification
- Features: Nodes of AST
- Algorithms Used: Simple Cart, Random
Forest, Naive Bayes etc.
- Can distinguish expert from novice
programmers
•precision= 78% early career
•precision = 74% later career
* Using Weka
Current status
The good news
• Can auto-find groups of
better programmers
• Can do that for very large
data sets
– The ECL advantages
The other news
• Seeking larger data sets
• Talking to HackerRank
• Looking at ways to
instrument the HPCC
forums
– Matchmaker tools
– Affinity groups
25
Where is the weakest link?
26
Where is the weakest link?
27
We can make that link stronger
28
Acknowledgements:
Thanks to funding from LexisNexis
29

Mais conteúdo relacionado

Mais procurados

[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSM[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSMSunView Software, Inc.
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...Armando Vieira
 
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...ACM Chicago
 
Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference Srinath Perera
 
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...AI Frontiers
 
Crowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic PerspectivesCrowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic PerspectivesAditya Parameswaran
 
Agile data science
Agile data scienceAgile data science
Agile data scienceJoel Horwitz
 
The Science of Data Science
The Science of Data Science The Science of Data Science
The Science of Data Science James Hendler
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with AnacondaTravis Oliphant
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
Data Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2OData Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2OSri Ambati
 
Machine learning on Hadoop data lakes
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakesDataWorks Summit
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data ScienceSpotle.ai
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Data Science 101
Data Science 101Data Science 101
Data Science 101odsc
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 

Mais procurados (20)

[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSM[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSM
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...
 
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
 
Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference
 
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
 
Crowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic PerspectivesCrowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic Perspectives
 
Agile data science
Agile data scienceAgile data science
Agile data science
 
The Science of Data Science
The Science of Data Science The Science of Data Science
The Science of Data Science
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
Data Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2OData Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2O
 
Machine learning on Hadoop data lakes
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakes
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Data Science 101
Data Science 101Data Science 101
Data Science 101
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 

Semelhante a Big Data: the weakest link

Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software DatasetsTao Xie
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringTao Xie
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTrivadis
 
Synergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringSynergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringTao Xie
 
Visualization for Software Analytics
Visualization for Software AnalyticsVisualization for Software Analytics
Visualization for Software AnalyticsMargaret-Anne Storey
 
Keynote at-icpc-2020
Keynote at-icpc-2020Keynote at-icpc-2020
Keynote at-icpc-2020Ralf Laemmel
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)Tao Xie
 
Software Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesSoftware Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesTao Xie
 
Applying AI to software engineering problems: Do not forget the human!
Applying AI to software engineering problems: Do not forget the human!Applying AI to software engineering problems: Do not forget the human!
Applying AI to software engineering problems: Do not forget the human!University of Córdoba
 
Intelligent Software Engineering: Synergy between AI and Software Engineering...
Intelligent Software Engineering: Synergy between AI and Software Engineering...Intelligent Software Engineering: Synergy between AI and Software Engineering...
Intelligent Software Engineering: Synergy between AI and Software Engineering...Tao Xie
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Softwaredgarijo
 
How ChatGPT and AI-assisted coding changes software engineering profoundly
How ChatGPT and AI-assisted coding changes software engineering profoundlyHow ChatGPT and AI-assisted coding changes software engineering profoundly
How ChatGPT and AI-assisted coding changes software engineering profoundlyPekka Abrahamsson / Tampere University
 
Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data
Lies, Damned Lies and Software Analytics:  Why Big Data Needs Rich DataLies, Damned Lies and Software Analytics:  Why Big Data Needs Rich Data
Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich DataMargaret-Anne Storey
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringTao Xie
 
Sudipta_Mukherjee_Resume_APR_2023.pdf
Sudipta_Mukherjee_Resume_APR_2023.pdfSudipta_Mukherjee_Resume_APR_2023.pdf
Sudipta_Mukherjee_Resume_APR_2023.pdfsudipto801
 
Creating An Incremental Architecture For Your System
Creating An Incremental Architecture For Your SystemCreating An Incremental Architecture For Your System
Creating An Incremental Architecture For Your SystemGiovanni Asproni
 
SBQS 2013 Keynote: Cooperative Testing and Analysis
SBQS 2013 Keynote: Cooperative Testing and AnalysisSBQS 2013 Keynote: Cooperative Testing and Analysis
SBQS 2013 Keynote: Cooperative Testing and AnalysisTao Xie
 

Semelhante a Big Data: the weakest link (20)

Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software Datasets
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software Engineering
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
 
Synergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringSynergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software Engineering
 
Visualization for Software Analytics
Visualization for Software AnalyticsVisualization for Software Analytics
Visualization for Software Analytics
 
Keynote at-icpc-2020
Keynote at-icpc-2020Keynote at-icpc-2020
Keynote at-icpc-2020
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)
 
Software Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesSoftware Analytics - Achievements and Challenges
Software Analytics - Achievements and Challenges
 
Applying AI to software engineering problems: Do not forget the human!
Applying AI to software engineering problems: Do not forget the human!Applying AI to software engineering problems: Do not forget the human!
Applying AI to software engineering problems: Do not forget the human!
 
Intelligent Software Engineering: Synergy between AI and Software Engineering...
Intelligent Software Engineering: Synergy between AI and Software Engineering...Intelligent Software Engineering: Synergy between AI and Software Engineering...
Intelligent Software Engineering: Synergy between AI and Software Engineering...
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Software
 
How ChatGPT and AI-assisted coding changes software engineering profoundly
How ChatGPT and AI-assisted coding changes software engineering profoundlyHow ChatGPT and AI-assisted coding changes software engineering profoundly
How ChatGPT and AI-assisted coding changes software engineering profoundly
 
Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data
Lies, Damned Lies and Software Analytics:  Why Big Data Needs Rich DataLies, Damned Lies and Software Analytics:  Why Big Data Needs Rich Data
Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data
 
Reproducible Science and Deep Software Variability
Reproducible Science and Deep Software VariabilityReproducible Science and Deep Software Variability
Reproducible Science and Deep Software Variability
 
NoSQL (Not Only SQL)
NoSQL (Not Only SQL)NoSQL (Not Only SQL)
NoSQL (Not Only SQL)
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
Sudipta_Mukherjee_Resume_APR_2023.pdf
Sudipta_Mukherjee_Resume_APR_2023.pdfSudipta_Mukherjee_Resume_APR_2023.pdf
Sudipta_Mukherjee_Resume_APR_2023.pdf
 
Creating An Incremental Architecture For Your System
Creating An Incremental Architecture For Your SystemCreating An Incremental Architecture For Your System
Creating An Incremental Architecture For Your System
 
Mastering Software Variability for Innovation and Science
Mastering Software Variability for Innovation and ScienceMastering Software Variability for Innovation and Science
Mastering Software Variability for Innovation and Science
 
SBQS 2013 Keynote: Cooperative Testing and Analysis
SBQS 2013 Keynote: Cooperative Testing and AnalysisSBQS 2013 Keynote: Cooperative Testing and Analysis
SBQS 2013 Keynote: Cooperative Testing and Analysis
 

Mais de CS, NcState

Lexisnexis june9
Lexisnexis june9Lexisnexis june9
Lexisnexis june9CS, NcState
 
Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).CS, NcState
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceCS, NcState
 
Kits to Find the Bits that Fits
Kits to Find  the Bits that Fits Kits to Find  the Bits that Fits
Kits to Find the Bits that Fits CS, NcState
 
Ai4se lab template
Ai4se lab templateAi4se lab template
Ai4se lab templateCS, NcState
 
Automated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUAutomated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUCS, NcState
 
Requirements Engineering
Requirements EngineeringRequirements Engineering
Requirements EngineeringCS, NcState
 
172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginiaCS, NcState
 
Automated Software Engineering
Automated Software EngineeringAutomated Software Engineering
Automated Software EngineeringCS, NcState
 
Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)CS, NcState
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceCS, NcState
 
Dagstuhl14 intro-v1
Dagstuhl14 intro-v1Dagstuhl14 intro-v1
Dagstuhl14 intro-v1CS, NcState
 
The Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software DataThe Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software DataCS, NcState
 
What Metrics Matter?
What Metrics Matter? What Metrics Matter?
What Metrics Matter? CS, NcState
 
In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?CS, NcState
 
Sayyad slides ase13_v4
Sayyad slides ase13_v4Sayyad slides ase13_v4
Sayyad slides ase13_v4CS, NcState
 
Warning: don't do CS
Warning: don't do CSWarning: don't do CS
Warning: don't do CSCS, NcState
 

Mais de CS, NcState (20)

Lexisnexis june9
Lexisnexis june9Lexisnexis june9
Lexisnexis june9
 
Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data Science
 
Kits to Find the Bits that Fits
Kits to Find  the Bits that Fits Kits to Find  the Bits that Fits
Kits to Find the Bits that Fits
 
Ai4se lab template
Ai4se lab templateAi4se lab template
Ai4se lab template
 
Automated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUAutomated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSU
 
Requirements Engineering
Requirements EngineeringRequirements Engineering
Requirements Engineering
 
172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia
 
Automated Software Engineering
Automated Software EngineeringAutomated Software Engineering
Automated Software Engineering
 
Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data Science
 
Goldrush
GoldrushGoldrush
Goldrush
 
Dagstuhl14 intro-v1
Dagstuhl14 intro-v1Dagstuhl14 intro-v1
Dagstuhl14 intro-v1
 
Know thy tools
Know thy toolsKnow thy tools
Know thy tools
 
The Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software DataThe Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software Data
 
What Metrics Matter?
What Metrics Matter? What Metrics Matter?
What Metrics Matter?
 
In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?
 
Sayyad slides ase13_v4
Sayyad slides ase13_v4Sayyad slides ase13_v4
Sayyad slides ase13_v4
 
Ase2013
Ase2013Ase2013
Ase2013
 
Warning: don't do CS
Warning: don't do CSWarning: don't do CS
Warning: don't do CS
 

Último

Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Christo Ananth
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spaintimesproduction05
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 

Último (20)

Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 

Big Data: the weakest link

  • 1. Big Data: the weakest link Vivek Nair, Tim Menzies {vivekaxl,tim.menzies}@gmail.com HPCC Eng. Summit - Sept 29, 2015
  • 2. Where is the weakest link? 2
  • 3. Where is the weakest link? 3
  • 4. Where is the weakest link? 4
  • 5. Where is the weakest link? 5
  • 6. Where is the weakest link? 6
  • 7. Premise of Big Data Analysis is a “systems” task? • Better conclusions = same algorithms + more data + more cpu • If so, then … – No role for human error – All insight is auto-generated from CPUs. Analysis is a “human” task? • Current results on “software analytics” – A human-intensive process 7
  • 8. Q: Is Big Data a “Systems” or “Human”-task? A: Yes 8
  • 9. Code used in my last paper (1100 LOC of Python calling scikitlearn) 9
  • 10. Use a Higher-Level languages? • ECL solves this problem? • But if you can write it quick, – you can write it wrong, quick. 10
  • 11. Is this really a problem? • Q: What would we expect to see if… – Top experts, publishing in top journals – Many of the same data sets – 8 years of trying • A: – Perhaps some upward progress – Perhaps a little less variance 11 So, what do we see?
  • 12. • Software analytics – Defect prediction – Many of the same learners, – Many of the same data sets • 42 papers, top journals, • 23 author groups • 2002 to 2010 • Y-axis measures mean performance 12 Researcher Bias: The Use of Machine Learning in Software Defect Prediction, Martin Shepperd, David Bowes, and Tracy Hall, IEEE TRANS on Soft. Eng. , 40(6), JUNE 2014
  • 14. A little theory • James D. Herbsleb, CMU • Socio-Technical Coordination • A predictor for higher defects: – Groups of programmers working on similar functions then, – but do not sharing that expertise 14
  • 15. Q: How to find expertise groups within the HPCC community? A: using data mining 15
  • 16. Static features and commit history can act as a cue for expertise ● Our motivation o “relation between embodiment and language acquisition by locating the ‘minimal set of necessary features’ that enable language of any kind to be learned” - The Philosophy of Expertise 16
  • 17. Software analytics results: learn predictors for expertise ● “...counts of the cumulative number of different developers changing a file over its lifetime can help to improve defect predictions…”[1] ● “Quantify person's experience with a part of code using change history of the code”[2] ● “RevFinder, a file location-based code-reviewer recommendation approach” [3] ● “30% of its code entities has more than 0.3 of similarity with at least one developer vocabulary” [4] 17 [1] Ostrand, Thomas J., Elaine J. Weyuker, and Robert M. Bell. "Programmer-based fault prediction." Proceedings of the 6th International Conference on Predictive Models in Software Engineering. ACM, 2010. [2] Mockus, Audris, and James D. Herbsleb. "Expertise browser: a quantitative approach to identifying expertise." Proceedings of the 24th international conference on software engineering. ACM, 2002. [3] Thongtanunam, Patanamon, et al. "Who should review my code? A file location-based code-reviewer recommendation approach for Modern Code Review."Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015. [4] Santos, Katyusco de F., Dalton DS Guerrero, and Jorge CA de Figueiredo. "Using Developers Contributions on Software Vocabularies to Identify Experts."Information Technology-New Generations (ITNG), 2015 12th International Conference on. IEEE, 2015.
  • 18. Q: And what data mining suite will we use to mine data about programmers? • A: need you ask? 18
  • 20. But what are we clustering? Developer products • Lightweight parsing of source code • Developers profiles, accessed via social media sites
  • 22. Skill Set (self promotion)
  • 23. Data processing 1. Github repos (for code) ➔ Social media(for years of work) 2. Static code analysis: frequency counts of AST features (e.g. count loops, returns, var comparisons, map, etc ) 3. Bayes classifier Early career Later career
  • 24. Classification - Features: Nodes of AST - Algorithms Used: Simple Cart, Random Forest, Naive Bayes etc. - Can distinguish expert from novice programmers •precision= 78% early career •precision = 74% later career * Using Weka
  • 25. Current status The good news • Can auto-find groups of better programmers • Can do that for very large data sets – The ECL advantages The other news • Seeking larger data sets • Talking to HackerRank • Looking at ways to instrument the HPCC forums – Matchmaker tools – Affinity groups 25
  • 26. Where is the weakest link? 26
  • 27. Where is the weakest link? 27
  • 28. We can make that link stronger 28