SlideShare uma empresa Scribd logo
1 de 64
Baixar para ler offline
“Software is eating the world”
128k LoC
4-5M LoC
9M LoC
18M LoC
45M LoC
150M LoC
Machine Learning on Source Code
Francesc Campoy
VP of Product & DevRel
source{d}
Machine Learning for Large Scale Code Analysis
@francesc | #MLonCode
Francesc Campoy
Agenda
● Machine Learning on Source Code
● Research
● Use Cases
● The Future
Machine Learning on Source Code
Machine Learning on Source Code
Field of Machine Learning where the input data is source code.
MLonCode
Machine Learning on Source Code
Requires:
● Lots of data
● Really, lots and lots of data
● Fancy ML Algorithms
● A little bit of luck
Related Fields:
● Data Mining
● Natural Language Processing
● Graph Based Machine Learning
Challenge #1
Data Retrieval
The datasets of ML on Code
● GH Archive: https://www.gharchive.org
● Public Git Archive https://pga.sourced.tech
Announcement: blog.sourced.tech/post/announcing-pga
Public Git Archive
Rooted repositories
blog.sourced.tech/post/pga_history/
Challenge #2
Data Analysis
'112', '97', '99', '107', '97', '103',
'101', '32', '109', '97', '105', '110',
'10', '10', '105', '109', '112', '111',
'114', '116', '32', '40', '10', '9',
'34', '102', '109', '116', '34', '10',
'41', '10', '10', '102', '117', '110',
'99', '32', '109', '97', '105', '110',
'40', '41', '32', '123', '10', '9',
'102', '109', '116', '46', '80', '114',
'105', '110', '116', '108', '110', '40',
'34', '72', '101', '108', '108', '111',
'44', '32', '112', '108', '97', '121',
'103', '114', '111', '117', '110', '100',
'34', '41', '10', '125', '10'
package main
import “fmt”
func main() {
fmt.Println(“Hello, Copenhagen”)
}
What is Source Code
package package
IDENT main
;
import import
STRING "fmt"
;
func func
IDENT main
(
)
What is Source Code
{
IDENT fmt
.
IDENT Println
(
STRING "Hello, Denver"
)
;
}
;
package main
import “fmt”
func main() {
fmt.Println(“Hello, Copenhagen”)
}
What is Source Code
package main
import “fmt”
func main() {
fmt.Println(“Hello, Copenhagen”)
}
What is Source Code
package main
import “fmt”
func main() {
fmt.Println(“Hello, Copenhagen”)
}
What is Source Code
● A sequence of bytes
● A sequence of tokens
● An abstract syntax tree
● A graph (e.g. Control Flow Graph)
Tasks
● Language Classification
● File Parsing
● Token Extraction
● History Analysis
● Reference Resolution
Tools
● enry
● babelfish
● libuast & XPath selectors
● go-git
● kythe.io
Analyzing Code
source{d} engine
github.com/src-d/engine
babelfish
gitbase
jupiter
Demo time!
Challenge #3
Learning from Source Code
Neural Networks
Basically fancy linear regression machines
Given an input of a constant length,
they predict an output of constant length.
Example:
MNIST:
Input: images with 28x28 px
Output: a digit from zero to 9
MNIST
~0
~0
~0
~0
~0
~0
~0
~0
~1
~0
MLonCode: Predict the next token
for
i
:=
0
;
i
<
10
;
i
++
Recurrent Neural Networks
Can process sequences of variable length.
Uses its own output as a new input.
Example:
Natural Language Translation:
Input: “bonjour, les gauffres”
Output: “hi, waffles”
MLonCode: Code Generation
charRNN: Given n characters, predict the next one
Trained over the Go standard library
Achieved 61% accuracy on predictions.
Before training
r t,
kp0t@pp kpktp 0p000 xS%%%?ttk?^@p0rk^@%ppp@ta#p^@ #pp}}%p^@?P%^@@k#%@P}}ta S?@}^@t%@% %%aNt i
^@SSt@@ pyikkp?%y ?t k L P0L t% ^@i%yy ^@p i? %L%LL tyLP?a ?L@Ly?tkk^@ @^@ykk^@i#P^@iL@??@%1tt%^@tPTta L
^@LL%% %i1::yyy^@^@t tP @?@a#Patt 1^@@ k^@k ? yt%L1^@tP%k1?k? % ^@i ^@ta1?1taktt1P?a^@^@Pkt?#^@t^@##1?##
#^@t11#:^@%??t%1^@a 1?a at1P ^@^@Pt #%^@^@ ^@aaak^@#a#?P1Pa^@tt%?^@kt?#akP ?#^@i%%aa ^@1%t tt?a?%
t^@k^@^@k^@ a : ^@1 P# % ^@^@#t% :% kkP ^@#?P: t^@a
?%##?kkPaP^@ #a k?t?? ai?i%PPk taP% P^@ k^@iiai#?^@# #t ?# P?P^@ i^@ttPt #
1%11 ti a^@k P^@k ^@kt %^@%y?#a a#% @? kt ^@t%k? ^@PtttkL tkLa1 ^@iaay?p1% Pta tt ik?ty
k^@kpt%^@tktpkryyp^@?pP# %kt?ki? i @t^@k^@%#P} ?at}akP##Pa11%^@i% ^@?ia ia%##%tki %
}i%%%}} a ay^@%yt }%t ^@tU%a% t}yi^@ ^@ @t yt%? aP @% ^@??^@%? ^@??k#%
kk#%t?a: P}^@t :#^@#1t^@#: w^@P#%w:Pt t # t%aa%i@ak@@^@ka@^@a # y}^@# ^@? % tP i?
?tk ktPPt a tpprrpt? a^@ pP pt %p ? k? ^@^@ kP^@%%?tk a Pt^@#
tP? P kkP1L1tP a%? t1P%PPti^@?%ytk %#%%t?@?^@ty^@iyk%1#^@@^@1#t a t@P^@^@ P@^@1P^@%%#@P:^@%^@ t
1:#P#@LtL#@L L1 %%dt??^@L ^@iBt yTk%p ^@i
After one epoch (dataset seen once)
if testingValuesIntering() {
t.SetCaterCleen(time.SewsallSetrive(true)
if weq := nil {
t.Errorf("eshould: wont %v", touts anverals, prc.Strnared, error
}
t, err := ntr.Soare(cueper(err, err)
if err != nil {
t.Errorf("preveth dime il resetests:%d; want %#',", tr.test.Into
}
if err != nil {
return
}
if err == nel {
t.Errorf("LoconserrSathe foot %q::%q: %s;%want %d", d, err)
},
defarenContateFule(temt.Canses)
}
if err != nil {
return err
}
// Treters and restives of the sesconse stmpeletatareservet
// This no to the result digares wheckader. Constate bytes alleal
After two epochs
if !ok {
t.Errorf("%d: %v not %v", i, err)
}
if !ot.Close()
if enr != nil {
t.Fatal(err)
}
if !ers != nil {
t.Fatal(err)
}
if err != nil {
t.Fatal(err)
}
if err != nil {
t.Errorf("error %q: %s not %v", i, err)
}
return nil
}
if got := t.struct(); !ok {
t.Fatalf("Got %q: %q, %v, want %q", test, true
}
if !strings.Connig(t) {
t.Fatalf("Got %q: %q", want %q", t, err)
}
if !ot {
t.Errorf("%s < %v", x, y)
}
if !ok {
t.Errorf("%d <= %d", err)
}
if !stricgs(); !ot {
t.Errorf("!(%d <= %v", x, e)
}
}
if !ot != nil {
return ""
}
After many epochs
Learning to Represent Programs with Graphs
from, err := os.Open("a.txt")
if err != nil {
log.Fatal(err)
}
defer from.Close()
to, err := os.Open("b.txt")
if err != nil {
log.Fatal(err)
}
defer ???.Close()
io.Copy(to, from)
Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi
https://arxiv.org/abs/1711.00740
The VARMISUSE Task:
Given a program and a gap in it,
predict what variable is missing.
code2vec: Learning Distributed Representations of Code
Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav
https://arxiv.org/abs/1803.09473 | https://code2vec.org/
Much more research
github.com/src-d/awesome-machine-learning-on-source-code
Challenge #4
What can we build?
Predictable vs Predicted
~0
~0
~0
~0
~0
~0
~0
~0
~1
~0
A
G
o
PR
An attention model for code reviews.
Can you see the mistake?
Prediction vs Expectation
for i := 0; i < 10; i-- {
if i %2 == 0 {
fmt.Println("where's the mistake?")
}
}
Can you see the mistake?
Prediction vs Expectation
for i := 0; i < 10; i-- {
if i %2 == 0 {
fmt.Println("where's the mistake?")
}
}
Can you see the mistake?
VARMISUSE
from, err := os.Open("a.txt")
if err != nil {
log.Fatal(err)
}
defer from.Close()
to, err := os.Open("b.txt")
if err != nil {
log.Fatal(err)
}
defer from.Close()
io.Copy(to, from)
Can you see the mistake?
VARMISUSE
from, err := os.Open("a.txt")
if err != nil {
log.Fatal(err)
}
defer from.Close()
to, err := os.Open("b.txt")
if err != nil {
log.Fatal(err)
}
defer from.Close() ← s/from/to/
io.Copy(to, from)
Is this a good name?
func XXX(list []string, text string) bool {
for _, s := range list {
if s == text {
return true
}
}
return false
}
Suggestions:
● Contains
● Has
func XXX(list []string, text string) int {
for i, s := range list {
if s == text {
return i
}
}
return -1
}
Suggestions:
● Find
● Index
code2vec: Learning Distributed Representations of Code
Splitting millions of identifiers with Deep Learning
https://blog.sourced.tech/post/idsplit/
isthisCorrect? → is this correct? → isThisCorrect?
Demo time!
learning Go
code2vec.org
neural splitter
source: WOCinTech
Assisted code review! src-d/lookout
putting everything together
lookout
Coming up soon:
● Automated Style Guide Enforcing
● Bug Prediction
Coming … later:
● Automated Code Review
● Code Generation: from unit tests, specification, natural language
description.
● Natural Analysis: code description and conversational analysis.
● Education
And so much more
Will developers be replaced?
Developers will be empowered.
Want to know more?
● sourced.tech (pssh, we’re hiring)
● bit.ly/awesome-mloncode
● francesc@sourced.tech
● come say hi, I have stickers
Thanks
francesc

Mais conteúdo relacionado

Mais procurados

TDD in C - Recently Used List Kata
TDD in C - Recently Used List KataTDD in C - Recently Used List Kata
TDD in C - Recently Used List KataOlve Maudal
 
PyWPS Development restart
PyWPS Development restartPyWPS Development restart
PyWPS Development restartJachym Cepicky
 
Async await in C++
Async await in C++Async await in C++
Async await in C++cppfrug
 
Command line arguments that make you smile
Command line arguments that make you smileCommand line arguments that make you smile
Command line arguments that make you smileMartin Melin
 
Golang iran - tutorial go programming language - Preliminary
Golang iran - tutorial  go programming language - PreliminaryGolang iran - tutorial  go programming language - Preliminary
Golang iran - tutorial go programming language - Preliminarygo-lang
 
A peek on numerical programming in perl and python e christopher dyken 2005
A peek on numerical programming in perl and python  e christopher dyken  2005A peek on numerical programming in perl and python  e christopher dyken  2005
A peek on numerical programming in perl and python e christopher dyken 2005Jules Krdenas
 
Integrating R with C++: Rcpp, RInside and RProtoBuf
Integrating R with C++: Rcpp, RInside and RProtoBufIntegrating R with C++: Rcpp, RInside and RProtoBuf
Integrating R with C++: Rcpp, RInside and RProtoBufRomain Francois
 
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java scriptCodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java scriptCodiLime
 
Антон Бикинеев, Reflection in C++Next
Антон Бикинеев,  Reflection in C++NextАнтон Бикинеев,  Reflection in C++Next
Антон Бикинеев, Reflection in C++NextSergey Platonov
 
Fuzzing: The New Unit Testing
Fuzzing: The New Unit TestingFuzzing: The New Unit Testing
Fuzzing: The New Unit TestingDmitry Vyukov
 
Tensor comprehensions
Tensor comprehensionsTensor comprehensions
Tensor comprehensionsMr. Vengineer
 
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...Jerry Chou
 
Crange: Clang based tool to index and cross-reference C/C++ source code
Crange: Clang based tool to index and cross-reference C/C++ source code Crange: Clang based tool to index and cross-reference C/C++ source code
Crange: Clang based tool to index and cross-reference C/C++ source code Anurag Patel
 

Mais procurados (20)

Go Lang Tutorial
Go Lang TutorialGo Lang Tutorial
Go Lang Tutorial
 
Python to scala
Python to scalaPython to scala
Python to scala
 
TDD in C - Recently Used List Kata
TDD in C - Recently Used List KataTDD in C - Recently Used List Kata
TDD in C - Recently Used List Kata
 
PyWPS Development restart
PyWPS Development restartPyWPS Development restart
PyWPS Development restart
 
Async await in C++
Async await in C++Async await in C++
Async await in C++
 
Go. why it goes v2
Go. why it goes v2Go. why it goes v2
Go. why it goes v2
 
Command line arguments that make you smile
Command line arguments that make you smileCommand line arguments that make you smile
Command line arguments that make you smile
 
RAII and ScopeGuard
RAII and ScopeGuardRAII and ScopeGuard
RAII and ScopeGuard
 
Golang iran - tutorial go programming language - Preliminary
Golang iran - tutorial  go programming language - PreliminaryGolang iran - tutorial  go programming language - Preliminary
Golang iran - tutorial go programming language - Preliminary
 
A peek on numerical programming in perl and python e christopher dyken 2005
A peek on numerical programming in perl and python  e christopher dyken  2005A peek on numerical programming in perl and python  e christopher dyken  2005
A peek on numerical programming in perl and python e christopher dyken 2005
 
Golang dot-testing-lite
Golang dot-testing-liteGolang dot-testing-lite
Golang dot-testing-lite
 
Integrating R with C++: Rcpp, RInside and RProtoBuf
Integrating R with C++: Rcpp, RInside and RProtoBufIntegrating R with C++: Rcpp, RInside and RProtoBuf
Integrating R with C++: Rcpp, RInside and RProtoBuf
 
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java scriptCodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
 
Антон Бикинеев, Reflection in C++Next
Антон Бикинеев,  Reflection in C++NextАнтон Бикинеев,  Reflection in C++Next
Антон Бикинеев, Reflection in C++Next
 
TensorFlow XLA RPC
TensorFlow XLA RPCTensorFlow XLA RPC
TensorFlow XLA RPC
 
Fuzzing: The New Unit Testing
Fuzzing: The New Unit TestingFuzzing: The New Unit Testing
Fuzzing: The New Unit Testing
 
Tensor comprehensions
Tensor comprehensionsTensor comprehensions
Tensor comprehensions
 
Summary of C++17 features
Summary of C++17 featuresSummary of C++17 features
Summary of C++17 features
 
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
 
Crange: Clang based tool to index and cross-reference C/C++ source code
Crange: Clang based tool to index and cross-reference C/C++ source code Crange: Clang based tool to index and cross-reference C/C++ source code
Crange: Clang based tool to index and cross-reference C/C++ source code
 

Semelhante a Machine Learning on Code - SF meetup

Machine learning on Go Code
Machine learning on Go CodeMachine learning on Go Code
Machine learning on Go Codesource{d}
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout source{d}
 
100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects 100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects Andrey Karpov
 
Code Analysis-run time error prediction
Code Analysis-run time error predictionCode Analysis-run time error prediction
Code Analysis-run time error predictionNIKHIL NAWATHE
 
The operation principles of PVS-Studio static code analyzer
The operation principles of PVS-Studio static code analyzerThe operation principles of PVS-Studio static code analyzer
The operation principles of PVS-Studio static code analyzerAndrey Karpov
 
Go 1.10 Release Party - PDX Go
Go 1.10 Release Party - PDX GoGo 1.10 Release Party - PDX Go
Go 1.10 Release Party - PDX GoRodolfo Carvalho
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineeringjtdudley
 
.NET Foundation, Future of .NET and C#
.NET Foundation, Future of .NET and C#.NET Foundation, Future of .NET and C#
.NET Foundation, Future of .NET and C#Bertrand Le Roy
 
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres..."The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...Edge AI and Vision Alliance
 
Gae icc fall2011
Gae icc fall2011Gae icc fall2011
Gae icc fall2011Juan Gomez
 
Making fitting in RooFit faster
Making fitting in RooFit fasterMaking fitting in RooFit faster
Making fitting in RooFit fasterPatrick Bos
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonRalf Gommers
 
Kamil witecki asynchronous, yet readable, code
Kamil witecki asynchronous, yet readable, codeKamil witecki asynchronous, yet readable, code
Kamil witecki asynchronous, yet readable, codeKamil Witecki
 
App secforum2014 andrivet-cplusplus11-metaprogramming_applied_to_software_obf...
App secforum2014 andrivet-cplusplus11-metaprogramming_applied_to_software_obf...App secforum2014 andrivet-cplusplus11-metaprogramming_applied_to_software_obf...
App secforum2014 andrivet-cplusplus11-metaprogramming_applied_to_software_obf...Cyber Security Alliance
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnelukdpe
 
PPT on Python - illustrating Python for BBA, B.Tech
PPT on Python - illustrating Python for BBA, B.TechPPT on Python - illustrating Python for BBA, B.Tech
PPT on Python - illustrating Python for BBA, B.Techssuser2678ab
 

Semelhante a Machine Learning on Code - SF meetup (20)

Machine learning on Go Code
Machine learning on Go CodeMachine learning on Go Code
Machine learning on Go Code
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects 100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects
 
Code metrics in PHP
Code metrics in PHPCode metrics in PHP
Code metrics in PHP
 
Code Analysis-run time error prediction
Code Analysis-run time error predictionCode Analysis-run time error prediction
Code Analysis-run time error prediction
 
The operation principles of PVS-Studio static code analyzer
The operation principles of PVS-Studio static code analyzerThe operation principles of PVS-Studio static code analyzer
The operation principles of PVS-Studio static code analyzer
 
C Tutorials
C TutorialsC Tutorials
C Tutorials
 
Go 1.10 Release Party - PDX Go
Go 1.10 Release Party - PDX GoGo 1.10 Release Party - PDX Go
Go 1.10 Release Party - PDX Go
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineering
 
.NET Foundation, Future of .NET and C#
.NET Foundation, Future of .NET and C#.NET Foundation, Future of .NET and C#
.NET Foundation, Future of .NET and C#
 
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres..."The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
 
Gae icc fall2011
Gae icc fall2011Gae icc fall2011
Gae icc fall2011
 
Making fitting in RooFit faster
Making fitting in RooFit fasterMaking fitting in RooFit faster
Making fitting in RooFit faster
 
Modern C++
Modern C++Modern C++
Modern C++
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
 
Kamil witecki asynchronous, yet readable, code
Kamil witecki asynchronous, yet readable, codeKamil witecki asynchronous, yet readable, code
Kamil witecki asynchronous, yet readable, code
 
Gcrc talk
Gcrc talkGcrc talk
Gcrc talk
 
App secforum2014 andrivet-cplusplus11-metaprogramming_applied_to_software_obf...
App secforum2014 andrivet-cplusplus11-metaprogramming_applied_to_software_obf...App secforum2014 andrivet-cplusplus11-metaprogramming_applied_to_software_obf...
App secforum2014 andrivet-cplusplus11-metaprogramming_applied_to_software_obf...
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnel
 
PPT on Python - illustrating Python for BBA, B.Tech
PPT on Python - illustrating Python for BBA, B.TechPPT on Python - illustrating Python for BBA, B.Tech
PPT on Python - illustrating Python for BBA, B.Tech
 

Mais de source{d}

Overton, Apple Flavored ML
Overton, Apple Flavored MLOverton, Apple Flavored ML
Overton, Apple Flavored MLsource{d}
 
Unlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticsUnlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticssource{d}
 
What's new in the latest source{d} releases!
What's new in the latest source{d} releases!What's new in the latest source{d} releases!
What's new in the latest source{d} releases!source{d}
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...source{d}
 
Gitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriesGitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriessource{d}
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of codesource{d}
 
Assisted code review with source{d} lookout
Assisted code review with source{d} lookoutAssisted code review with source{d} lookout
Assisted code review with source{d} lookoutsource{d}
 
Inextricably linked reproducibility and productivity in data science and ai ...
Inextricably linked reproducibility and productivity in data science and ai  ...Inextricably linked reproducibility and productivity in data science and ai  ...
Inextricably linked reproducibility and productivity in data science and ai ...source{d}
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as datasource{d}
 
Introduction to the source{d} Stack
Introduction to the source{d} Stack Introduction to the source{d} Stack
Introduction to the source{d} Stack source{d}
 
source{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d}
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performancesource{d}
 
Machine learning on source code
Machine learning on source codeMachine learning on source code
Machine learning on source codesource{d}
 

Mais de source{d} (13)

Overton, Apple Flavored ML
Overton, Apple Flavored MLOverton, Apple Flavored ML
Overton, Apple Flavored ML
 
Unlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticsUnlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analytics
 
What's new in the latest source{d} releases!
What's new in the latest source{d} releases!What's new in the latest source{d} releases!
What's new in the latest source{d} releases!
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
Gitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriesGitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositories
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of code
 
Assisted code review with source{d} lookout
Assisted code review with source{d} lookoutAssisted code review with source{d} lookout
Assisted code review with source{d} lookout
 
Inextricably linked reproducibility and productivity in data science and ai ...
Inextricably linked reproducibility and productivity in data science and ai  ...Inextricably linked reproducibility and productivity in data science and ai  ...
Inextricably linked reproducibility and productivity in data science and ai ...
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as data
 
Introduction to the source{d} Stack
Introduction to the source{d} Stack Introduction to the source{d} Stack
Introduction to the source{d} Stack
 
source{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQL
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performance
 
Machine learning on source code
Machine learning on source codeMachine learning on source code
Machine learning on source code
 

Último

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 

Último (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 

Machine Learning on Code - SF meetup

  • 1. “Software is eating the world”
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13. Machine Learning on Source Code Francesc Campoy
  • 14. VP of Product & DevRel source{d} Machine Learning for Large Scale Code Analysis @francesc | #MLonCode Francesc Campoy
  • 15. Agenda ● Machine Learning on Source Code ● Research ● Use Cases ● The Future
  • 16. Machine Learning on Source Code
  • 17. Machine Learning on Source Code Field of Machine Learning where the input data is source code. MLonCode
  • 18. Machine Learning on Source Code Requires: ● Lots of data ● Really, lots and lots of data ● Fancy ML Algorithms ● A little bit of luck Related Fields: ● Data Mining ● Natural Language Processing ● Graph Based Machine Learning
  • 20. The datasets of ML on Code ● GH Archive: https://www.gharchive.org ● Public Git Archive https://pga.sourced.tech
  • 25. '112', '97', '99', '107', '97', '103', '101', '32', '109', '97', '105', '110', '10', '10', '105', '109', '112', '111', '114', '116', '32', '40', '10', '9', '34', '102', '109', '116', '34', '10', '41', '10', '10', '102', '117', '110', '99', '32', '109', '97', '105', '110', '40', '41', '32', '123', '10', '9', '102', '109', '116', '46', '80', '114', '105', '110', '116', '108', '110', '40', '34', '72', '101', '108', '108', '111', '44', '32', '112', '108', '97', '121', '103', '114', '111', '117', '110', '100', '34', '41', '10', '125', '10' package main import “fmt” func main() { fmt.Println(“Hello, Copenhagen”) } What is Source Code
  • 26. package package IDENT main ; import import STRING "fmt" ; func func IDENT main ( ) What is Source Code { IDENT fmt . IDENT Println ( STRING "Hello, Denver" ) ; } ; package main import “fmt” func main() { fmt.Println(“Hello, Copenhagen”) }
  • 27. What is Source Code package main import “fmt” func main() { fmt.Println(“Hello, Copenhagen”) }
  • 28. What is Source Code package main import “fmt” func main() { fmt.Println(“Hello, Copenhagen”) }
  • 29. What is Source Code ● A sequence of bytes ● A sequence of tokens ● An abstract syntax tree ● A graph (e.g. Control Flow Graph)
  • 30. Tasks ● Language Classification ● File Parsing ● Token Extraction ● History Analysis ● Reference Resolution Tools ● enry ● babelfish ● libuast & XPath selectors ● go-git ● kythe.io Analyzing Code
  • 34. Neural Networks Basically fancy linear regression machines Given an input of a constant length, they predict an output of constant length. Example: MNIST: Input: images with 28x28 px Output: a digit from zero to 9
  • 36. MLonCode: Predict the next token for i := 0 ; i < 10 ; i ++
  • 37. Recurrent Neural Networks Can process sequences of variable length. Uses its own output as a new input. Example: Natural Language Translation: Input: “bonjour, les gauffres” Output: “hi, waffles”
  • 38. MLonCode: Code Generation charRNN: Given n characters, predict the next one Trained over the Go standard library Achieved 61% accuracy on predictions.
  • 39. Before training r t, kp0t@pp kpktp 0p000 xS%%%?ttk?^@p0rk^@%ppp@ta#p^@ #pp}}%p^@?P%^@@k#%@P}}ta S?@}^@t%@% %%aNt i ^@SSt@@ pyikkp?%y ?t k L P0L t% ^@i%yy ^@p i? %L%LL tyLP?a ?L@Ly?tkk^@ @^@ykk^@i#P^@iL@??@%1tt%^@tPTta L ^@LL%% %i1::yyy^@^@t tP @?@a#Patt 1^@@ k^@k ? yt%L1^@tP%k1?k? % ^@i ^@ta1?1taktt1P?a^@^@Pkt?#^@t^@##1?## #^@t11#:^@%??t%1^@a 1?a at1P ^@^@Pt #%^@^@ ^@aaak^@#a#?P1Pa^@tt%?^@kt?#akP ?#^@i%%aa ^@1%t tt?a?% t^@k^@^@k^@ a : ^@1 P# % ^@^@#t% :% kkP ^@#?P: t^@a ?%##?kkPaP^@ #a k?t?? ai?i%PPk taP% P^@ k^@iiai#?^@# #t ?# P?P^@ i^@ttPt # 1%11 ti a^@k P^@k ^@kt %^@%y?#a a#% @? kt ^@t%k? ^@PtttkL tkLa1 ^@iaay?p1% Pta tt ik?ty k^@kpt%^@tktpkryyp^@?pP# %kt?ki? i @t^@k^@%#P} ?at}akP##Pa11%^@i% ^@?ia ia%##%tki % }i%%%}} a ay^@%yt }%t ^@tU%a% t}yi^@ ^@ @t yt%? aP @% ^@??^@%? ^@??k#% kk#%t?a: P}^@t :#^@#1t^@#: w^@P#%w:Pt t # t%aa%i@ak@@^@ka@^@a # y}^@# ^@? % tP i? ?tk ktPPt a tpprrpt? a^@ pP pt %p ? k? ^@^@ kP^@%%?tk a Pt^@# tP? P kkP1L1tP a%? t1P%PPti^@?%ytk %#%%t?@?^@ty^@iyk%1#^@@^@1#t a t@P^@^@ P@^@1P^@%%#@P:^@%^@ t 1:#P#@LtL#@L L1 %%dt??^@L ^@iBt yTk%p ^@i
  • 40. After one epoch (dataset seen once) if testingValuesIntering() { t.SetCaterCleen(time.SewsallSetrive(true) if weq := nil { t.Errorf("eshould: wont %v", touts anverals, prc.Strnared, error } t, err := ntr.Soare(cueper(err, err) if err != nil { t.Errorf("preveth dime il resetests:%d; want %#',", tr.test.Into } if err != nil { return } if err == nel { t.Errorf("LoconserrSathe foot %q::%q: %s;%want %d", d, err) }, defarenContateFule(temt.Canses) } if err != nil { return err } // Treters and restives of the sesconse stmpeletatareservet // This no to the result digares wheckader. Constate bytes alleal
  • 41. After two epochs if !ok { t.Errorf("%d: %v not %v", i, err) } if !ot.Close() if enr != nil { t.Fatal(err) } if !ers != nil { t.Fatal(err) } if err != nil { t.Fatal(err) } if err != nil { t.Errorf("error %q: %s not %v", i, err) } return nil }
  • 42. if got := t.struct(); !ok { t.Fatalf("Got %q: %q, %v, want %q", test, true } if !strings.Connig(t) { t.Fatalf("Got %q: %q", want %q", t, err) } if !ot { t.Errorf("%s < %v", x, y) } if !ok { t.Errorf("%d <= %d", err) } if !stricgs(); !ot { t.Errorf("!(%d <= %v", x, e) } } if !ot != nil { return "" } After many epochs
  • 43. Learning to Represent Programs with Graphs from, err := os.Open("a.txt") if err != nil { log.Fatal(err) } defer from.Close() to, err := os.Open("b.txt") if err != nil { log.Fatal(err) } defer ???.Close() io.Copy(to, from) Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi https://arxiv.org/abs/1711.00740 The VARMISUSE Task: Given a program and a gap in it, predict what variable is missing.
  • 44. code2vec: Learning Distributed Representations of Code Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav https://arxiv.org/abs/1803.09473 | https://code2vec.org/
  • 48. A G o PR An attention model for code reviews.
  • 49.
  • 50.
  • 51. Can you see the mistake? Prediction vs Expectation for i := 0; i < 10; i-- { if i %2 == 0 { fmt.Println("where's the mistake?") } }
  • 52. Can you see the mistake? Prediction vs Expectation for i := 0; i < 10; i-- { if i %2 == 0 { fmt.Println("where's the mistake?") } }
  • 53. Can you see the mistake? VARMISUSE from, err := os.Open("a.txt") if err != nil { log.Fatal(err) } defer from.Close() to, err := os.Open("b.txt") if err != nil { log.Fatal(err) } defer from.Close() io.Copy(to, from)
  • 54. Can you see the mistake? VARMISUSE from, err := os.Open("a.txt") if err != nil { log.Fatal(err) } defer from.Close() to, err := os.Open("b.txt") if err != nil { log.Fatal(err) } defer from.Close() ← s/from/to/ io.Copy(to, from)
  • 55. Is this a good name? func XXX(list []string, text string) bool { for _, s := range list { if s == text { return true } } return false } Suggestions: ● Contains ● Has func XXX(list []string, text string) int { for i, s := range list { if s == text { return i } } return -1 } Suggestions: ● Find ● Index code2vec: Learning Distributed Representations of Code
  • 56. Splitting millions of identifiers with Deep Learning https://blog.sourced.tech/post/idsplit/ isthisCorrect? → is this correct? → isThisCorrect?
  • 58. source: WOCinTech Assisted code review! src-d/lookout
  • 60. Coming up soon: ● Automated Style Guide Enforcing ● Bug Prediction Coming … later: ● Automated Code Review ● Code Generation: from unit tests, specification, natural language description. ● Natural Analysis: code description and conversational analysis. ● Education And so much more
  • 61. Will developers be replaced?
  • 62. Developers will be empowered.
  • 63. Want to know more? ● sourced.tech (pssh, we’re hiring) ● bit.ly/awesome-mloncode ● francesc@sourced.tech ● come say hi, I have stickers