Naive application of Machine Learning to Software Development
or... what developers don't tell :)
What and why
42 Coffee Cups:
  completely distributed development
  team

Hard facts about how software is done


LOTS OF THEM
What and why

Facts    ???   Profit
What and why
???
Toy problem:
 get ticket and predict how
 long it will take to close it
Bonus: learn scikit-learn :)
Install scikit-learn
● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++
● pip install -U scikit-learn
Data: closed tickets
import urllib2

# Trac query: export the listed columns as CSV, ordered by priority
url = ('https://code.djangoproject.com/query?format=csv'
       '&col=id&col=time&col=changetime&col=reporter'
       '&col=summary&col=status&col=owner&col=type'
       '&col=component&order=priority')

tickets = urllib2.urlopen(url).read()

# keep a dated local copy of the export
with open('2012-10-09.csv', 'w') as f:
    f.write(tickets)
Data: closed tickets
id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
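A row of this export can be read with the standard csv module. A minimal sketch in Python 3, with the header and the two sample rows above as inline data (the in-memory io.StringIO stands in for the downloaded file, which is an assumption made for illustration):

```python
import csv
import io

# Inline copy of the header and the two sample rows shown above.
SAMPLE = """id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
"""

# DictReader consumes the header row and keys every field by column name.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

for row in rows:
    print(row["id"], row["reporter"], row["component"])
```

Reading by column name avoids depending on the column order of the query URL.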
Data: closed date and description
import urllib2
from bs4 import BeautifulSoup

def get_data(ticket):
  # fetch and parse the HTML page of a single ticket
  url = 'https://code.djangoproject.com/ticket/%s' % ticket
  ticket_html = urllib2.urlopen(url)
  bs = BeautifulSoup(ticket_html)
Data: closed date and description
# get closing date
d = bs.find_all('div', 'date')[0]
p = list(d.children)[3]
href = p.find('a')['href']
# the whole href is parsed as a query string, hence the odd key
close_time_str = urlparse.parse_qs(href)['/timeline?from'][0]
# drop the trailing UTC offset (6 characters) before parsing
close_time = datetime.datetime.strptime(
    close_time_str[:-6], '%Y-%m-%dT%H:%M:%S')
# ... more black magic, see code
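The date extraction above can be tried in isolation. A minimal sketch in Python 3, assuming an href shaped like the timeline link (the href value below is hypothetical, not copied from a real ticket page). It splits off the query string first, so the key is plain 'from' rather than '/timeline?from':

```python
import datetime
from urllib.parse import parse_qs, urlsplit

# Hypothetical href, shaped like the timeline link the slide parses.
href = "/timeline?from=2012-05-20T08%3A12%3A37-07%3A00&precision=second"

# Parse only the query part of the link; values come back percent-decoded.
query = parse_qs(urlsplit(href).query)
close_time_str = query["from"][0]

# Drop the trailing UTC offset ('-07:00', 6 characters), as the slide does.
close_time = datetime.datetime.strptime(close_time_str[:-6], "%Y-%m-%dT%H:%M:%S")
print(close_time)
```

Cutting off the offset keeps the local wall-clock time and ignores the time zone, which is the same simplification the slide makes.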
Data: closed date and description
def get_data(ticket):
  [...]

  # get description and return
  de = bs.find_all('div', 'description')[0]
  return close_time, de.text
Data: closed date and description
tickets_file = csv.reader(open('2012-10-09.csv'))
output = csv.writer(open('2012-10-09.close.csv', 'w'))

next(tickets_file)  # skip the CSV header row
for (id, time, changetime, reporter, summary,
     status, owner, type, component) in tickets_file:
  closetime, descr = get_data(id)
  row = [id, time, changetime, closetime, reporter,
         summary, status, owner, type, component,
         descr.encode('utf-8')]
  output.writerow(row)
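Scoring: Train/Test set split
cross_validation.train_test_split

# moved to sklearn.model_selection in later scikit-learn releases
from sklearn import cross_validation

(tickets_train, tickets_test,
 times_train, times_test) = cross_validation.train_test_split(
    tickets, times,
    test_size=0.2,
    random_state=0)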
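What the split does can be sketched without scikit-learn. This is an illustration in plain Python 3, not the library's implementation: split_train_test is a hypothetical helper, the data is made up, and the shuffle order will differ from what train_test_split produces for the same seed.

```python
import random

def split_train_test(features, targets, test_size=0.2, seed=0):
    """Shuffle indices once, then hold out the last test_size share as the test set."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    cut = len(indices) - n_test
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([features[i] for i in train_idx], [features[i] for i in test_idx],
            [targets[i] for i in train_idx], [targets[i] for i in test_idx])

# Made-up data: ten one-feature tickets with matching targets.
tickets = [[float(n)] for n in range(10)]
times = [n * 100.0 for n in range(10)]

tickets_train, tickets_test, times_train, times_test = split_train_test(tickets, times)
print(len(tickets_train), len(tickets_test))
```

Features and targets are indexed with the same shuffled indices, so every ticket stays paired with its own time. A fixed seed makes the split repeatable between runs.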
Scoring: Mean squared error
sklearn.metrics.mean_squared_error

train_error = metrics.mean_squared_error(
    times_train, times_train_predict)

test_error = metrics.mean_squared_error(
    times_test, times_test_predict)
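The metric itself is short enough to write out. A minimal sketch in plain Python 3 with made-up numbers; the errors reported later in the talk are the square root of this value, converted from seconds to days:

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

SECONDS_PER_DAY = 24 * 3600

# Made-up targets and predictions, in seconds.
times_true = [1 * SECONDS_PER_DAY, 10 * SECONDS_PER_DAY, 30 * SECONDS_PER_DAY]
times_pred = [3 * SECONDS_PER_DAY, 10 * SECONDS_PER_DAY, 26 * SECONDS_PER_DAY]

mse = mean_squared_error(times_true, times_pred)  # in seconds squared
rmse_days = math.sqrt(mse) / SECONDS_PER_DAY      # back to days
print("RMSE: %.2f days" % rmse_days)
```

Taking the square root returns the error to the unit of the target, which is why the later slides can quote it in days.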
Fun #1: just ticket number?
for number, created, ... in tickets_file:
    row = []
    created = dt.datetime.strptime(created,
                    time_format)
    closetime = dt.datetime.strptime(closetime,
                    time_format)
    time_to_fix = closetime - created

    row.append(float(number))
    tickets.append(row)
    times.append(total_seconds(time_to_fix))
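The target computation can be checked on one ticket. A sketch in Python 3 under two assumptions: time_format matches the timestamps in the CSV sample, and total_seconds is a small helper like the one below (the slide does not show either). Ticket 2's changetime stands in for its close time here.

```python
import datetime as dt

# Assumed format; it matches the timestamps in the CSV sample.
time_format = "%Y-%m-%d %H:%M:%S"

def total_seconds(delta):
    """Length of a timedelta in whole seconds (microseconds ignored)."""
    return delta.days * 24 * 3600 + delta.seconds

# Ticket 2 from the sample; its changetime stands in for the close time.
created = dt.datetime.strptime("2005-07-13 12:04:45", time_format)
closetime = dt.datetime.strptime("2007-07-03 16:04:18", time_format)

time_to_fix = closetime - created
print(total_seconds(time_to_fix) / (24 * 3600.0), "days")
```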
Fun #1: just ticket number?
import numpy as np
from sklearn import preprocessing

# Scaler is named StandardScaler in later scikit-learn releases
scaler = preprocessing.Scaler().fit(
    np.array(tickets))
tickets = scaler.transform(tickets)
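Scaling here means shifting each feature to zero mean and dividing by its standard deviation. A minimal sketch for a single column in plain Python 3, with made-up ticket numbers; fit_scaler and transform are illustrative names, not library calls:

```python
import math

def fit_scaler(column):
    """Return the mean and population standard deviation of one feature column."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, math.sqrt(variance)

def transform(column, mean, std):
    """Shift to zero mean and scale to unit variance."""
    return [(x - mean) / std for x in column]

# Made-up ticket numbers.
numbers = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, std = fit_scaler(numbers)
scaled = transform(numbers, mean, std)
print(scaled)
```

Without this step a feature such as the raw ticket number, which runs into the thousands, would dominate any distance-based kernel.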
Fun #1: just ticket number?
from sklearn.svm import SVR

clf = SVR()
clf.fit(tickets_train, times_train)

times_train_predict = clf.predict(tickets_train)
times_test_predict = clf.predict(tickets_test)
Fun #1: just ticket number?
train_error = metrics.mean_squared_error(
    times_train, times_train_predict)

test_error = metrics.mean_squared_error(
    times_test, times_test_predict)

# root mean squared error, converted from seconds to days
print 'Train error: %.1f\n Test error: %.2f' % (
    math.sqrt(train_error)/(24*3600),
    math.sqrt(test_error)/(24*3600))
Fun #1: just ticket number?

Train error: 363.4
Test error: 361.41
Finding best parameters
SVM C controls regularization:

larger C leads to
● closer fit to the train data
● with the risk of overfitting
Finding best parameters
Cs = np.logspace(-1, 10, 10)
for c in Cs:
    learn(c)
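The grid is spaced evenly in the exponent, not in the value. A plain Python 3 sketch of what np.logspace(-1, 10, 10) computes; logspace below is a local helper written for illustration:

```python
def logspace(start, stop, num):
    """num values from 10**start to 10**stop, evenly spaced in the exponent."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

Cs = logspace(-1, 10, 10)
for c in Cs:
    print("%.1f" % c)
```

A logarithmic grid covers eleven orders of magnitude with only ten fits, which suits a parameter like C whose useful scale is not known in advance.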
Finding best parameters
0.1: Train error: 363.4 Test error: 361.41
1.71: Train error: 363.4 Test error: 361.41
27.8: Train error: 363.4 Test error: 361.39
464.2: Train error: 363.2 Test error: 361.17
7742.6: Train error: 362.5 Test error: 360.41
129155.0: Train error: 362.1 Test error: 360.00
2154434.7: Train error: 362.0 Test error: 359.82
35938136.6: Train error: 361.7 Test error: 359.60
599484250.3: Train error: 361.5 Test error: 359.36
10000000000.0: Train error: 361.1 Test error: 358.91
Finding best parameters
sklearn.grid_search.GridSearchCV
     bonus: it can run in parallel

clf = GridSearchCV(estimator=SVR(),
     param_grid=dict(C=np.logspace(-1,10,10)),
     n_jobs=-1)
clf.fit(tickets_train, times_train)

  Train error: 361.1 Test error: 358.91
  Best C: 1.0e+10
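The idea behind a cross-validated grid search fits in a few lines. The sketch below is plain Python 3 with a deliberately tiny stand-in model (a constant prediction shrunk toward zero by lam) instead of SVR; the data, the grid and every function name are made up for illustration, and the fold layout is not the one scikit-learn uses.

```python
def fit(ys, lam):
    """Toy model: a constant prediction, shrunk toward zero by lam."""
    return sum(ys) / (len(ys) + lam)

def mse(ys, prediction):
    """Mean squared error of a constant prediction."""
    return sum((y - prediction) ** 2 for y in ys) / len(ys)

def cv_score(ys, lam, k=3):
    """Mean validation error over k folds (every k-th sample held out)."""
    errors = []
    for fold in range(k):
        held_out = ys[fold::k]
        kept = [y for i, y in enumerate(ys) if i % k != fold]
        errors.append(mse(held_out, fit(kept, lam)))
    return sum(errors) / k

def grid_search(ys, grid):
    """Try every candidate and keep the one with the lowest validation error."""
    return min(grid, key=lambda lam: cv_score(ys, lam))

# Made-up targets and a made-up grid for the toy shrinkage parameter.
times = [12.0, 15.0, 9.0, 30.0, 11.0, 14.0, 10.0, 13.0, 16.0]
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best = grid_search(times, grid)
print("Best lam:", best)
```

Each candidate is scored only on samples it was not fitted on, so the chosen value is picked for generalization and not for fit to the training data. The candidates are independent of each other, which is what makes running them in parallel possible.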
Fun #2: creation date?

 row = []
 row.append(float(number))

 row.append(float(time.mktime(
     created.timetuple())))

 tickets.append(row)
Fun #2: creation date?

Train error: 360.6 Test error: 358.39
Best C: 1.0e+10
String vectorizer and Tfidf transform

from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer)
String vectorizer and Tfidf transform
reporters = []

for number, ... in tickets_file:
    [...]
    reporters.append(reporter)
String vectorizer and Tfidf transform
CountVectorizer().fit_transform(reporters) ->

  TfidfTransformer().fit_transform( … ) ->

     hstack((tickets, …))


note: TF-IDF matrix is sparse!
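For a column like reporter, where each value is a single token, the pipeline above amounts to a weighted one-hot encoding. A plain Python 3 sketch with made-up reporters; the weighting is a textbook tf-idf variant chosen for illustration and may differ from scikit-learn's exact formula and normalization.

```python
import math
from collections import Counter

# Made-up reporter column; each reporter name is a one-token "document".
reporters = ["adrian", "anonymous", "anonymous", "jacob"]

vocabulary = sorted(set(reporters))  # column order of the matrix
doc_freq = Counter(reporters)        # number of documents containing each token
n_docs = len(reporters)

def tfidf_row(reporter):
    """Sparse row as {column index: weight}; zero entries are simply absent."""
    column = vocabulary.index(reporter)
    idf = math.log(n_docs / doc_freq[reporter]) + 1.0  # rarer token, larger weight
    return {column: 1.0 * idf}                         # the term count is 1

matrix = [tfidf_row(r) for r in reporters]
print(matrix)
```

Every row stores one value out of len(vocabulary) columns. With thousands of distinct reporters almost all entries are zero, which is why the result is kept sparse.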
String vectorizer and Tfidf transform
import scipy.sparse as sp

tickets = sp.hstack((
  tickets,
  TfidfTransformer().fit_transform(
     CountVectorizer().fit_transform(reporters))))

# remember to re-scale!
scaler = preprocessing.Scaler(with_mean=False
    ).fit(tickets)
tickets = scaler.transform(tickets)
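The with_mean=False flag matters because centering would turn every zero into a non-zero value. A small plain Python 3 sketch with one made-up sparse column, stored as a dict of its non-zero entries:

```python
# One made-up sparse column, stored as {row index: value}; absent rows are zero.
n_rows = 6
column = {1: 2.0, 4: 4.0}

dense = [column.get(i, 0.0) for i in range(n_rows)]
mean = sum(dense) / n_rows
std = (sum((x - mean) ** 2 for x in dense) / n_rows) ** 0.5

# Scaling only: zeros stay zero, so the sparse storage keeps its size.
scaled_only = {i: v / std for i, v in column.items()}

# Centering as well: every former zero becomes -mean/std, so all rows are stored.
centered = {i: (x - mean) / std for i, x in enumerate(dense)}

print(len(scaled_only), "stored values without centering")
print(len(centered), "stored values with centering")
```

On a real TF-IDF matrix the centered version would be fully dense, so only the division by the standard deviation is applied.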
Fun #3: reporter

Train error: 338.7 Test error: 353.38
Best C: 1.8e+07
Fun #4: subject
subjects = []

for number, created, ... in tickets_file:
    [...]
    subjects.append(summary)
    [...]

# word n-grams of length 1 to 3 from the ticket summary
tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
      CountVectorizer(ngram_range=(1,3)
      ).fit_transform(subjects))))
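An n-gram range of (1, 3) means every run of one, two or three consecutive words becomes its own feature. A plain Python 3 sketch on the summary of ticket 1; word_ngrams is an illustrative helper, and splitting on whitespace is a simplification of CountVectorizer's tokenizer.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """All runs of min_n..max_n consecutive lowercase words in text."""
    words = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

# Summary of ticket 1 from the CSV sample.
summary = "Create architecture for anonymous sessions"
for gram in word_ngrams(summary):
    print(gram)
```

Five words already give twelve features. Over thousands of summaries the number of columns grows quickly, which gives the model room to memorize individual tickets.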
Fun #4: subject

Train error: 21.0 Test error: 331.79
Best C: 1.0e+10
Different SVM kernels
def learn(kernel='rbf', param_grid=None,
          verbose=False):
    [...]
    clf = GridSearchCV(
        estimator=SVR(kernel=kernel,
                      verbose=verbose),
        param_grid=param_grid,
        n_jobs=-1)
    [...]
Different SVM kernels
RBF
Train error: 21.0 Test error: 331.79
Best C: 1.0e+10

Linear
Train error: 343.1 Test error: 355.56
Best C: 1.0e+02
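The two kernels compared here measure similarity between tickets in different ways. A plain Python 3 sketch of the standard definitions, on made-up vectors; the gamma value is an arbitrary choice for illustration, not the setting used in the talk.

```python
import math

def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * squared Euclidean distance); gamma here is arbitrary."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two made-up, already scaled feature vectors.
a = [0.0, 1.0]
b = [1.0, 1.0]

print("linear:", linear_kernel(a, b))
print("rbf:   ", rbf_kernel(a, b))
```

The RBF kernel depends only on the distance between two tickets and falls off quickly as they move apart. Combined with a very large C, that locality is one plausible reason for the much lower train error of the RBF model.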
Fun #5: account for the Component
components = []

for number, ..., component, ... in tickets_file:
    [...]
    components.append(component)
    [...]

tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
      CountVectorizer().fit_transform(components))))
Fun #5: account for the Component
RBF
Train error: 18.9 Test error: 327.79
Best C: 1.0e+10

Linear
Train error: 342.2 Test error: 354.89
Best C: 1.0e+02
Fun #6: ticket Description
descriptions = []

for number, ..., description in tickets_file:
    [...]
    descriptions.append(description)
    [...]

tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
      CountVectorizer(ngram_range=(1,3)
      ).fit_transform(descriptions))))
Fun #6: ticket Description
RBF
Train error: 10.8 Test error: 328.44
Best C: 1.0e+10

Linear
Train error: 14.0 Test error: 331.52
Best C: 3.2e+03
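Putting the reported numbers side by side makes the pattern easier to see. The sketch below only does arithmetic on the errors quoted on the slides (in days); the keys name the feature most recently added to the model.

```python
# (last feature added, kernel) -> (train error, test error), in days, from the slides.
results = {
    ("+subject", "RBF"): (21.0, 331.79),
    ("+subject", "Linear"): (343.1, 355.56),
    ("+component", "RBF"): (18.9, 327.79),
    ("+component", "Linear"): (342.2, 354.89),
    ("+description", "RBF"): (10.8, 328.44),
    ("+description", "Linear"): (14.0, 331.52),
}

for (features, kernel), (train, test) in results.items():
    gap = test - train  # how much worse the model does on unseen tickets
    print("%-13s %-7s train %6.1f  test %7.2f  gap %7.2f"
          % (features, kernel, train, test, gap))
```

The train error drops by an order of magnitude once text features are added, while the test error stays around 330 days. The models memorize the training tickets without learning anything that transfers to new ones.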
Conclusions
● All steps of a simple machine learning algorithm

● scikit-learn

● data explicitly available in tickets is NOT
  ENOUGH to predict closing date
Developers,
what are you hiding?
         :)
Questions?
Source code and dataset available at

https://github.com/42/django-trac-learning.git

Contacts:
● @akhavr
● http://42coffeecups.com/

  • 5. What and why 42 Coffee Cups: completely distributed development team Hard facts about how software is done LOTS OF THEM
  • 8. What and why Facts ??? Profit
  • 9. What and why ??? Toy problem: get ticket and predict how long it will take to close it
  • 10. What and why ??? Toy problem: get ticket and predict how long it will take to close it Bonus: learn scikit-learn :)
  • 11. Install scikit-learn ● sudo apt-get install python-dev
  • 12. Install scikit-learn ● sudo apt-get install python-dev python-numpy python-numpy-dev
  • 13. Install scikit-learn ● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy
  • 14. Install scikit-learn ● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++
  • 15. Install scikit-learn ● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++ ● pip install -U scikit-learn
  • 16. Data: closed tickets import urllib2 url = 'https://code.djangoproject.com/query?format=csv' + '&col=id&col=time&col=changetime&col=reporter' + '&col=summary&col=status&col=owner&col=type' + '&col=component&order=priority' tickets = urllib2.urlopen(url).read() open('2012-10-09.csv','w').write(tickets)
  • 17. Data: closed tickets id,time,changetime,reporter,summary,status,owner, type,component 1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian, Create architecture for anonymous sessions,closed, jacob,enhancement,Core (Other) 2,2005-07-13 12:04:45,2007-07-03 16:04:18, anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob, defect,contrib.admin
  • 18. Data: closed date and description def get_data(ticket): url = 'https://code.djangoproject.com/ticket/%s' % ticket ticket_html = urllib2.urlopen(url) bs = BeautifulSoup(ticket_html)
  • 19. Data: closed date and description # get closing date d = bs.find_all('div','date')[0] p = list(d.children)[3] href = p.find('a')['href'] close_time_str = urlparse.parse_qs(href)['/timeline?from'][0] close_time = datetime.datetime.strptime(close_time_str[:-6], '%Y-%m-%dT%H:%M:%S') # ... more black magic, see code
  • 20. Data: closed date and description def get_data(ticket): [...] # get description and return de = bs.find_all('div', 'description')[0] return close_time, de.text
  • 21. Data: closed date and description tickets_file = csv.reader(open('2012-10-09.csv')) output = csv.writer(open('2012-10-09.close.csv','w')) for id, time, changetime, reporter, summary, status, owner, type, component in tickets_file: closetime, descr = get_data(id) row = [id, time, changetime, closetime, reporter, summary, status, owner, type, component, descr.encode('utf-8')] output.writerow(row)
  • 22. Scoring: Train/Test set split cross_validation.train_test_split (tickets_train, tickets_test, times_train, times_test) = cross_validation.train_test_split( tickets, times, test_size=0.2, random_state=0)
  • 23. Scoring: Mean squared error sklearn.metrics.mean_squared_error train_error = metrics.mean_squared_error( times_train, times_train_predict) test_error = metrics.mean_squared_error( times_test, times_test_predict)
  • 24. Fun #1: just ticket number? for number, created, ... in tickets_file: row = [] created = dt.datetime.strptime(created, time_format) closetime = dt.datetime.strptime(closetime, time_format) time_to_fix = closetime - created row.append(float(number)) tickets.append(row) times.append(total_seconds(time_to_fix))
  • 25. Fun #1: just ticket number? import numpy as np from sklearn import preprocessing scaler = preprocessing.Scaler().fit( np.array(tickets)) tickets = scaler.transform(tickets)
  • 26. Fun #1: just ticket number? clf = SVR() clf.fit(tickets_train, times_train) times_train_predict = clf.predict(tickets_train) times_test_predict = clf.predict(tickets_test)
  • 27. Fun #1: just ticket number? train_error = metrics.mean_squared_error(times_train, times_train_predict) test_error = metrics.mean_squared_error(times_test, times_test_predict) print 'Train error: %.1f\n Test error: %.2f' % ( math.sqrt(train_error)/(24*3600), math.sqrt(test_error)/(24*3600)) # .. in days
  • 28. Fun #1: just ticket number? Train error: 363.4 Test error: 361.41
  • 29. Finding best parameters SVM C controls regularization: larger C leads to ● closer fit to the train data ● with the risk of overfitting
  • 30. Finding best parameters Cs = np.logspace(-1, 10, 10) for c in Cs: learn(c)
  • 31. Finding best parameters 0.1: Train error: 363.4 Test error: 361.41 1.71: Train error: 363.4 Test error: 361.41 27.8: Train error: 363.4 Test error: 361.39 464.2: Train error: 363.2 Test error: 361.17 7742.6: Train error: 362.5 Test error: 360.41 129155.0: Train error: 362.1 Test error: 360.00 2154434.7: Train error: 362.0 Test error: 359.82 35938136.6: Train error: 361.7 Test error: 359.60 599484250.3: Train error: 361.5 Test error: 359.36 10000000000.0: Train error: 361.1 Test error: 358.91
  • 32. Finding best parameters sklearn.grid_search.GridSearchCV bonus: it can run in parallel clf = GridSearchCV(estimator=SVR(), param_grid=dict(C=np.logspace(-1,10,10)), n_jobs=-1) clf.fit(tickets_train, times_train)
  • 33. Finding best parameters sklearn.grid_search.GridSearchCV bonus: it can run in parallel clf = GridSearchCV(estimator=SVR(), param_grid=dict(C=np.logspace(-1,10,10)), n_jobs=-1) clf.fit(tickets_train, times_train) Train error: 361.1 Test error: 358.91 Best C: 1.0e+10
  • 34. Fun #2: creation date? row = [] row.append(float(number)) row.append(float(time.mktime( created.timetuple()))) tickets.append(row)
  • 35. Fun #2: creation date? Train error: 360.6 Test error: 358.39 Best C: 1.0e+10
  • 36. String vectorizer and Tfidf transform from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
  • 37. String vectorizer and Tfidf transform reporters = [] for number, ... in tickets_file: [...] reporters.append(reporter)
  • 38. String vectorizer and Tfidf transform CountVectorizer().fit_transform(reporters) -> TfidfTransformer().fit_transform( … ) -> hstack((tickets, …) note: TF-IDF matrix is sparse!
  • 39. String vectorizer and Tfidf transform import scipy.sparse as sp tickets = sp.hstack((tickets, TfidfTransformer().fit_transform(CountVectorizer().fit_transform(reporters)))) # remember to re-scale! scaler = preprocessing.Scaler(with_mean=False).fit(tickets) tickets = scaler.transform(tickets)
  • 40. Fun #3: reporter Train error: 338.7 Test error: 353.38 Best C: 1.8e+07
  • 41. Fun #4: subject subjects = [] for number, created, ... in tickets_file: [...] subjects.append(summary) [...] tickets = sp.hstack((tickets, TfidfTransformer().fit_transform(CountVectorizer(ngram_range=(1,3)).fit_transform(subjects))))
  • 42. Fun #4: subject Train error: 21.0 Test error: XXXX Best C: 1.0e+10
  • 43. Fun #4: subject Train error: 21.0 Test error: 331.79 Best C: 1.0e+10
  • 44. Different SVM kernels def learn(kernel='rbf', param_grid=None, verbose=False): [...] clf = GridSearchCV( estimator=SVR(kernel=kernel, verbose=verbose), param_grid=param_grid, n_jobs=-1) [...]
  • 45. Different SVM kernels RBF Train error: 21.0 Test error: 331.79 Best C: 1.0e+10 Linear Train error: 343.1 Test error: 355.56 Best C: 1.0e+02
  • 46. Fun #5: account for the Component components = [] for number, .. component, ... in tickets_file: [...] components.append(component) [...] tickets = sp.hstack((tickets, TfidfTransformer().fit_transform(CountVectorizer().fit_transform(components))))
  • 47. Fun #5: account for the Component RBF Train error: 18.9 Test error: 327.79 Best C: 1.0e+10 Linear: Train error: 342.2 Test error: 354.89 Best C: 1.0e+02
  • 48. Fun #6: ticket Description descriptions = [] for number, ... description in tickets_file: [...] descriptions.append(description) [...] tickets = sp.hstack((tickets, TfidfTransformer().fit_transform(CountVectorizer(ngram_range=(1,3)).fit_transform(descriptions))))
  • 49. Fun #6: ticket Description RBF Train error: 10.8 Test error: 328.44 Best C: 1.0e+10 Linear Train error: 14.0 Test error: 331.52 Best C: 3.2e+03
  • 50. Conclusions ● All steps of a simple machine learning algo
  • 51. Conclusions ● All steps of a simple machine learning algo ● scikit-learn
  • 52. Conclusions ● All steps of a simple machine learning algo ● scikit-learn ● data explicitly available in tickets is NOT ENOUGH to predict closing date
  • 54. Questions? Source code and dataset available at https://github.com/42/django-trac-learning.git Contacts: ● @akhavr ● http://42coffeecups.com/
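
The target and the error metric used throughout the slides (time to fix in seconds, error reported in days) can be sketched with the standard library alone. This is a minimal illustration, not the author's code: the names time_to_fix_seconds and rmse_days and the two sample tickets are hypothetical, and the timestamp format is assumed from the CSV sample on slide 17. It computes the same quantity as math.sqrt(mean_squared_error)/(24*3600) on slide 27, here against a naive "always predict the mean" baseline instead of an SVR.

```python
import math
from datetime import datetime

# Assumption: timestamps look like the CSV sample on slide 17.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 24 * 3600


def time_to_fix_seconds(created, closed):
    """Seconds elapsed between ticket creation and ticket closing."""
    delta = datetime.strptime(closed, TIME_FORMAT) - datetime.strptime(created, TIME_FORMAT)
    return delta.total_seconds()


def rmse_days(actual_seconds, predicted_seconds):
    """Root mean squared error, converted from seconds to days."""
    squared = [(a - p) ** 2 for a, p in zip(actual_seconds, predicted_seconds)]
    return math.sqrt(sum(squared) / len(squared)) / SECONDS_PER_DAY


# Hypothetical tickets: (created, closed). Not taken from the Django tracker.
sample_tickets = [
    ("2012-01-01 00:00:00", "2012-01-03 00:00:00"),  # closed after 2 days
    ("2012-01-01 00:00:00", "2012-01-11 00:00:00"),  # closed after 10 days
]

actual = [time_to_fix_seconds(created, closed) for created, closed in sample_tickets]

# Naive baseline: predict the mean time to fix for every ticket.
mean_time = sum(actual) / len(actual)
baseline = [mean_time] * len(actual)

print("Baseline error: %.1f days" % rmse_days(actual, baseline))
```

A baseline like this gives a reference point for the reported train and test errors: a model whose test error stays close to the spread of the data has learned little from the features.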