Naive application of Machine Learning to Software Development

or... what developers don't tell :)

What and why
42 Coffee Cups: completely distributed development team

Hard facts about how software is done

LOTS OF THEM

Facts ??? Profit

???
Toy problem: take a ticket and predict how long it will take to close
Bonus: learn scikit-learn :)

Install scikit-learn
● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++
● pip install -U scikit-learn

Data: closed tickets
import urllib2

# Trac query: export every ticket as CSV with the columns we need
url = ('https://code.djangoproject.com/query?format=csv' +
       '&col=id&col=time&col=changetime&col=reporter' +
       '&col=summary&col=status&col=owner&col=type' +
       '&col=component&order=priority')

tickets = urllib2.urlopen(url).read()

# keep a dated snapshot on disk
open('2012-10-09.csv', 'w').write(tickets)

Data: closed tickets
id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
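
A minimal sketch of reading rows in this layout with the standard csv module. It is written for Python 3, while the slides use Python 2. The header and two rows are copied from the sample above; the names `SAMPLE` and `TIME_FORMAT` are only for illustration.

```python
import csv
import io
from datetime import datetime

# Header and two rows copied from the sample above.
SAMPLE = """id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
"""

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'

# DictReader consumes the header line and uses it as the keys of each row.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    row['time'] = datetime.strptime(row['time'], TIME_FORMAT)
    row['changetime'] = datetime.strptime(row['changetime'], TIME_FORMAT)

print(rows[0]['reporter'], rows[0]['time'].year)
```

Reading by column name avoids depending on column order, which is fixed only by the `col=` parameters of the query.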

Data: closed date and description
import datetime
import urllib2
import urlparse

from bs4 import BeautifulSoup

def get_data(ticket):
    url = 'https://code.djangoproject.com/ticket/%s' % ticket
    ticket_html = urllib2.urlopen(url)
    bs = BeautifulSoup(ticket_html)

    # get closing date
    d = bs.find_all('div', 'date')[0]
    p = list(d.children)[3]
    href = p.find('a')['href']
    close_time_str = urlparse.parse_qs(href)['/timeline?from'][0]
    # drop the last 6 characters before parsing the timestamp
    close_time = datetime.datetime.strptime(
        close_time_str[:-6], '%Y-%m-%dT%H:%M:%S')
    # ... more black magic, see code
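
The date parsing above is easy to misread, so here is a small Python 3 sketch of just that step. The `href` value is hypothetical, shaped the way a Trac timeline link is assumed to look; the real page may differ.

```python
from datetime import datetime
from urllib.parse import parse_qs

# Hypothetical href, assumed to resemble a Trac timeline link.
href = '/timeline?from=2012-05-20T08%3A12%3A37-07%3A00&precision=second'

# parse_qs gets the whole href, not only the query part, so the first
# key keeps the path prefix: '/timeline?from'.
params = parse_qs(href)
close_time_str = params['/timeline?from'][0]

# Drop the last six characters (here the UTC offset) and parse the rest.
close_time = datetime.strptime(close_time_str[:-6], '%Y-%m-%dT%H:%M:%S')
print(close_time_str, '->', close_time)
```

Cutting the offset off means the resulting time is naive local time, which is fine only as long as every timestamp being compared is treated the same way.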

def get_data(ticket):
    [...]

    # get description and return
    de = bs.find_all('div', 'description')[0]
    return close_time, de.text

import csv

tickets_file = csv.reader(open('2012-10-09.csv'))
output = csv.writer(open('2012-10-09.close.csv', 'w'))

for (id, time, changetime, reporter, summary,
     status, owner, type, component) in tickets_file:
    closetime, descr = get_data(id)
    row = [id, time, changetime, closetime, reporter,
           summary, status, owner, type, component,
           descr.encode('utf-8')]
    output.writerow(row)

Scoring: Train/Test set split
cross_validation.train_test_split
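from sklearn import cross_validation

# hold out 20% of the tickets for testing; fixed seed for repeatability
(tickets_train, tickets_test, times_train,
 times_test) = cross_validation.train_test_split(
    tickets, times,
    test_size=0.2,
    random_state=0)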
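
To show what such a split does, here is a pure-Python sketch. It is not scikit-learn's implementation: `split_train_test` is a hypothetical helper, and its shuffle will not reproduce the same partition as `random_state=0` in scikit-learn.

```python
import random

def split_train_test(features, targets, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed; the first share becomes the test set."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))

# Toy data: ten tickets, one feature each, target = 10 * feature.
tickets = [[float(n)] for n in range(1, 11)]
times = [10.0 * n for n in range(1, 11)]

tickets_train, tickets_test, times_train, times_test = split_train_test(
    tickets, times, test_size=0.2, seed=0)
print(len(tickets_train), len(tickets_test))
```

The important property is that features and targets are shuffled with the same indices, so every ticket stays paired with its own time.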

Scoring: Mean squared error
sklearn.metrics.mean_squared_error

train_error = metrics.mean_squared_error(
times_train, times_train_predict)

test_error = metrics.mean_squared_error(
times_test, times_test_predict)
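
Since the targets are in seconds, the raw mean squared error is hard to read. A small sketch of the metric and of the conversion to days used later in the slides (square root, then divide by 24*3600). The numbers are made up and `mean_squared_error` here is a local stand-in, not the scikit-learn function.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

SECONDS_PER_DAY = 24 * 3600

# Made-up example: two tickets, closed after 10 and 20 days,
# predicted as 13 and 16 days.
actual = [10 * SECONDS_PER_DAY, 20 * SECONDS_PER_DAY]
predicted = [13 * SECONDS_PER_DAY, 16 * SECONDS_PER_DAY]

mse = mean_squared_error(actual, predicted)
rmse_days = math.sqrt(mse) / SECONDS_PER_DAY
print('RMSE: %.2f days' % rmse_days)
```

The errors are 3 and 4 days, so the root mean squared error is sqrt((9 + 16) / 2) days, a bit above 3.5.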

Fun #1: just ticket number?
for number, created, ... in tickets_file:
    row = []
    created = dt.datetime.strptime(created, time_format)
    closetime = dt.datetime.strptime(closetime, time_format)
    time_to_fix = closetime - created

    row.append(float(number))
    tickets.append(row)
    times.append(total_seconds(time_to_fix))
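
A sketch of the target computation for a single ticket. The `time_format` value and the close time are assumptions for illustration, and `total_seconds` is a guess at what the helper in the loop does; the real one is in the source code.

```python
import datetime as dt

def total_seconds(delta):
    """Length of a timedelta in seconds (same value as delta.total_seconds())."""
    return delta.days * 24 * 3600 + delta.seconds + delta.microseconds / 1e6

time_format = '%Y-%m-%d %H:%M:%S'  # assumed; matches the CSV timestamps
created = dt.datetime.strptime('2005-07-13 12:03:27', time_format)
closetime = dt.datetime.strptime('2005-07-15 18:03:27', time_format)  # made-up

time_to_fix = closetime - created
seconds = total_seconds(time_to_fix)
print(seconds / (24 * 3600), 'days')
```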

import numpy as np
from sklearn import preprocessing

scaler = preprocessing.Scaler().fit(
np.array(tickets))
tickets = scaler.transform(tickets)
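
What the scaler does to one column, as a pure-Python sketch: subtract the mean and divide by the standard deviation. `standardize` is a hypothetical helper, not the library code, and the constant-column guard is an assumption of this sketch.

```python
import math

def standardize(column):
    """Rescale a list of numbers to zero mean and unit variance."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    # A constant column has zero variance; leave it centred but unscaled.
    std = math.sqrt(variance) or 1.0
    return [(x - mean) / std for x in column]

ticket_numbers = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = standardize(ticket_numbers)
print(['%.3f' % x for x in scaled])
```

Scaling matters for an SVR with an RBF kernel because the kernel depends on distances, and an unscaled ticket number in the thousands would dominate them.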

import math
from sklearn import metrics
from sklearn.svm import SVR

clf = SVR()
clf.fit(tickets_train, times_train)

times_train_predict = clf.predict(tickets_train)
times_test_predict = clf.predict(tickets_test)

train_error = metrics.mean_squared_error(
    times_train, times_train_predict)
test_error = metrics.mean_squared_error(
    times_test, times_test_predict)

# root mean squared error, converted from seconds to days
print 'Train error: %.1f\n Test error: %.2f' % (
    math.sqrt(train_error) / (24 * 3600),
    math.sqrt(test_error) / (24 * 3600))


Train error: 363.4
Test error: 361.41

Finding best parameters
SVM C controls regularization:

larger C leads to
● closer fit to the train data
● with the risk of overfitting

Cs = np.logspace(-1, 10, 10)
for c in Cs:
    learn(c)

0.1: Train error: 363.4 Test error: 361.41
10000000000.0: Train error: 361.1 Test error: 358.91
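
For reference, the grid of C values is evenly spaced in the exponent, not in the value. A pure-Python sketch of the same idea; `logspace` here is a local stand-in for `np.logspace`.

```python
def logspace(start, stop, num):
    """num values from 10**start to 10**stop, evenly spaced in the exponent."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

Cs = logspace(-1, 10, 10)
for c in Cs:
    print('%.1e' % c)
```

Ten points over eleven orders of magnitude is a coarse grid: neighbouring candidates differ by a factor of about 17.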

sklearn.grid_search.GridSearchCV
bonus: it can run in parallel

clf = GridSearchCV(estimator=SVR(),
                   param_grid=dict(C=np.logspace(-1, 10, 10)),
                   n_jobs=-1)

Train error: 361.1 Test error: 358.91
Best C: 1.0e+10
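
A grid search with cross-validation boils down to: for every candidate value, fit on part of the training data, score on the rest, and keep the candidate with the best average score. The sketch below shows that loop in pure Python. It is not scikit-learn's code, and it swaps the SVR for a one-dimensional ridge regression so that it stays short; note that the ridge penalty `alpha` works in the opposite direction to C (larger alpha means more regularization). All names and data are made up.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge fit of y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def mse(xs, ys, w):
    """Mean squared error of the line y = w * x on the given points."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grid_search_cv(xs, ys, alphas, n_folds=3):
    """Return (best_alpha, {alpha: mean validation MSE over the folds})."""
    scores = {}
    for alpha in alphas:
        fold_errors = []
        for fold in range(n_folds):
            # Every n_folds-th sample is held out for validation.
            train = [i for i in range(len(xs)) if i % n_folds != fold]
            valid = [i for i in range(len(xs)) if i % n_folds == fold]
            w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
            fold_errors.append(mse([xs[i] for i in valid], [ys[i] for i in valid], w))
        scores[alpha] = sum(fold_errors) / n_folds
    best_alpha = min(scores, key=scores.get)
    return best_alpha, scores

# Toy data: y is roughly 2 * x with a small alternating offset.
xs = [float(i) for i in range(1, 13)]
ys = [2.0 * x + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

best_alpha, scores = grid_search_cv(xs, ys, alphas=[0.01, 1.0, 100.0])
print(best_alpha, scores)
```

Each candidate is scored independently of the others, which is why the search parallelizes well.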

Fun #2: creation date?

row = []
row.append(float(number))

row.append(float(time.mktime(
created.timetuple())))

tickets.append(row)

Best C: 1.0e+10

String vectorizer and Tfidf transform

from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer)

reporters = []

for number, ... in tickets_file:
    [...]
    reporters.append(reporter)

CountVectorizer().fit_transform(reporters) ->
TfidfTransformer().fit_transform( … ) ->
hstack((tickets, …))

note: TF-IDF matrix is sparse!

import scipy.sparse as sp

tickets = sp.hstack((
tickets,
TfidfTransformer().fit_transform(
CountVectorizer().fit_transform(reporters))))

# remember to re-scale!
scaler = preprocessing.Scaler(with_mean=False
).fit(tickets)
tickets = scaler.transform(tickets)
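
To make the count-then-TF-IDF step less of a black box, here is a pure-Python sketch. It is a simplification, not the library code: tokens are split on whitespace, the idf uses one common smoothed formula, rows are L2-normalized, and each row is kept as a sparse dict of non-zero columns. The `tfidf` helper and the sample reporters are for illustration only.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, rows); each row is a sparse {column: weight} dict."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({t for doc in tokenized for t in doc})
    vocabulary = {term: col for col, term in enumerate(terms)}
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the term count.
        weights = {vocabulary[t]: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({col: w / norm for col, w in weights.items()} if norm else {})
    return vocabulary, rows

reporters = ['adrian', 'anonymous', 'adrian']
vocabulary, rows = tfidf(reporters)
print(vocabulary)
print(rows)
```

For a one-word field like the reporter, each row ends up with a single non-zero entry, so the result is effectively a one-hot encoding. Storing only the non-zero entries is the reason the matrix is sparse.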

Fun #3: reporter

Best C: 1.8e+07

Fun #4: subject
subjects = []

for number, created, ... in tickets_file:
    [...]
    subjects.append(summary)
    [...]

tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer(ngram_range=(1, 3)
        ).fit_transform(subjects))))

Fun #4: subject

Train error: 21.0 Test error: XXXX
Best C: 1.0e+10
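
The `ngram_range=(1,3)` setting above means single words, pairs, and triples of adjacent words all become features. A pure-Python sketch of that expansion; it uses whitespace tokenization as a simplification, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """All word n-grams of the text for every n in the inclusive range."""
    tokens = text.lower().split()
    low, high = ngram_range
    return [' '.join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)]

features = word_ngrams('Create architecture for anonymous sessions')
print(len(features))
for feature in features:
    print(feature)
```

A five-word summary already yields twelve features, and most bigrams and trigrams occur in very few tickets. That gives the model a lot of room to memorize the training set, which may be part of why the train error on the slide drops so sharply.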

Different SVM kernels
def learn(kernel='rbf', param_grid=None, verbose=False):
    [...]
    clf = GridSearchCV(
        estimator=SVR(kernel=kernel, verbose=verbose),
        param_grid=param_grid,
        n_jobs=-1)
    [...]

Different SVM kernels
RBF
Best C: 1.0e+10

Linear
Best C: 1.0e+02
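
For readers new to kernels, a sketch of the two similarity functions being compared. The formulas are the standard ones; `gamma=0.5` is an arbitrary value chosen for the example, not the setting used in the experiments.

```python
import math

def linear_kernel(a, b):
    """Plain dot product: similarity grows with aligned, large features."""
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """exp(-gamma * squared distance): 1 for identical points, near 0 for distant ones."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

a, b = [1.0, 0.0], [0.0, 1.0]
print(linear_kernel(a, b), rbf_kernel(a, b), rbf_kernel(a, a))
```

With a linear kernel the model is a weighted sum of the features, while the RBF kernel lets predictions depend on how close a ticket is to individual training tickets.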

Fun #5: account for the Component
components = []

for number, ..., component, ... in tickets_file:
    [...]
    components.append(component)
    [...]

tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer().fit_transform(components))))

Fun #5: account for the Component
RBF
Best C: 1.0e+10

Linear
Best C: 1.0e+02

Fun #6: ticket Description
descriptions = []

for number, ..., description in tickets_file:
    [...]
    descriptions.append(description)
    [...]

tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer(ngram_range=(1, 3)).fit_transform(
            descriptions))))

Fun #6: ticket Description
RBF
Best C: 1.0e+10

Linear
Best C: 3.2e+03

Conclusions
● All steps of a simple machine learning algorithm
● scikit-learn
● data explicitly available in tickets is NOT ENOUGH to predict the closing date

Developers, what are you hiding? :)

Questions?
Source code and dataset available at
https://github.com/42/django-trac-learning.git

Contacts:
● @akhavr
● http://42coffeecups.com/
