Naive application of machine learning to software development: take tickets from Django's Trac ticket tracker and try to predict how long each ticket will take to close.
Fact: developers aren't putting the RIGHT information into their tracking systems :)
18. Data: closed date and description
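import csv, datetime, urllib2, urlparse
from bs4 import BeautifulSoup

def get_data(ticket):
    # fetch and parse the ticket page
    url = 'https://code.djangoproject.com/ticket/%s' % ticket
    ticket_html = urllib2.urlopen(url)
    bs = BeautifulSoup(ticket_html)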
19. Data: closed date and description
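    # get closing date: a link inside the 'date' div points at
    # /timeline?from=<timestamp>, so the timestamp sits in the query string
    d = bs.find_all('div', 'date')[0]
    p = list(d.children)[3]
    href = p.find('a')['href']
    # parse_qs on the whole href keeps the path as part of the key
    close_time_str = urlparse.parse_qs(href)['/timeline?from'][0]
    # drop the last 6 characters (a UTC offset such as -05:00) before parsing
    close_time = datetime.datetime.strptime(close_time_str[:-6],
                                            '%Y-%m-%dT%H:%M:%S')
    # ... more black magic, see code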
20. Data: closed date and description
def get_data(ticket):
    [...]
    # get description and return
    de = bs.find_all('div', 'description')[0]
    return close_time, de.text
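The date handling above can be tried in isolation. The following is a minimal sketch in Python 3 (where `urlparse` became `urllib.parse`), not the talk's code: the helper name `parse_close_time` and the example href are made up for illustration, and the href format is assumed from the `'/timeline?from'` key used on the slide.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

def parse_close_time(href):
    """Extract the 'from' timestamp of a timeline link as a naive datetime."""
    query = urlsplit(href).query          # 'from=...&precision=second'
    stamp = parse_qs(query)['from'][0]    # parse_qs also decodes %3A to ':'
    # Drop the trailing 6-character UTC offset, as the slide code does.
    return datetime.strptime(stamp[:-6], '%Y-%m-%dT%H:%M:%S')

# Hypothetical href, made up for illustration (assumed link format).
example_href = '/timeline?from=2012-10-09T14%3A30%3A00-05%3A00&precision=second'
close_time = parse_close_time(example_href)
print(close_time)
```

Splitting the URL first gives a plain `'from'` key instead of `'/timeline?from'`. Note that cutting off the offset, as on the slide, yields local wall-clock time rather than UTC.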
21. Data: closed date and description
tickets_file = csv.reader(open('2012-10-09.csv'))
output = csv.writer(open('2012-10-09.close.csv', 'w'))
for (id, time, changetime, reporter, summary,
     status, owner, type, component) in tickets_file:
    closetime, descr = get_data(id)
    row = [id, time, changetime, closetime, reporter,
           summary, status, owner, type, component,
           descr.encode('utf-8')]
    output.writerow(row)
33. Finding best parameters
sklearn.grid_search.GridSearchCV
bonus: it can run in parallel
clf = GridSearchCV(estimator=SVR(),
                   param_grid=dict(C=np.logspace(-1, 10, 10)),
                   n_jobs=-1)
clf.fit(tickets_train, times_train)
Train error: 361.1 Test error: 358.91
Best C: 1.0e+10
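For readers who want to run the grid search themselves, here is a small sketch under stated assumptions. It uses synthetic stand-in data rather than the ticket data, a shorter grid of `C` values to keep it fast, and the `sklearn.model_selection` import path, which is where `GridSearchCV` lives in recent scikit-learn releases (the slide's `sklearn.grid_search` is the older location). The linear kernel and `cv=3` are choices made for this sketch only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the ticket data): 60 samples, 3 features.
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = X.dot(np.array([1.0, -2.0, 0.5])) + 0.01 * rng.randn(60)

c_grid = np.logspace(-1, 3, 5)   # smaller grid than the slide, to keep it fast
search = GridSearchCV(estimator=SVR(kernel='linear'),
                      param_grid={'C': c_grid},
                      cv=3,
                      n_jobs=1)  # -1 would use all cores, as on the slide
search.fit(X, y)

best_c = search.best_params_['C']
print('Best C: %.1e' % best_c)
```

After fitting, `search.best_estimator_` is a model refit on the full training set with the winning `C`, so it can be used directly for prediction.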
45. Different SVM kernels
RBF
Train error: 21.0 Test error: 331.79
Best C: 1.0e+10
Linear
Train error: 343.1 Test error: 355.56
Best C: 1.0e+02
46. Fun #5: account for the Component
components = []
for number, ..., component, ... in tickets_file:
    [...]
    components.append(component)
[...]
# append TF-IDF features of the component name as extra columns
tickets = sp.hstack((tickets, TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(components))))
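The column-stacking step can be illustrated on toy data. This is a sketch, not the talk's code: it assumes `sp` on the slide refers to `scipy.sparse`, and the existing feature matrix and the component names below are made up for illustration.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up stand-ins: 3 tickets with 2 existing features each.
tickets = sp.csr_matrix([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
components = ['Forms', 'Core (Other)', 'Forms']  # hypothetical component names

# Bag-of-words counts first, then TF-IDF weighting, as on the slide.
counts = CountVectorizer().fit_transform(components)
component_features = TfidfTransformer().fit_transform(counts)

# hstack keeps one row per ticket and appends the new columns on the right.
combined = sp.hstack((tickets, component_features)).tocsr()
print(tickets.shape, component_features.shape, combined.shape)
```

The row order of `components` has to match the row order of `tickets`, since `hstack` aligns the two matrices purely by position.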
47. Fun #5: account for the Component
RBF
Train error: 18.9 Test error: 327.79
Best C: 1.0e+10
Linear
Train error: 342.2 Test error: 354.89
Best C: 1.0e+02
48. Fun #6: ticket Description
descriptions = []
for number, ..., description in tickets_file:
    [...]
    descriptions.append(description)
[...]
# word 1-grams to 3-grams of the description, TF-IDF weighted
tickets = sp.hstack((tickets, TfidfTransformer().fit_transform(
    CountVectorizer(ngram_range=(1, 3)).fit_transform(descriptions))))
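To make `ngram_range=(1, 3)` concrete, the sketch below lists the word n-grams of one short text in plain Python. It is a simplification for illustration: the function name `word_ngrams` is hypothetical, and it splits on whitespace only, whereas `CountVectorizer` applies its own tokenizer (which, by default, also drops punctuation and single-character tokens).

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    words = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            grams.append(' '.join(words[start:start + n]))
    return grams

# Made-up description text, for illustration only.
grams = word_ngrams('admin login fails')
print(grams)
```

Each distinct n-gram becomes its own column, so descriptions add far more features than the single-word component names do.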
49. Fun #6: ticket Description
RBF
Train error: 10.8 Test error: 328.44
Best C: 1.0e+10
Linear
Train error: 14.0 Test error: 331.52
Best C: 3.2e+03
52. Conclusions
● All the steps of a simple machine learning algorithm
● scikit-learn
● The data explicitly available in tickets is NOT ENOUGH to predict the closing date