What are artificial neural networks, and how can we use them?
Example scenarios and a workshop at the Istanbul Technical University Technology Transfer Office (ITUNOVA TTO). Talk given in 2016.
2. TOPICS
• TalkingData
• Introduction
• Datasets
• Sample submission
• Some analysis
• Basic solution
• To Do
• Predicting A Biological Response
• Introduction
• Datasets
• Solution
• Model
• Code
• To Do
3. ARTIFICIAL NEURAL NETWORKS PROBLEM
PROJECT
TALKINGDATA MOBILE USER DEMOGRAPHICS
A KAGGLE COMPETITION
NAME: YAKUP GÖRÜR
DATE: 13 DECEMBER 2016
4. TALKINGDATA MOBILE USER DEMOGRAPHICS
INTRODUCTION
• TalkingData, China’s largest third-party mobile data platform, is
seeking to leverage behavioral data from more than 70% of the 500
million mobile devices active daily in China to help its clients better
understand and interact with their audiences.
• In this competition, participants are challenged to build a model
predicting users’ demographic characteristics based on:
• Their app usage
• Geolocations
• Mobile device properties
6. • The data was obtained from the kaggle.com as a .csv file.
• Test Data:
• gender_age_test.csv
• Training Data:
• gender_age_train.csv
• events.csv
• phone_brand_device_model.csv
• app_events.csv
• app_labels.csv
• label_categories.csv
TALKINGDATA MOBILE USER DEMOGRAPHICS
9. events.csv and app_events.csv
• events.csv: when a user uses the TalkingData SDK, the event gets
logged in the events data.
• app_events.csv: each event corresponds to a list of apps in
app_events.
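The relationship between the two tables can be sketched with toy data (the column names come from the dataset; the values below are made up for illustration):

```python
import pandas as pd

# Toy rows with the same columns as events.csv / app_events.csv.
events = pd.DataFrame({
    'event_id': [1, 2],
    'device_id': ['d1', 'd2'],
})
app_events = pd.DataFrame({
    'event_id': [1, 1, 2],
    'app_id': [10, 11, 10],
    'is_installed': [1, 1, 1],
    'is_active': [1, 0, 1],
})

# Aggregate per event: how many apps were installed / active at that event.
agg = app_events.groupby('event_id')[['is_installed', 'is_active']].sum().reset_index()
agg = agg.rename(columns={'is_installed': 'installed', 'is_active': 'active'})

# Attach the per-event app counts back to the events table.
merged = events.merge(agg, how='left', on='event_id')
print(merged)
```

This one-to-many join (one event, many app rows) is exactly what the feature-construction code on the next slides exploits.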
16. import numpy as np
import pandas as pd

def map_column(table, f):
    # Helper used below (shown for completeness):
    # label-encode column f by mapping each distinct value to an integer.
    labels = sorted(table[f].unique())
    mappings = {label: i for i, label in enumerate(labels)}
    return table.replace({f: mappings})

def read_train_test():
    # App events
    print('Read app events...')
    ape = pd.read_csv("/Users/yakup/Downloads/TalkingData/app_events.csv")
    ape['installed'] = ape.groupby(['event_id'])['is_installed'].transform('sum')
    ape['active'] = ape.groupby(['event_id'])['is_active'].transform('sum')
    ape.drop(['is_installed', 'is_active'], axis=1, inplace=True)
    ape.drop_duplicates('event_id', keep='first', inplace=True)
    ape.drop(['app_id'], axis=1, inplace=True)
    # Events
    print('Read events...')
    events = pd.read_csv("/Users/yakup/Downloads/TalkingData/events.csv", dtype={'device_id': str})
    events['counts'] = events.groupby(['device_id'])['event_id'].transform('count')
    # The idea here is to count the number of installed apps per event using
    # the data from app_events.csv above, and likewise the number of active apps.
    events = pd.merge(events, ape, how='left', on='event_id')
    # The original events_small table kept only ['device_id', 'counts'];
    # this version adds the two extra features.
    events_small = events[['device_id', 'counts', 'installed', 'active']].drop_duplicates('device_id', keep='first')
    # Phone brand
    print('Read brands...')
    pbd = pd.read_csv("/Users/yakup/Downloads/TalkingData/phone_brand_device_model.csv", dtype={'device_id': str})
    pbd.drop_duplicates('device_id', keep='first', inplace=True)
    pbd = map_column(pbd, 'phone_brand')
    pbd = map_column(pbd, 'device_model')
    # Train
    print('Read train...')
    train = pd.read_csv("/Users/yakup/Downloads/TalkingData/gender_age_train.csv", dtype={'device_id': str})
    train = map_column(train, 'group')
    train = train.drop(['age', 'gender'], axis=1)
    train = pd.merge(train, pbd, how='left', on='device_id')
    train = pd.merge(train, events_small, how='left', on='device_id')
    train.fillna(-1, inplace=True)
    # Test
    print('Read test...')
    test = pd.read_csv("/Users/yakup/Downloads/TalkingData/gender_age_test.csv", dtype={'device_id': str})
    test = pd.merge(test, pbd, how='left', on='device_id')
    test = pd.merge(test, events_small, how='left', on='device_id')
    test.fillna(-1, inplace=True)
    # Features
    features = list(test.columns.values)
    features.remove('device_id')
    return train, test, features

Thanks to @ZFTurbo
XGBOOST SUBMISSION SAMPLE
using only users’ phone model and their apps and labels
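Kaggle scores these submissions with multi-class logarithmic loss over the predicted group probabilities. A minimal numpy sketch of that metric (the three-class toy data below is illustrative; the real task has 12 gender-age groups):

```python
import numpy as np

def multiclass_logloss(y_true, y_pred, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class indices, shape (n,)
    y_pred: predicted class probabilities, shape (n, n_classes)
    """
    p = np.clip(y_pred, eps, 1 - eps)
    p = p / p.sum(axis=1, keepdims=True)  # renormalize after clipping
    return -np.mean(np.log(p[np.arange(len(y_true)), y_true]))

# Two samples, three classes (toy numbers, not competition data).
y_true = np.array([0, 2])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]])
print(multiclass_logloss(y_true, y_pred))  # -(ln 0.7 + ln 0.8) / 2 ≈ 0.2899
```

Confident but wrong predictions are punished hard by this metric, which is why the clipping at `eps` matters in practice.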
18. TO DO
• Also use latitude/longitude
• Also use event hours by gender (female/male)
• Re-train the model and re-test
19. ARTIFICIAL NEURAL NETWORKS PROBLEM
PROJECT
PREDICTING A BIOLOGICAL RESPONSE
A KAGGLE COMPETITION
NAME: YAKUP GÖRÜR
DATE: 13 DECEMBER 2016
20. PREDICTING A BIOLOGICAL RESPONSE
INTRODUCTION
• The development of a new drug largely depends on trial and
error.
• It typically involves synthesizing thousands of compounds before
one finally becomes a drug.
• As a result, this process is extremely expensive and slow.
• Therefore, the ability to accurately predict the biological activity
of molecules, and to understand the rationale behind those
predictions, is of great value.
21. PREDICTING A BIOLOGICAL RESPONSE
COMPETITION AND DATA
• The objective of the competition is to build as good a model as
possible so that we can, as optimally as this data allows, relate
molecular information to an actual biological response.
• Purpose: predict the biological response of molecules from their
chemical properties.
• The data comes from the Kaggle competition “Predicting a
Biological Response”, held between March 16, 2012 and June 15,
2012 and re-enabled with new data in 2013.
23. • The data was obtained from the kaggle.com as a .csv file.
• train.csv
• test.csv
• svm_benchmark.csv
PREDICTING A BIOLOGICAL RESPONSE
24. PREDICTING A BIOLOGICAL RESPONSE
TRAIN DATA
• The first column contains experimental data describing an actual
biological response (Active/Inactive).
• The remaining columns represent molecular descriptors (D1
through D1776), e.g. size, shape, etc.
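Splitting that layout into a target vector and a descriptor matrix can be sketched with pandas on a toy frame (assuming the response column is named `Activity`, as in the Kaggle files; only 3 of the 1776 descriptors are shown):

```python
import pandas as pd

# Toy frame with the same layout as train.csv: first the biological
# response, then the molecular descriptor columns D1..D1776.
train = pd.DataFrame({
    'Activity': [1, 0, 1],   # active (1) / inactive (0)
    'D1': [0.00, 0.50, 0.25],
    'D2': [1.00, 0.00, 0.50],
    'D3': [0.10, 0.90, 0.30],
})

y = train['Activity'].values                # target vector
X = train.drop('Activity', axis=1).values   # descriptor matrix
print(X.shape, y.shape)
```

On the real file, `X` has 1776 columns, which is why the first weight matrix in the network slide has 1776 input rows.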
29. PREDICTING A BIOLOGICAL RESPONSE
MODEL
[Network diagram: Input → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output]
30. import numpy as np
import pandas as pd

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output * (1 - output)

# traindatai / traindatao: descriptor and response data loaded from train.csv earlier
X = traindatai.values
y = traindatao
f = open('Error.txt', 'w')
alphas = [0.001, 0.01, 0.1, 1]  # example learning rates; the talk's exact list isn't shown
for alpha in alphas:
    print("\nTraining With Alpha: " + str(alpha))
    np.random.seed(1)
    # randomly initialize our weights with mean 0
    synapse_0 = 2 * np.random.random((1776, 11)) - 1
    synapse_1 = 2 * np.random.random((11, 4)) - 1
    synapse_2 = 2 * np.random.random((4, 2)) - 1
    synapse_3 = 2 * np.random.random((2, 1)) - 1
    for j in range(20000):
        # Feed forward through layers 0, 1, 2, 3, 4
        layer_0 = X
        layer_1 = sigmoid(np.dot(layer_0, synapse_0))
        layer_2 = sigmoid(np.dot(layer_1, synapse_1))
        layer_3 = sigmoid(np.dot(layer_2, synapse_2))
        layer_4 = sigmoid(np.dot(layer_3, synapse_3))
        # how much did we miss the target value?
        layer_4_error = layer_4 - y
        if (j % 1000) == 999:
            print("Error After: " + str(j) + " iterations: " + str(np.mean(np.abs(layer_4_error))))
        # in what direction is the target value?
        # were we really sure? if so, don't change too much
        layer_4_delta = layer_4_error * sigmoid_output_to_derivative(layer_4)
        # how much did each l3 value contribute to the l4 error (according to the weights)?
        layer_3_error = layer_4_delta.dot(synapse_3.T)
        layer_3_delta = layer_3_error * sigmoid_output_to_derivative(layer_3)
        # how much did each l2 value contribute to the l3 error?
        layer_2_error = layer_3_delta.dot(synapse_2.T)
        layer_2_delta = layer_2_error * sigmoid_output_to_derivative(layer_2)
        # how much did each l1 value contribute to the l2 error?
        layer_1_error = layer_2_delta.dot(synapse_1.T)
        layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)
        # gradient-descent weight updates
        synapse_3 -= alpha * layer_3.T.dot(layer_4_delta)
        synapse_2 -= alpha * layer_2.T.dot(layer_3_delta)
        synapse_1 -= alpha * layer_1.T.dot(layer_2_delta)
        synapse_0 -= alpha * layer_0.T.dot(layer_1_delta)
f.close()
test_data = pd.read_csv('/Users/yakup/Downloads/Predicting a Biological Response/test.csv')  # Open file
x = test_data.values
layer_0 = x
https://github.com/ykpgrr/Artificial_Neural_Network
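The backpropagation above relies on the identity σ'(x) = σ(x)(1 − σ(x)), which lets the derivative be computed from a layer's output alone. A quick numerical check against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_output_to_derivative(output):
    # Derivative expressed in terms of the sigmoid's output:
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    return output * (1 - output)

# Compare the identity with a central finite difference on a grid.
x = np.linspace(-4, 4, 9)
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid_output_to_derivative(sigmoid(x))
print(np.max(np.abs(numeric - analytic)))  # agreement to ~1e-10
```

This is why the training loop never recomputes σ'(x) from the inputs: the stored layer activations already contain everything needed.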
31. TO DO
• Use adaptive gradient methods
• Clean the dataset
• Try different ANN models
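As an illustration of the first to-do item, a minimal Adam-style adaptive-gradient update in numpy (a sketch on a toy quadratic, not the talk's code):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam keeps running estimates of the gradient's mean and variance
    # and scales each step by them, instead of using a fixed alpha.
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)              # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 starting from w = 5.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * w                           # gradient of w^2
    w, m, v = adam_step(w, grad, m, v, t)
print(w)  # close to the minimum at 0
```

In the network above, the same update would replace the plain `synapse -= alpha * ...` lines, with one `(m, v)` pair per weight matrix.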