SlideShare uma empresa Scribd logo
1 de 78
Baixar para ler offline
Case Studies
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation
Case: Mining Reddit Data
不指定特定目標的Cases
2
Reddit Data
https://drive.google.com/open?id=0BwpI8947eCyuRFVDLU4tT2
5JbFE
3
Reddit: The Front Page of the
Internet
50k+ on
this set
Subreddit Categories
▷Reddit’s structure may already provide a
baseline similarity
Provided Data
Recover Structure
Data Exploration
8
9
Data Mining Final Project
Role-Playing Games Sales Prediction
Group 6
Dataset
• Use Reddit comment dataset
• Data Selection
– Choose 30 Role-playing games from the subreddits.
– Choose useful attributes form their comment
• Body
• Score
• Subreddit
11
12
Preprocessing
• Common Game Features
– Drop trivial words
• We manually build the trivial words dictionary by
ourselves.
– Ex : “is” “the” “I” “you” “we” “a”.
– TF-IDF
• We use Tf-IDF to calculate the importance of each
words.
13
Preprocessing
– TF-IDF result
• But the TF-IDF value are too small to find the significant
importance.
– Define Comment Feature
• We manually define the common gaming words in five
categories by referring several game website.
Game-Play | Criticize | Graphic | Music | Story
14
Preprocessing
• Filtering useful comments
– Common keywords of 5 categories:
– Filtering useful comments
• Using the frequent keywords to find out the useful
comments in each category.
15
Preprocessing
– Using FP-Growth to find other feature
• To Find other frequent keywords
{time, play, people, gt, limited}
{time, up, good, now, damage, gear, way, need,
build, better, d3, something, right, being, gt, limited}
• Then, manually choose some frequent keywords
16
Preprocessing
• Comment Emotion
– Filtering outliers
• First filter out those comment whose “score” separate
in top and bottom 2.5% which indicate that they might
be outliers.
17
Preprocessing
– Emotion detection
• We use LIWC dictionary to calculate each comment’s positive
and negative emotion percentage.
Ex: I like this game. -> positive
EX: I hate this character. -> negative
• Then find each category’s positive and negative emotion
score.
– 𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑃= 𝑗=0
𝑛
𝑠𝑐𝑜𝑟𝑒_𝑒𝑎𝑐ℎ_𝑐𝑜𝑚𝑚𝑒𝑛𝑡
𝑗
* (positive words)
– 𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑁= 𝑗=0
𝑛
𝑠𝑐𝑜𝑟𝑒_𝑒𝑎𝑐ℎ_𝑐𝑜𝑚𝑚𝑒𝑛𝑡
𝑗
* (negative words)
– 𝑇𝑜𝑡𝑎𝑙𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦= 𝑖=0
𝑚
𝑠𝑐𝑜𝑟𝑒_𝑒𝑎𝑐ℎ_𝑐𝑜𝑚𝑚𝑒𝑛𝑡 𝑚
– 𝐹𝑖𝑛𝑎𝑙𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑃
= 𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑃/
𝑇𝑜𝑡𝑎𝑙𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦
– 𝐹𝑖𝑛𝑎𝑙𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑁
= 𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑁/
𝑇𝑜𝑡𝑎𝑙𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦 18
Preprocessing
– Emotion detection
• We use LIWC dictionary to calculate each comment’s positive
and negative emotion percentage.
Ex: I like this game. -> positive
EX: I hate this character. -> negative
• Then find each category’s positive and negative emotion
score.
19
The meaning is :
Finding each category’s emotion score for each game
Calculating TotalScore of each category’s comments
Having the average emotion score FinalScore of each category
Preprocessing
• Sales Data Extraction
– Crawling website’s data
– Find the games’ sales on each platform
20
Preprocessing
– Annually sales for each game
– Median of each platform
– Mean of all platform
Median : 0.17 -> 30 games (26H 4L)
Mean : 0.533 -> 30 games (18H 12L)
– Try both of Median/ Mean to do the prediction.
21
Sales Prediction
• Model Construction
– We use the 4 outputs from pre-processing step to
be the input of the Naïve Bayes classifier
22
Evaluation
We use naïve Bayes to evaluate our result
1. training 70% & test30%
2. Leave-one-out
23
Evaluation (train70% & test30% )
0.67
0.78
0.67
0.56
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Accuracy
Median 0.17
Test1 Total_score
Test1 No Total_score
Test2 Total_score
Test2 No Total_score
0.78
0.44
0.56
0.22
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Accuracy
Mean 0.533
Test1 Total_score
Test1 No Total_score
Test2 Total_score
Test2 No Total_score
24
Evaluation(Leave-one-out)
Choose 29 games as training data and the other
one as test data, and totally run 30times to get
the accuracy.
Total 30 times Accuracy
Median Total_score 7 times wrong 77%
Median NO Total_score 5 times wrong 83%
Mean Total_score 6 times wrong 80%
Mean No Total_score 29 times wrong 3%
25
Attributes: Original Features Scores
HL
Median: 0.17M Mean: 0.53M
Attribute distribution to target(sales) domain
Sales boundary for H and L class
H, L
Sales class
Error
Times
Accuracy
Median 5 83%
Mean 29 3%
Evaluation: Leave-one-Out
Total Sample size: 30
26
Attributes: Transformed Features Scores
Boundary
Error
Times
Accuracy
Median 7 77%
Mean 6 80%
HL
Median:
0.17M
Mean:
0.53M
Attribute distribution project to target(sales) domain
Sales boundary for H and L class
H, L Sales class
Evaluation: Leave-one-Out
Total Sample size: 30
27
Finding related subreddits based
on the gamers’ participation
Data Mining Final Project
Group#4
Goal & Hypothesis
- To suggest new subreddits to the Gaming subreddit users based on the
participation of other users
29
Data Exploration
We extracted the Top 20 gaming related subreddits to be focused on.
30
leagueoflegends
DestinyTheGame
DotA2
GlobalOffensive
Gaming
GlobalOffensiveTrade
witcher
hearthstone
Games
2007scape
smashbros
wow
Smite
heroesofthestorm
EliteDangerous
FIFA
Guildwars2
tf2
summonerswar
runescape
31
Data Exploration (Cont.)
We sorted the subreddits by users’
comment activities.
We found out that the most active
subreddit is Leagueoflegends.
32
Data Exploration (Cont.)
Data Pre-processing
- Removed comments from moderators and bots (>2,000 comments)
- Removed comments from default subreddits
- Extracted all the users that commented on at least 3 subreddits
- Transformed data into “Market Basket” transactions: 159,475 users
33
34
Processing
Sorted the frequent items (sort.js)
Eg. DotA2, Neverwinter, atheism → DotA2, atheism, Neverwinter
35
Processing
- Choosing the minimum support from 159,475 Transactions
- Max possible support 25.71% (at least in 41,004 Transactions)
- Min possible support 0.0000062% (at least in 1 Transaction)
Min_Support as
0.05 % = 79.73
transactions
36
Processing
- How a FP-Growth tree looks like
37
Processing
- Our FP-Growth (minimum support = 1% at least in 1,595 transactions )
38
Processing
- Our FP-Growth (minimum support = 5% at least in 7,974 transactions)
39
Processing
NumberofGeneratedTransactions
40
Processing
NumberofGeneratedTransactions
41
Processing
Our minimum support = 0.8% (at least in 1,276 transactions)
Games -> politics Support : 1.30% (2,073 users) Confidence : 8.30%
We classified the rules in 4 groups based on their confidence
Post-processing
42
1% - 20% 21% - 40% 41% - 60% 61% - 100%
Group #1 Group #2 Group #3 Group #4
- {Antecedent} → {Consequent}
eg. {FIFA, Soccer} → {NFL}
- Lift ratio = confidence / Benchmark confidence
- Benchmark confidence = # consequent element in data set / total
transactions in data set
Post-processing
43
1.00
Post-processing
44
We created 8 surveys with 64
rules (2 rules per group) and 2
questions per rules.
Post-processing
Q1: Do you think subreddit A and subreddit B are related?
[ yes | no ]
Q2: If you are subscriber of subreddit A, will you also be interested in subreddit
B?
Definitely No [ 1 , 2 , 3 , 4 , 5 ] Definitely Yes
45
Dependency
Expected Confidence
●We got response from 52 persons
●Data Confidence vs Expected Confidence → Lift Over Expected
Post-processing
46
●% Non-related was proportion of the “No” answer to entire response in first
question.
●Interestingness = Average of Expected confidence * % Non-related.
Post-processing
47
Post-processing
48
Interestingness value per Group
Results
49
●How we can suggest new subreddits then?
1% - 20%61% - 100%
Group #1Group #4
Less Confidence
High Interestingness
High Confidence
Less Interestingness
Results
50
●Suggest them As:
1% - 20%61% - 100%
Group #1Group #4
Maybe you could be
Interested in...
Other People also
talks about….
Results
51
●Results from Group# 4 ( 9 Rules )
Results
52
●Results from Group# 1 ( 173 Rules )
Results
53
Group #4
Group #1
Challenges
- Reducing scope of data for decreasing computational time
- Defining and calculating the interestingness value
- How to suggest the rules that we got to reddit users
54
Conclusion
- We cannot be sure that our recommendation system will be 100% useful to
user since the interestingness can vary depending on the purpose of the
experiment
- To getting more accurate result, we need to ask all generated association
rules, more than 300 rules in survey
55
References
- Liu, B. (2000). Analyzing the subjective interestingness of association rules. 15(5), 47-55.
doi:10.1109/5254.889106
- Calculating Lift, How We Make Smart Online Product Recommendations
(https://www.youtube.com/watch?v=DeZFe1LerAQ)
- Reddit website (https://www.reddit.com/)
56
Application of Data Mining Techniques on Active
Users in Reddit
57
Raw data
Preprocessing
Clustering &
Evaluation
Knowledge
Data Mining Process
58
Facts - Why Active Users?
** http://www.pewinternet.org/2013/07/03/6-of-online-adults-are-reddit-users/
59
Facts - Why Active Users?
** http://www.snoosecret.com/statistics-about-reddit.html
60
Active Users Definitions
We define active users as people who :
1. who had posted or commented in at least 5 subreddits
2. who has at least 5 Posts or Comments in each of the
subreddits
3. # of Users’ comments above Q3
4. Average Score of the User, who satisfies three criteria
above > Q3
61
Preprocessing
Total posts in May2015: 54,504,410
Total distinct authors in May2015: 2,611,449
After deleting BOTs, [deleted], non-English, only-URL posts, and length < 3 posts,
we got 46,936,791 rows and 2,514,642 distinct authors.
Finally, we extracted 25,298 “active users” (0.97%)
and 5,007,845 posts (9.18%) by our active users
definitions
62
Clustering:
# of clusters (K) = 10 , k = √(n/2), k
= √(√(n/2))
using python sklearn module -
KMeans(), open source -
KPrototypes()
K-means, K-
prototype
63
Attributes
author = 1
subreddit = C (frequency: 27>others)
activity = 27/(11+6+27+17+3) = 0.42
assurance = 3/1 = 3
64
Clustering: K-means, K-prototype
ratio
65
Clustering: K-means, K-prototype
nominal
nominal
66
Evaluation
Separate data into 3 parts, A, B, C.
K-means && K-prototype for AB, AC
Compare labels of A which was from clustering result of AB, AC.
Measurement:
Adjusted Rand index: measure similarity between two list. (Ex. AB-A & AC-
A)
Homogeneity: each cluster contains only members of a single class
Completeness: all members of a given class are assigned to the same
cluster.
V-measure: harmonic mean of homogeneity & completeness
data A B C
data A B C
67
Evaluation: K-means v.s. K-prototype
K-means K-prototype
Adjusted Rand index 0.510904 0.193399
Homogeneity 0.803911 0.326116
Completeness 0.681669 0.298049
V-measure 0.737761 0.311451
68
VISUALIZATION
69
70
Visualization Part
71
72
73
74
RapidMiner
75
Weka
76
Orange
77
Kibana
78

Mais conteúdo relacionado

Mais procurados

Moving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedMoving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedJonathan Mugan
 
Machine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptMachine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptbutest
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engineLars Marius Garshol
 
Module 1.2 data preparation
Module 1.2  data preparationModule 1.2  data preparation
Module 1.2 data preparationSara Hooker
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with AnacondaTravis Oliphant
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9Roger Barga
 
Data Science With Python
Data Science With PythonData Science With Python
Data Science With PythonMosky Liu
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationSara Hooker
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Rich Heimann
 
Feature Engineering for Machine Learning at QConSP
Feature Engineering for Machine Learning at QConSPFeature Engineering for Machine Learning at QConSP
Feature Engineering for Machine Learning at QConSPAmanda Casari
 
How Can Software Engineering Support AI
How Can Software Engineering Support AIHow Can Software Engineering Support AI
How Can Software Engineering Support AIWalid Maalej
 
Tweet sentiment analysis (Data mining)
Tweet sentiment analysis (Data mining)Tweet sentiment analysis (Data mining)
Tweet sentiment analysis (Data mining)Anil Shrestha
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown BagDataTactics
 
Applied Artificial Intelligence Unit 4 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 4 Semester 3 MSc IT Part 2 Mumbai Univer...Applied Artificial Intelligence Unit 4 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 4 Semester 3 MSc IT Part 2 Mumbai Univer...Madhav Mishra
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningShahar Cohen
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligenceananth
 
Active learning lecture
Active learning lectureActive learning lecture
Active learning lectureazuring
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learningbutest
 

Mais procurados (20)

Moving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedMoving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow Extended
 
Active learning
Active learningActive learning
Active learning
 
Eric Smidth
Eric SmidthEric Smidth
Eric Smidth
 
Machine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptMachine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.ppt
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
Module 1.2 data preparation
Module 1.2  data preparationModule 1.2  data preparation
Module 1.2 data preparation
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9
 
Data Science With Python
Data Science With PythonData Science With Python
Data Science With Python
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
Feature Engineering for Machine Learning at QConSP
Feature Engineering for Machine Learning at QConSPFeature Engineering for Machine Learning at QConSP
Feature Engineering for Machine Learning at QConSP
 
How Can Software Engineering Support AI
How Can Software Engineering Support AIHow Can Software Engineering Support AI
How Can Software Engineering Support AI
 
Tweet sentiment analysis (Data mining)
Tweet sentiment analysis (Data mining)Tweet sentiment analysis (Data mining)
Tweet sentiment analysis (Data mining)
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Applied Artificial Intelligence Unit 4 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 4 Semester 3 MSc IT Part 2 Mumbai Univer...Applied Artificial Intelligence Unit 4 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 4 Semester 3 MSc IT Part 2 Mumbai Univer...
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligence
 
Active learning lecture
Active learning lectureActive learning lecture
Active learning lecture
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
 

Destaque

[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)
[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)
[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)台灣資料科學年會
 
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹台灣資料科學年會
 
[系列活動] 智慧城市中的時空大數據應用
[系列活動] 智慧城市中的時空大數據應用[系列活動] 智慧城市中的時空大數據應用
[系列活動] 智慧城市中的時空大數據應用台灣資料科學年會
 
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會台灣資料科學年會
 
[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務台灣資料科學年會
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程台灣資料科學年會
 
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨台灣資料科學年會
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R台灣資料科學年會
 
[系列活動] 給工程師的統計學及資料分析 123
[系列活動] 給工程師的統計學及資料分析 123[系列活動] 給工程師的統計學及資料分析 123
[系列活動] 給工程師的統計學及資料分析 123台灣資料科學年會
 
機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體鍾誠 陳鍾誠
 
R統計軟體 -安裝與使用
R統計軟體 -安裝與使用R統計軟體 -安裝與使用
R統計軟體 -安裝與使用Person Lin
 
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studioR Ladies Taipei
 
R統計軟體簡介
R統計軟體簡介R統計軟體簡介
R統計軟體簡介Person Lin
 
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室台灣資料科學年會
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities台灣資料科學年會
 
資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science Orientation資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science OrientationRyan Chung
 

Destaque (20)

[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)
[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)
[系列活動] 智慧製造與生產線上的資料科學 (製造資料科學:從預測性思維到處方性決策)
 
[系列活動] 機器學習速遊
[系列活動] 機器學習速遊[系列活動] 機器學習速遊
[系列活動] 機器學習速遊
 
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
 
[系列活動] 智慧城市中的時空大數據應用
[系列活動] 智慧城市中的時空大數據應用[系列活動] 智慧城市中的時空大數據應用
[系列活動] 智慧城市中的時空大數據應用
 
李育杰/The Growth of a Data Scientist
李育杰/The Growth of a Data Scientist李育杰/The Growth of a Data Scientist
李育杰/The Growth of a Data Scientist
 
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
 
[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程
 
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R
 
[系列活動] 給工程師的統計學及資料分析 123
[系列活動] 給工程師的統計學及資料分析 123[系列活動] 給工程師的統計學及資料分析 123
[系列活動] 給工程師的統計學及資料分析 123
 
第一場預測
第一場預測第一場預測
第一場預測
 
機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體
 
R統計軟體 -安裝與使用
R統計軟體 -安裝與使用R統計軟體 -安裝與使用
R統計軟體 -安裝與使用
 
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
 
新手村-資料探索
新手村-資料探索新手村-資料探索
新手村-資料探索
 
R統計軟體簡介
R統計軟體簡介R統計軟體簡介
R統計軟體簡介
 
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities
 
資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science Orientation資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science Orientation
 

Semelhante a [系列活動] 資料探勘速遊 - Session4 case-studies

Free2 play soft launch obtaining tangible results through action-oriented a...
Free2 play soft launch   obtaining tangible results through action-oriented a...Free2 play soft launch   obtaining tangible results through action-oriented a...
Free2 play soft launch obtaining tangible results through action-oriented a...Mary Chan
 
Analyzing Video Game Sales
Analyzing Video Game SalesAnalyzing Video Game Sales
Analyzing Video Game SalesSuzanne Simmons
 
Game analytics - The challenges of mobile free-to-play games
Game analytics - The challenges of mobile free-to-play gamesGame analytics - The challenges of mobile free-to-play games
Game analytics - The challenges of mobile free-to-play gamesChristian Beckers
 
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...PAPIs.io
 
Collaborative Filtering 1: User-based CF
Collaborative Filtering 1: User-based CFCollaborative Filtering 1: User-based CF
Collaborative Filtering 1: User-based CFYusuke Yamamoto
 
Spatz.ai for Teams - A referee toolkit for fair play
Spatz.ai for Teams - A referee toolkit for fair playSpatz.ai for Teams - A referee toolkit for fair play
Spatz.ai for Teams - A referee toolkit for fair playDesmond Sherlock
 
Spatz.ai for Teams - A referee toolkit for unfair idea-challenges
Spatz.ai for Teams - A referee toolkit for unfair idea-challengesSpatz.ai for Teams - A referee toolkit for unfair idea-challenges
Spatz.ai for Teams - A referee toolkit for unfair idea-challengesDesmond Sherlock
 
Spatz.ai for Teams - A referee toolkit for unfair idea-challenges
Spatz.ai for Teams - A referee toolkit for unfair idea-challengesSpatz.ai for Teams - A referee toolkit for unfair idea-challenges
Spatz.ai for Teams - A referee toolkit for unfair idea-challengesDesmond Sherlock
 
SpatzAI - A referee toolkit for challenging bold ideas in teams
SpatzAI - A referee toolkit for challenging bold ideas in teamsSpatzAI - A referee toolkit for challenging bold ideas in teams
SpatzAI - A referee toolkit for challenging bold ideas in teamsDesmond Sherlock
 
Figuring out the right metrics for your game
Figuring out the right metrics for your gameFiguring out the right metrics for your game
Figuring out the right metrics for your gameSaurav Sahu
 
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough dataВладимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough dataMail.ru Group
 
SpatzAI - A Referee toolkit to resolve spats in teams
SpatzAI - A Referee toolkit to resolve spats in teamsSpatzAI - A Referee toolkit to resolve spats in teams
SpatzAI - A Referee toolkit to resolve spats in teamsDesmond Sherlock
 
SpatsAI - A Referee toolkit to resolve spats in teams
SpatsAI - A Referee toolkit to resolve spats in teamsSpatsAI - A Referee toolkit to resolve spats in teams
SpatsAI - A Referee toolkit to resolve spats in teamsDesmond Sherlock
 
Why Successful Games Need Analytics
Why Successful Games Need AnalyticsWhy Successful Games Need Analytics
Why Successful Games Need AnalyticsData Science Club
 
Future of AI-powered automation in business
Future of AI-powered automation in businessFuture of AI-powered automation in business
Future of AI-powered automation in businessLouis Dorard
 
How to Run Discrete Choice Conjoint Analysis
How to Run Discrete Choice Conjoint AnalysisHow to Run Discrete Choice Conjoint Analysis
How to Run Discrete Choice Conjoint AnalysisQuestionPro
 
Search Engine Results: The Best Measure?
Search Engine Results: The Best Measure? Search Engine Results: The Best Measure?
Search Engine Results: The Best Measure? Fan Foundry
 
Learning Analytics Design in Game-based Learning
Learning Analytics Design in Game-based LearningLearning Analytics Design in Game-based Learning
Learning Analytics Design in Game-based LearningMIT
 
Learning Games Design Framework
Learning Games Design FrameworkLearning Games Design Framework
Learning Games Design FrameworkKris Lee
 

Semelhante a [系列活動] 資料探勘速遊 - Session4 case-studies (20)

Free2 play soft launch obtaining tangible results through action-oriented a...
Free2 play soft launch   obtaining tangible results through action-oriented a...Free2 play soft launch   obtaining tangible results through action-oriented a...
Free2 play soft launch obtaining tangible results through action-oriented a...
 
Analyzing Video Game Sales
Analyzing Video Game SalesAnalyzing Video Game Sales
Analyzing Video Game Sales
 
Game analytics - The challenges of mobile free-to-play games
Game analytics - The challenges of mobile free-to-play gamesGame analytics - The challenges of mobile free-to-play games
Game analytics - The challenges of mobile free-to-play games
 
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
 
Collaborative Filtering 1: User-based CF
Collaborative Filtering 1: User-based CFCollaborative Filtering 1: User-based CF
Collaborative Filtering 1: User-based CF
 
Spatz.ai for Teams - A referee toolkit for fair play
Spatz.ai for Teams - A referee toolkit for fair playSpatz.ai for Teams - A referee toolkit for fair play
Spatz.ai for Teams - A referee toolkit for fair play
 
Spatz.ai for Teams - A referee toolkit for unfair idea-challenges
Spatz.ai for Teams - A referee toolkit for unfair idea-challengesSpatz.ai for Teams - A referee toolkit for unfair idea-challenges
Spatz.ai for Teams - A referee toolkit for unfair idea-challenges
 
Spatz.ai for Teams - A referee toolkit for unfair idea-challenges
Spatz.ai for Teams - A referee toolkit for unfair idea-challengesSpatz.ai for Teams - A referee toolkit for unfair idea-challenges
Spatz.ai for Teams - A referee toolkit for unfair idea-challenges
 
SpatzAI - A referee toolkit for challenging bold ideas in teams
SpatzAI - A referee toolkit for challenging bold ideas in teamsSpatzAI - A referee toolkit for challenging bold ideas in teams
SpatzAI - A referee toolkit for challenging bold ideas in teams
 
Figuring out the right metrics for your game
Figuring out the right metrics for your gameFiguring out the right metrics for your game
Figuring out the right metrics for your game
 
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough dataВладимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
 
SpatzAI - A Referee toolkit to resolve spats in teams
SpatzAI - A Referee toolkit to resolve spats in teamsSpatzAI - A Referee toolkit to resolve spats in teams
SpatzAI - A Referee toolkit to resolve spats in teams
 
SpatsAI - A Referee toolkit to resolve spats in teams
SpatsAI - A Referee toolkit to resolve spats in teamsSpatsAI - A Referee toolkit to resolve spats in teams
SpatsAI - A Referee toolkit to resolve spats in teams
 
Why Successful Games Need Analytics
Why Successful Games Need AnalyticsWhy Successful Games Need Analytics
Why Successful Games Need Analytics
 
Future of AI-powered automation in business
Future of AI-powered automation in businessFuture of AI-powered automation in business
Future of AI-powered automation in business
 
How to Run Discrete Choice Conjoint Analysis
How to Run Discrete Choice Conjoint AnalysisHow to Run Discrete Choice Conjoint Analysis
How to Run Discrete Choice Conjoint Analysis
 
Search Engine Results: The Best Measure?
Search Engine Results: The Best Measure? Search Engine Results: The Best Measure?
Search Engine Results: The Best Measure?
 
DECISION SUPPORT SYSTEMS
DECISION SUPPORT SYSTEMSDECISION SUPPORT SYSTEMS
DECISION SUPPORT SYSTEMS
 
Learning Analytics Design in Game-based Learning
Learning Analytics Design in Game-based LearningLearning Analytics Design in Game-based Learning
Learning Analytics Design in Game-based Learning
 
Learning Games Design Framework
Learning Games Design FrameworkLearning Games Design Framework
Learning Games Design Framework
 

Mais de 台灣資料科學年會

[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用台灣資料科學年會
 
[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告台灣資料科學年會
 
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰台灣資料科學年會
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機台灣資料科學年會
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機台灣資料科學年會
 
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話台灣資料科學年會
 
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇台灣資料科學年會
 
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 [TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 台灣資料科學年會
 
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵台灣資料科學年會
 
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用台灣資料科學年會
 
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告台灣資料科學年會
 
[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話台灣資料科學年會
 
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人台灣資料科學年會
 
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維台灣資料科學年會
 
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察台灣資料科學年會
 
[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰台灣資料科學年會
 
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT台灣資料科學年會
 
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達台灣資料科學年會
 
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳台灣資料科學年會
 

Mais de 台灣資料科學年會 (20)

[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用
 
[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告
 
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
 
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
 
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
 
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 [TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
 
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
 
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
 
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
 
台灣人工智慧學校成果發表會
台灣人工智慧學校成果發表會台灣人工智慧學校成果發表會
台灣人工智慧學校成果發表會
 
[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話
 
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
 
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
 
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
 
[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰
 
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
 
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
 
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
 

Último

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 

Último (20)

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 

[系列活動] 資料探勘速遊 - Session4 case-studies

  • 1. Case Studies Yi-Shin Chen Institute of Information Systems and Applications Department of Computer Science National Tsing Hua University yishin@gmail.com Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation
  • 2. Case: Mining Reddit Data 不指定特定目標的Cases 2
  • 4. Reddit: The Front Page of the Internet 50k+ on this set
  • 5. Subreddit Categories ▷Reddit’s structure may already provide a baseline similarity
  • 9. 9
  • 10. Data Mining Final Project Role-Playing Games Sales Prediction Group 6
  • 11. Dataset • Use Reddit comment dataset • Data Selection – Choose 30 Role-playing games from the subreddits. – Choose useful attributes form their comment • Body • Score • Subreddit 11
  • 12. 12
  • 13. Preprocessing • Common Game Features – Drop trivial words • We manually build the trivial words dictionary by ourselves. – Ex : “is” “the” “I” “you” “we” “a”. – TF-IDF • We use Tf-IDF to calculate the importance of each words. 13
  • 14. Preprocessing – TF-IDF result • But the TF-IDF value are too small to find the significant importance. – Define Comment Feature • We manually define the common gaming words in five categories by referring several game website. Game-Play | Criticize | Graphic | Music | Story 14
  • 15. Preprocessing • Filtering useful comments – Common keywords of 5 categories: – Filtering useful comments • Using the frequent keywords to find out the useful comments in each category. 15
  • 16. Preprocessing – Using FP-Growth to find other feature • To Find other frequent keywords {time, play, people, gt, limited} {time, up, good, now, damage, gear, way, need, build, better, d3, something, right, being, gt, limited} • Then, manually choose some frequent keywords 16
  • 17. Preprocessing • Comment Emotion – Filtering outliers • First filter out those comment whose “score” separate in top and bottom 2.5% which indicate that they might be outliers. 17
  • 18. Preprocessing – Emotion detection • We use LIWC dictionary to calculate each comment’s positive and negative emotion percentage. Ex: I like this game. -> positive EX: I hate this character. -> negative • Then find each category’s positive and negative emotion score. – 𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑃= 𝑗=0 𝑛 𝑠𝑐𝑜𝑟𝑒_𝑒𝑎𝑐ℎ_𝑐𝑜𝑚𝑚𝑒𝑛𝑡 𝑗 * (positive words) – 𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑁= 𝑗=0 𝑛 𝑠𝑐𝑜𝑟𝑒_𝑒𝑎𝑐ℎ_𝑐𝑜𝑚𝑚𝑒𝑛𝑡 𝑗 * (negative words) – 𝑇𝑜𝑡𝑎𝑙𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦= 𝑖=0 𝑚 𝑠𝑐𝑜𝑟𝑒_𝑒𝑎𝑐ℎ_𝑐𝑜𝑚𝑚𝑒𝑛𝑡 𝑚 – 𝐹𝑖𝑛𝑎𝑙𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑃 = 𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑃/ 𝑇𝑜𝑡𝑎𝑙𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦 – 𝐹𝑖𝑛𝑎𝑙𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑁 = 𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦_𝑁/ 𝑇𝑜𝑡𝑎𝑙𝑆𝑐𝑜𝑟𝑒 𝑒𝑎𝑐ℎ_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦 18
  • 19. Preprocessing – Emotion detection • We use LIWC dictionary to calculate each comment’s positive and negative emotion percentage. Ex: I like this game. -> positive EX: I hate this character. -> negative • Then find each category’s positive and negative emotion score. 19 The meaning is : Finding each category’s emotion score for each game Calculating TotalScore of each category’s comments Having the average emotion score FinalScore of each category
  • 20. Preprocessing • Sales Data Extraction – Crawling website’s data – Find the games’ sales on each platform 20
  • 21. Preprocessing – Annually sales for each game – Median of each platform – Mean of all platform Median : 0.17 -> 30 games (26H 4L) Mean : 0.533 -> 30 games (18H 12L) – Try both of Median/ Mean to do the prediction. 21
  • 22. Sales Prediction • Model Construction – We use the 4 outputs from pre-processing step to be the input of the Naïve Bayes classifier 22
  • 23. Evaluation We use naïve Bayes to evaluate our result 1. training 70% & test30% 2. Leave-one-out 23
  • 24. Evaluation (train70% & test30% ) 0.67 0.78 0.67 0.56 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Accuracy Median 0.17 Test1 Total_score Test1 No Total_score Test2 Total_score Test2 No Total_score 0.78 0.44 0.56 0.22 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Accuracy Mean 0.533 Test1 Total_score Test1 No Total_score Test2 Total_score Test2 No Total_score 24
  • 25. Evaluation(Leave-one-out) Choose 29 games as training data and the other one as test data, and totally run 30times to get the accuracy. Total 30 times Accuracy Median Total_score 7 times wrong 77% Median NO Total_score 5 times wrong 83% Mean Total_score 6 times wrong 80% Mean No Total_score 29 times wrong 3% 25
  • 26. Attributes: Original Features Scores HL Median: 0.17M Mean: 0.53M Attribute distribution to target(sales) domain Sales boundary for H and L class H, L Sales class Error Times Accuracy Median 5 83% Mean 29 3% Evaluation: Leave-one-Out Total Sample size: 30 26
  • 27. Attributes: Transformed Features Scores Boundary Error Times Accuracy Median 7 77% Mean 6 80% HL Median: 0.17M Mean: 0.53M Attribute distribution project to target(sales) domain Sales boundary for H and L class H, L Sales class Evaluation: Leave-one-Out Total Sample size: 30 27
  • 28. Finding related subreddits based on the gamers’ participation Data Mining Final Project Group#4
  • 29. Goal & Hypothesis - To suggest new subreddits to the Gaming subreddit users based on the participation of other users 29
  • 30. Data Exploration We extracted the Top 20 gaming related subreddits to be focused on. 30 leagueoflegends DestinyTheGame DotA2 GlobalOffensive Gaming GlobalOffensiveTrade witcher hearthstone Games 2007scape smashbros wow Smite heroesofthestorm EliteDangerous FIFA Guildwars2 tf2 summonerswar runescape
  • 31. 31 Data Exploration (Cont.) We sorted the subreddits by users’ comment activities. We found out that the most active subreddit is Leagueoflegends.
  • 33. Data Pre-processing - Removed comments from moderators and bots (>2,000 comments) - Removed comments from default subreddits - Extracted all the users that commented on at least 3 subreddits - Transformed data into “Market Basket” transactions: 159,475 users 33
  • 34. 34 Processing Sorted the frequent items (sort.js) Eg. DotA2, Neverwinter, atheism → DotA2, atheism, Neverwinter
  • 35. 35 Processing - Choosing the minimum support from 159,475 Transactions - Max possible support 25.71% (at least in 41,004 Transactions) - Min possible support 0.0000062% (at least in 1 Transaction) Min_Support as 0.05 % = 79.73 transactions
  • 36. 36 Processing - How a FP-Growth tree looks like
  • 37. 37 Processing - Our FP-Growth (minimum support = 1% at least in 1,595 transactions )
  • 38. 38 Processing - Our FP-Growth (minimum support = 5% at least in 7,974 transactions)
  • 41. 41 Processing Our minimum support = 0.8% (at least in 1,276 transactions) Games -> politics Support : 1.30% (2,073 users) Confidence : 8.30%
  • 42. We classified the rules in 4 groups based on their confidence Post-processing 42 1% - 20% 21% - 40% 41% - 60% 61% - 100% Group #1 Group #2 Group #3 Group #4
  • 43. - {Antecedent} → {Consequent} eg. {FIFA, Soccer} → {NFL} - Lift ratio = confidence / Benchmark confidence - Benchmark confidence = # consequent element in data set / total transactions in data set Post-processing 43 1.00
  • 44. Post-processing 44 We created 8 surveys with 64 rules (2 rules per group) and 2 questions per rules.
  • 45. Post-processing Q1: Do you think subreddit A and subreddit B are related? [ yes | no ] Q2: If you are subscriber of subreddit A, will you also be interested in subreddit B? Definitely No [ 1 , 2 , 3 , 4 , 5 ] Definitely Yes 45 Dependency Expected Confidence
  • 46. ●We got response from 52 persons ●Data Confidence vs Expected Confidence → Lift Over Expected Post-processing 46
  • 47. ●% Non-related was proportion of the “No” answer to entire response in first question. ●Interestingness = Average of Expected confidence * % Non-related. Post-processing 47
  • 49. Results 49 ●How we can suggest new subreddits then? 1% - 20%61% - 100% Group #1Group #4 Less Confidence High Interestingness High Confidence Less Interestingness
  • 50. Results 50 ●Suggest them As: 1% - 20%61% - 100% Group #1Group #4 Maybe you could be Interested in... Other People also talks about….
  • 54. Challenges - Reducing scope of data for decreasing computational time - Defining and calculating the interestingness value - How to suggest the rules that we got to reddit users 54
  • 55. Conclusion - We cannot be sure that our recommendation system will be 100% useful to user since the interestingness can vary depending on the purpose of the experiment - To getting more accurate result, we need to ask all generated association rules, more than 300 rules in survey 55
  • 56. References - Liu, B. (2000). Analyzing the subjective interestingness of association rules. 15(5), 47-55. doi:10.1109/5254.889106 - Calculating Lift, How We Make Smart Online Product Recommendations (https://www.youtube.com/watch?v=DeZFe1LerAQ) - Reddit website (https://www.reddit.com/) 56
  • 57. Application of Data Mining Techniques on Active Users in Reddit 57
  • 59. Facts - Why Active Users? ** http://www.pewinternet.org/2013/07/03/6-of-online-adults-are-reddit-users/ 59
  • 60. Facts - Why Active Users? ** http://www.snoosecret.com/statistics-about-reddit.html 60
  • 61. Active Users Definitions We define active users as people who : 1. who had posted or commented in at least 5 subreddits 2. who has at least 5 Posts or Comments in each of the subreddits 3. # of Users’ comments above Q3 4. Average Score of the User, who satisfies three criteria above > Q3 61
  • 62. Preprocessing Total posts in May2015: 54,504,410 Total distinct authors in May2015: 2,611,449 After deleting BOTs, [deleted], non-English, only-URL posts, and length < 3 posts, we got 46,936,791 rows and 2,514,642 distinct authors. Finally, we extracted 25,298 “active users” (0.97%) and 5,007,845 posts (9.18%) by our active users definitions 62
  • 63. Clustering: # of clusters (K) = 10 , k = √(n/2), k = √(√(n/2)) using python sklearn module - KMeans(), open source - KPrototypes() K-means, K- prototype 63
  • 64. Attributes author = 1 subreddit = C (frequency: 27>others) activity = 27/(11+6+27+17+3) = 0.42 assurance = 3/1 = 3 64
  • 67. Evaluation Separate data into 3 parts, A, B, C. K-means && K-prototype for AB, AC Compare labels of A which was from clustering result of AB, AC. Measurement: Adjusted Rand index: measure similarity between two list. (Ex. AB-A & AC- A) Homogeneity: each cluster contains only members of a single class Completeness: all members of a given class are assigned to the same cluster. V-measure: harmonic mean of homogeneity & completeness data A B C data A B C 67
  • 68. Evaluation: K-means v.s. K-prototype K-means K-prototype Adjusted Rand index 0.510904 0.193399 Homogeneity 0.803911 0.326116 Completeness 0.681669 0.298049 V-measure 0.737761 0.311451 68
  • 70. 70
  • 72. 72
  • 73. 73
  • 74. 74