10. MACHINE LEARNING
• computer programs that automatically improve with experience
• interdisciplinary in nature
• employs techniques from the fields of computer science, statistics, and
artificial intelligence, among others.
• algorithms which facilitate automatic improvement from experience
• machine learning is a central aspect of data science.
• pattern recognition
• Machine learning has a complex relationship with data mining.
15. WHAT DOES
FACEBOOK DO TO
YOUR DATA?
• learning what consumers prefer
• emotional contagion study
• Cookies in your browser predict who you are
• Social plugins ("like", "subscribe" or "recommend" buttons)
• information that Facebook sells to advertisers
16. We’ve agreed to turn over a huge
amount of data, and signed off on the
social network’s seemingly
limitless ability to do with it
whatever it wants
17. FACEBOOK DATA SCIENCE
crawled or scraped data will be valuable and
constructive for commercial, scientific, and many
other fields of prediction and analysis
18. FACEBOOK’S DATA PRIVACY
POLICY:
• …in addition to helping people see and find things that you do and share,
we may use the information we receive about you… for internal operations,
including troubleshooting, data analysis, testing, research and service
improvement.
19. OCTOPARSE
Octoparse is a powerful web scraper that
can scrape both static and dynamic
websites that use AJAX, JavaScript, cookies,
etc.
21. VISUAL SCRAPER
• Visual Scraper is another great free
web scraper with a simple point-and-click
interface
• collect data from the web
• export the extracted data as CSV,
XML, JSON or SQL files.
• scrape data from up to 50,000 web
pages for only one
user.
23. FACEBOOK DATA SCIENCE USING R
• R is a programming language and
software environment for statistical
computing, widely used for data mining
and for analyzing big data.
• Data science in FB
using R.pdf
24. • The Rfacebook package provides an interface
to the Facebook API. Its functions let R
retrieve information about posts,
comments, likes, groups that mention
specific keywords, and much more.
28. But: it is not much different from
what we, especially statisticians,
have been doing for many years
29. Much more data is digitally available than
was before
Inexpensive computing + Cloud + Easy-to-
use programming frameworks = Much
easier to analyze it
Often: large-scale data + simple
algorithms > small data + complex
algorithms
Changes how you do analysis
dramatically
30. • Causation --> Correlation: the goal of
analysis is often to figure out what
caused what, but causation is very hard
to establish.
• Example: what causes breast cancer and other diseases?
• Data science looks for correlations that help explain why things happen:
• when an earthquake will come
• why students fail or pass the board exam
• who gets a job after graduation, and why
• It does so using data understanding and computer science algorithms.
31. "Datafication":
•Process of converting abstract
things into concrete data e.g.,
what you like represented as a
stream of your likes;
•your "sitting posture" captured
using hundreds of sensors placed in
a car seat
32. • Google Flu Trends
• Early warning of flu outbreaks by analyzing search
queries
• Up to 1 or 2 weeks ahead of CDC
• Analyzed 50M search queries to see which of them fit
the physician visits for flu
• 45 search terms used to create a single model
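One way to picture "which queries fit the physician visits" is to rank candidate queries by how well their weekly counts correlate with weekly flu visits. The sketch below is only an illustration under that assumption: the weekly numbers and the query names are made up, and Pearson correlation is a stand-in, not necessarily the measure Google used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up weekly figures, for illustration only.
flu_visits = [10, 12, 18, 30, 45, 38, 20, 11]
query_counts = {
    "flu symptoms": [100, 130, 170, 310, 440, 390, 210, 120],
    "movie times": [500, 480, 510, 495, 505, 490, 500, 515],
}

# Rank candidate queries from best to worst fit with the visit series.
ranked = sorted(
    query_counts,
    key=lambda q: pearson(query_counts[q], flu_visits),
    reverse=True,
)
print(ranked)
```

Queries at the top of `ranked` would be the candidates for a combined model; the slide's figure of 45 terms refers to the real system, not to this toy data.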
34. DATA SCIENCE PROJECTS
•Determining Rice Bug Epidemic Using Decision
Trees
•Prediction Model for Students’ Performance in
Java Programming with Course-content
Recommendation System
•Predicting IT Employability Using Data Mining
Techniques
35. DETERMINING RICE BUG EPIDEMIC USING
DECISION TREES
• Roland Calderon, Menchita Dumlao et al. (2016)
• applies data mining techniques in agriculture to predict future trends such as
bug epidemics.
• Insect Epidemiology Data Mining (IEDM).
• IEDM - Discrete Mathematics and Theoretical Computer Science (DIMACS)
that aims to provide an opportunity to develop and test problem instances
and other methods of testing and comparing performance of algorithms
Data Science Projects:
36. DETERMINING RICE BUG EPIDEMIC USING
DECISION TREES
• uses a decision tree
• classification and prediction
• represents rules
• CRISP-DM methodology
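A decision tree picks, at each node, the attribute that best separates the classes; each root-to-leaf path then reads as a rule. The sketch below shows one common selection criterion, information gain. It is an assumption for illustration: the slides do not say which criterion the study used, and the six records and attribute names are invented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Invented records, not data from the study.
rows = [
    {"lunar_phase": "full", "stage": "vegetative", "status": "outbreak"},
    {"lunar_phase": "full", "stage": "ripening", "status": "outbreak"},
    {"lunar_phase": "quarter", "stage": "vegetative", "status": "outbreak"},
    {"lunar_phase": "quarter", "stage": "ripening", "status": "normal"},
    {"lunar_phase": "new", "stage": "vegetative", "status": "normal"},
    {"lunar_phase": "new", "stage": "ripening", "status": "normal"},
]

# The attribute with the largest gain becomes the root split.
best = max(["lunar_phase", "stage"], key=lambda a: information_gain(rows, a, "status"))
print(best)
```

Repeating the same choice inside each branch grows the rest of the tree.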
37. • Rice Field Insect Light Trap (RFILT): mass-traps both sexes
of insect pests
• used to determine insect distribution, abundance, flight patterns, and the
timing of pesticide application
38. • forecasting precision of a predictive model is measured with a confusion
matrix
42. • Lunar Cycle level is the best predictor of epidemic status
• followed by Vegetative stage level
• At the Vegetative stage level, 100% of cases resulted in outbreak status
43. • For the Ripening stage, the next best predictor is temperature.
• Over 82% of bugs occurred in the outbreak status if the temperature is less than or
equal to 32 to 38
• 97.3% if the temperature is greater than 32
• For the Reproduction and Resting stages, 52.7% of bugs occurred in the infested
status; this is also considered a terminal node.
44. • Evale, Digna, Menchita F. Dumlao, et al. (2016)
• comparative analysis of different data mining algorithms for attribute
selection and classification
• a two-phase study that aimed to predict students’ performance in
Java Programming and to generate recommendations
Prediction Model for Students’ Performance in Java
Programming with Course-content Recommendation
System
Data Science Projects:
45. • Knowledge Discovery in Database (KDD)
• Logistic Regression and Correlation-based Feature Selection were used to
find significant predictors
• Classifiers such as CHAID, Exhaustive CHAID, CRT, QUEST, J48, BayesNet,
NaïveBayes and JRip were implemented
46. • J48 has the highest percentage of correct predictions.
• For the second phase, evolutionary prototyping was implemented
• Ruby on Rails: a web-based examination module that determines the
students’ index of learning style and assesses their prior knowledge in Java
47. • The system automatically generates a course-content recommendation that
presents the learners’ strengths and weaknesses in the subject, with a
suggested learning method.
48. • KDD: selection, pre-processing, transformation, mining, and interpretation.
• Selection: possible attributes are collected for the data set.
• Pre-processing: filtering and removing irrelevant data.
• Transformation: determining the most suitable data mining technique to
provide the best prediction algorithm.
• Mining: discovering patterns, captured through classification rules,
regression models or decision trees.
• Evaluation or interpretation: visualizing the patterns extracted from the models.
49. • Tools: Waikato Environment for Knowledge Analysis (WEKA) data mining tool and
IBM Statistical Package for the Social Sciences (SPSS).
• There were 8 attributes: gender, age, course, section, schedule, and 3
academic performance grades for programming languages.
50. • Attribute selection was done using Standard Regression Analysis, Forward
and Backward Conditional Regression, Likelihood Ratio, and Wald
• WEKA was also used for pre-processing, through filtering with
AttributeSelection
51. • Summary of Attribute Selection Result
52. • 2 significant attributes out of the eight original attributes
• With a critical p value of .05 (significant predictors should have a p value
smaller than the critical value)
• Binary Logistic Regression (SPSS)
• section and course were highly insignificant, with p values of .747 and .221
respectively.
53. • Pre-processing using attribute selection (SPSS and WEKA)
• course and section were automatically removed (highly insignificant)
54. • CfsSubsetEvaluation: used to further verify the significance of the attribute gender
• BestFirst method: gender was found significant, with a merit of best subset of
0.239 (on a scale of 0 to 1, incorrectly classified instances)
• 76.1% correctly classified instances
55. • GreedyStepWise search method (through Cross Validation)
• course and section were not found in any of the ten folds, while gender
appeared in 7 out of 10 folds (70%).
• significant predictors: age, gender, schedule, grade in Programming 1,
grade in Programming 2, and grade in Programming 3.
56. Summary of Accuracy of Different Algorithms tested
57. • J48 is the best algorithm
• J48 has highest accuracy in making predictions
• It also has the highest Cohen’s Kappa value, which means that the prediction
is strongly reliable, with 64% to 81% reliability
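Cohen's kappa compares the observed agreement between predictions and actual classes with the agreement expected by chance. A minimal sketch follows, assuming a square confusion matrix of counts; the numbers are invented and are not the study's results.

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix.

    matrix[i][j] is the number of instances whose actual class is i
    and whose predicted class is j.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column marginals per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)

# Invented counts for a two-class problem.
matrix = [[40, 10], [5, 45]]
kappa = cohens_kappa(matrix)
print(round(kappa, 2))
```

A kappa of 0 means agreement no better than chance, and 1 means perfect agreement.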
58. • Piad, Keno, Menchita F. Dumlao, et al. (2016)
• Knowledge discovery in databases (KDD)
• CRISP-DM (Cross-Industry Standard Process for Data Mining)
• Naive Bayes
• Decision Tree
• Ensemble
Data Science Projects:
Predicting IT Employability Using Data Mining Techniques
59. • pre-processing
• data sets : training and testing data sets
• training datasets: used to generate model
• testing datasets: used to determine the acceptability of the model.
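The split into training and testing sets can be sketched as below. The 25% test fraction, the fixed seed, and the function name are assumptions for illustration; the slides do not state how the study partitioned its data.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of the records and cut it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(20))  # stand-in for 20 student records
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is fitted on `train` only; `test` is held back to judge whether the model is acceptable.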
60. •Apriori algorithm: determines
associated attributes that frequently
occur together in the data sets
•decision tree and naive Bayes
algorithms: used to design the
predictive model
•predictive model = equation or rule
sets for prediction
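The Apriori idea can be sketched as a level-wise search: count single attributes, keep the frequent ones, then build larger candidate sets only from sets that were already frequent. This is a simplified sketch, not the implementation used in the study; the attribute names, records, and support threshold are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical graduate records expressed as sets of attributes.
transactions = [
    {"high_cgpa", "internship", "employed"},
    {"high_cgpa", "employed"},
    {"internship", "employed"},
    {"high_cgpa", "internship", "employed"},
    {"low_cgpa"},
]
frequent = apriori(transactions, min_support=0.6)
for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Frequent pairs such as `{"high_cgpa", "employed"}` are the "associated attributes" that a later rule-generation step would turn into rules.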
61. • Data sources: graduate tracer, student’s biographic profile,
cumulative grade point average (CGPA)
• 685 instances (tuples), SY 2011-2015
• training and testing sets of data
• Tools: WEKA and SPSS
• Rule set or equation learned, applied to instances of the testing sets
62. Accuracy Result in Predicting IT Graduate Employability

Algorithm             Accuracy Result (%)   Error Estimation Rate (%)
Naive Bayes           75.33                 24.47
J48                   74.95                 25.05
SimpleCart            73.01                 26.99
Logistic regression   78.4                  22.60
CHAID                 76.3                  23.70
63. • Logistic regression measures the relationship between the categorical
dependent variable and the independent variables by estimating
probabilities with a logistic function
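The logistic function maps any weighted sum of predictors to a probability between 0 and 1. A minimal sketch follows; the weights, bias, and feature values are hypothetical and not taken from the study's fitted model.

```python
from math import exp

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for one record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical record: [CGPA, has_internship], with made-up coefficients.
p = predict_probability([3.2, 1.0], weights=[0.8, 0.5], bias=-2.5)
print(round(p, 3))
```

A record is typically assigned to the positive class when the probability exceeds a cutoff such as 0.5.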
64. Accuracy Result in Predicting IT Specific Profession

Algorithm          Accuracy Result (%)   Error Estimation Rate (%)
CHAID              70.1                  29.9
QUEST              40                    60
CRT                70.2                  29.8
Exhaustive CHAID   70.1                  29.9
ID3                67                    33
J48                70                    30
65. •CRT: Classification and Regression Trees.
•CRT splits the data into segments that are as
homogeneous as possible with respect to the dependent variable.
•A node in which all cases have the same value for the
dependent variable is a homogeneous,
"pure" node.
66. • The CRT growing method attempts to maximize within-node homogeneity.
• The extent to which a node does not represent a homogeneous subset of cases
is an indication of impurity.
• A terminal node in which all cases have the same value for the dependent
variable is a homogeneous node that requires no further splitting because it is
“pure.”
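Impurity can be made concrete with the Gini index, a common impurity measure for CART-style trees. The slides do not say which measure the study used, so treat this as an assumption; the labels below are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 0.0 means every case has the same class."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure_node = ["related", "related", "related"]
mixed_node = ["related", "not related"]
print(gini(pure_node), gini(mixed_node))
```

The growing method prefers the split with the lowest `split_impurity`; a node with impurity 0 is "pure" and is not split further.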
67. Classification Table of Logistic Regression in Testing Data (N=170)

                          Predicted
Observed Value (Target)   Not Related   Related   Percentage Correct
Related                   22            48        68.5
Not Related               72            28        72
Average Percentage                                70.5

Results of Testing the Accuracy of Logistic Regression in Predicting Employability
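The percentages in the classification table can be recomputed from its counts. The sketch below assumes rows are observed values and columns are predicted values, which is the reading that reproduces the slide's figures.

```python
# Counts from the classification table: outer key = observed, inner key = predicted.
table = {
    "Related": {"Not Related": 22, "Related": 48},
    "Not Related": {"Not Related": 72, "Related": 28},
}

def percent_correct(table):
    """Return (per-class percent correct, overall percent correct)."""
    per_class = {}
    correct = total = 0
    for observed, predicted in table.items():
        row_total = sum(predicted.values())
        per_class[observed] = 100 * predicted[observed] / row_total
        correct += predicted[observed]
        total += row_total
    return per_class, 100 * correct / total

per_class, overall = percent_correct(table)
print(per_class, overall)
```

This gives about 68.57% for Related, 72% for Not Related, and about 70.59% overall across the 170 test cases, in line with the 68.5, 72, and 70.5 shown on the slide.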
68. Classification Table of CRT in Testing Data (N=70)

IT Classification             IT Specific Career   Correct Classification (%)   Error Rate (%)
1 (IT Software)               34                   23 (67.64)                   11 (32.35)
2 (IT Network/Sys/DB Admin)   25                   16 (64.00)                   9 (36.00)
3 (other IT-related field)    11                   5 (45.45)                    6 (54.54)

Results of Testing the Accuracy of CRT in Predicting Specific IT Field/Job to be Employed
69. • RapidMiner - https://rapidminer.com/
• BigML - https://bigml.com/
• Google Cloud AutoML - https://cloud.google.com/automl/
• Paxata - https://www.paxata.com/
• Trifacta - https://www.trifacta.com/
• MLBase - http://mlbase.org/
• Auto-WEKA - http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
• Driverless AI - https://www.h2o.ai/driverless-ai/
DATA SCIENCE AND MACHINE LEARNING TOOLS
FOR PEOPLE WHO DON’T KNOW PROGRAMMING
70. • Microsoft Azure Machine Learning Studio - https://studio.azureml.net/
• MLJar - https://mljar.com/
• Amazon Lex - https://aws.amazon.com/lex/
• IBM Watson Studio - https://www.ibm.com/cloud/watson-studio
• Automatic Statistician - https://www.automaticstatistician.com/index/
• KNIME - https://www.knime.com/
• FeatureLab - http://www.featurelab.co/
• MarketSwitch - http://www.experian.com/decision-analytics/marketswitch-optimization.html
• Logical Glue - http://www.logicalglue.com/
• Pure Predictive - http://www.purepredictive.com/
DATA SCIENCE AND MACHINE LEARNING TOOLS
FOR PEOPLE WHO DON’T KNOW PROGRAMMING
71. DO YOU THINK DATA
SCIENCE CAN DEVELOP
YOUR RESEARCH SKILLS?
AND HELP YOU DEVELOP
AMAZING RESEARCH?