Lec 1 integrating data science and data analytics in various research thrust

INTEGRATING DATA SCIENCE AND
DATA ANALYTICS IN VARIOUS
RESEARCH THRUSTS OF THE
UNIVERSITY
Menchita F. Dumlao, Ph.D.

MACHINE LEARNING
• "computer programs that automatically improve with experience."
• interdisciplinary in nature
• employs techniques from the fields of computer science, statistics, and
artificial intelligence, among others.
• algorithms which facilitate automatic improvement from
• machine learning is a central aspect of data science.
• pattern recognition
• Machine learning has a complex relationship with data mining.

WHAT DOES
FACEBOOK DO TO
YOUR DATA?
• learning what consumers prefer
• emotional contagion study
• Cookies on your browser predict who you
are
• Social plugins ("like", "subscribe" or
"recommend" buttons)
• information that Facebook sells to advertisers

We’ve agreed to turn over a huge
amount of data, signing off on the
social network’s seemingly
limitless ability to do with it
whatever it wants.

FACEBOOK DATA SCIENCE
crawled or scraped data will be valuable and
constructive for commercial, scientific, and many
other fields of prediction and analysis

FACEBOOK’S DATA PRIVACY
POLICY:
• …in addition to helping people see and find things that you do and share,
we may use the information we receive about you… for internal operations,
including troubleshooting, data analysis, testing, research and service
improvement.

OCTOPARSE
Octoparse is a powerful web scraper that
can scrape both static and dynamic
websites that use AJAX, JavaScript, cookies,
etc.

https://www.octoparse.com/blog/facebook-data-mining/

VISUAL SCRAPER
• Visual Scraper is another great free
web scraper with a simple point-and-click
interface
• collects data from the web
• exports the extracted data as CSV,
XML, JSON or SQL files
• scrapes data from up to 50,000 web
pages for a single user

FACEBOOK DATA SCIENCE USING R
• R is a programming language and
software environment for
statistical computing, widely used
for data mining and for analyzing
big data.
• Data science in FB
using R.pdf

• The Rfacebook package provides an interface
to the Facebook API. For mining Facebook
using R, its functions allow R to access
Facebook’s API to get information about posts,
comments, likes, groups that mention
specific keywords, and much more.

But: it is not much different from
what we, especially statisticians,
have been doing for many years

Much more data is digitally available than
was before
Inexpensive computing + Cloud + Easy-to-
use programming frameworks = Much
easier to analyze it
Often: large-scale data + simple
algorithms > small data + complex
algorithms
Changes how you do analysis
dramatically

• Causation --> Correlation: the goal of
analysis is often to figure out what
caused what, but causation is very hard
to figure out
• What causes breast cancer and other diseases
Data science correlates what causes things to happen:
• When an earthquake will come
• Why students fail or pass the board exam
• Who gets a job after graduation, and why
Using data understanding and computer science algorithms

"Datafication":
•Process of converting abstract
things into concrete data e.g.,
what you like represented as a
stream of your likes;
•your "sitting posture" captured
using 100's of sensors placed in
a car seat

• Google Flu Trends
• Early warning of flu outbreaks by analyzing search
queries
• Up to 1 or 2 weeks ahead of CDC
• Analyzed 50M search queries to see which of them fit
the physician visits for flu
• 45 search terms used to create a single model

DATA SCIENCE PROJECTS:
ALGORITHMS, SIMULATION
AND APPLICATIONS

DATA SCIENCE PROJECTS
•Determining Rice Bug Epidemic Using Decision
Trees
•Prediction Model for Students’ Performance in
Java Programming with Course-content
Recommendation System
•Predicting IT Employability Using Data Mining
Techniques

DETERMINING RICE BUG EPIDEMIC USING
DECISION TREES
• Roland Calderon, Menchita Dumlao et al. (2016)
• data mining techniques in agriculture for predicting future trends such as
a bug epidemic
• Insect Epidemiology Data Mining (IEDM)
• IEDM - Discrete Mathematics and Theoretical Computer Science (DIMACS),
which aims to provide an opportunity to develop and test problem instances
and other methods of testing and comparing the performance of algorithms

DECISION TREES
• uses a decision tree
• classification and prediction
• represents rules
• CRISP-DM methodology
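
To illustrate how a decision tree represents rules, here is a minimal Python sketch. The tree itself, the attribute names (`stage`, `temperature`), the threshold of 32, and the class labels are hypothetical and chosen only to show that each root-to-leaf path reads as one if-then rule. It is not the model built in the study.

```python
# A tiny hand-built decision tree (hypothetical attributes and threshold).
# Internal nodes test one attribute; leaves hold a class label.
TREE = {
    "attribute": "stage",
    "branches": {
        "vegetative": {"label": "outbreak"},
        "ripening": {
            "attribute": "temperature",
            "threshold": 32,  # hypothetical split point
            "low": {"label": "outbreak"},
            "high": {"label": "infested"},
        },
    },
}


def classify(node, record):
    """Walk from the root to a leaf and return the leaf's label."""
    while "label" not in node:
        value = record[node["attribute"]]
        if "threshold" in node:  # numeric test
            node = node["low"] if value <= node["threshold"] else node["high"]
        else:  # categorical test
            node = node["branches"][value]
    return node["label"]


def rules(node, conditions=()):
    """List every root-to-leaf path as an if-then rule."""
    if "label" in node:
        return ["IF " + " AND ".join(conditions or ("TRUE",)) + " THEN " + node["label"]]
    attr = node["attribute"]
    if "threshold" in node:
        t = node["threshold"]
        return (rules(node["low"], conditions + (f"{attr} <= {t}",))
                + rules(node["high"], conditions + (f"{attr} > {t}",)))
    out = []
    for value, child in node["branches"].items():
        out += rules(child, conditions + (f"{attr} = {value}",))
    return out


for rule in rules(TREE):
    print(rule)
```

Classification and rule extraction use the same structure: `classify` follows one path for a given record, while `rules` enumerates all paths.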

• Rice Field Insect Light Trap (RFILT): mass-traps both sexes
of insect pests
• used for insect distribution, abundance, flight patterns, and timing of the
application of pesticide

• forecasting precision of a predictive model: confusion
matrix
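
A confusion matrix simply counts how often each actual class was predicted as each class. The Python sketch below shows the idea; the label lists are hypothetical stand-ins, not data from the study, and the helper names are illustrative.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how many times each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Share of cases on the diagonal (actual == predicted)."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())


# Hypothetical labels, for illustration only
actual = ["outbreak", "outbreak", "infested", "infested", "outbreak", "infested"]
predicted = ["outbreak", "infested", "infested", "infested", "outbreak", "outbreak"]

matrix = confusion_matrix(actual, predicted)
labels = sorted(set(actual) | set(predicted))
for a in labels:
    # One row per actual class, one column per predicted class
    print(a, [matrix[(a, p)] for p in labels])
print("accuracy:", accuracy(matrix))
```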

• Lunar Cycle level is the best predictor of epidemic status,
followed by Vegetative level
• At the Vegetative stage level, 100% resulted in outbreak status

• For the Ripening stage, the next best predictor is temperature.
• Over 82% of bugs occurred in the outbreak status if the temperature is less than or
equal to 32 to 38
• 97.3% if the temperature is greater than 32.
• For the Reproduction and Resting stage, 52.7% of bugs occurred in the infested
status, and this is also considered a terminal node.

Prediction Model for Students’ Performance in Java
Programming with Course-content Recommendation
System
• Evale, Digna, Menchita F. Dumlao, et al. (2016)
• Comparative analysis among different data mining algorithms for attribute
selection and classification
• a two-phase study which aimed to predict the students’ performance in
Java Programming and to generate recommendations

• Knowledge Discovery in Databases (KDD)
• Logistic Regression and Correlation-based Feature Selection were used for
finding significant predictors
• Classifiers such as CHAID, Exhaustive CHAID, CRT, QUEST, J48, BayesNet,
NaïveBayes and JRip were implemented

• J48 has the highest prediction accuracy.
• For the second phase, evolutionary prototyping was implemented
• Ruby on Rails: a web-based examination module that determines the
students’ index of learning style and assesses their prior knowledge in Java

• The system automatically generates a course-content recommendation that
presents the learners’ strengths and weaknesses in the subject, together with
a suggested method of learning based on their learning style.

• KDD: selection, pre-processing, transformation, mining, and interpretation.
• Selection - possible attributes are collected for the data set.
• Pre-processing - filtering and removing irrelevant data.
• Transformation - determining the most suitable data mining technique to
provide the best prediction algorithm.
• Mining - discovering the patterns captured through classification rules,
regression models or decision trees.
• Evaluation or interpretation - visualizing the patterns extracted from the
models.

• Tools: Waikato Environment for Knowledge Analysis (WEKA) data mining tool and
IBM Statistical Package for the Social Sciences (SPSS).
• There were 8 attributes, namely gender, age, course, section, schedule, and 3
academic performance grades in programming (Programming 1, 2, and 3).

• Attribute selection was done using Standard Regression Analysis, Forward
and Backward Conditional Regression, Likelihood Ratio, and Wald
• WEKA was also used to conduct pre-processing through filtering by
AttributeSelection

• Summary of Attribute Selection Result

• 2 insignificant attributes were found out of the eight original attributes
• Critical p-value of .05 (significant predictors should have a p-value smaller
than this critical value)
• Binary Logistic Regression (SPSS)
• section and course were highly insignificant, with p-values of .747 and .221
respectively.
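
The screening rule above can be sketched in a few lines of Python: an attribute is kept only if its p-value is below the critical value. The two p-values are the ones stated on the slide; the function name `split_by_significance` is illustrative, and the p-values of the other six attributes are not given in the slides, so they are left out.

```python
ALPHA = 0.05  # critical p-value stated on the slide

# p-values reported for the two attributes named on the slide
p_values = {"section": 0.747, "course": 0.221}


def split_by_significance(p_values, alpha=ALPHA):
    """Return (significant, insignificant) attribute lists."""
    significant = sorted(a for a, p in p_values.items() if p < alpha)
    insignificant = sorted(a for a, p in p_values.items() if p >= alpha)
    return significant, insignificant


significant, insignificant = split_by_significance(p_values)
print("keep:", significant)
print("drop:", insignificant)
```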

• Pre-processing using attribute selection (SPSS and WEKA)
• course and section were automatically removed (highly insignificant)

• CfsSubsetEvaluation - to further verify the significance of the attribute gender
• BestFirst method - gender was found significant, with a merit of best subset
of 0.239 (on a scale of 0 to 1; incorrectly classified instances)
• 76.1% correctly classified instances

• GreedyStepWise search method (through Cross Validation)
• course and section were not found in any of the ten folds, while gender
appeared in 7 out of 10 folds (70%).
• significant predictors: age, gender, schedule, grade in Programming 1,
grade in Programming 2, and grade in Programming 3.
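
The fold-based check described above amounts to counting in how many cross-validation folds each attribute was selected. The Python sketch below shows that tally. The per-fold selections are hypothetical, arranged only so that gender appears in 7 of 10 folds and course and section in none, as reported; the short names `prog1` to `prog3` are placeholders for the three programming grades.

```python
from collections import Counter

ALL_ATTRIBUTES = ["gender", "age", "course", "section", "schedule",
                  "prog1", "prog2", "prog3"]

# Hypothetical per-fold selections (not the study's actual fold output)
always = {"age", "schedule", "prog1", "prog2", "prog3"}
folds = [always | {"gender"} for _ in range(7)] + [set(always) for _ in range(3)]

# Count the number of folds in which each attribute was selected
counts = Counter(attr for selected in folds for attr in selected)

for attr in ALL_ATTRIBUTES:
    share = 100 * counts[attr] / len(folds)
    print(f"{attr:<9} {counts[attr]:>2} of {len(folds)} folds ({share:.0f}%)")
```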

Summary of Accuracy of Different Algorithms tested

• J48 is the best algorithm
• J48 has the highest accuracy in making predictions
• It also has the highest Cohen’s Kappa value, which means that the prediction
is strongly reliable, with 64% to 81% reliability
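
Cohen's Kappa compares the observed agreement between actual and predicted classes with the agreement expected by chance: kappa = (observed - expected) / (1 - expected). A minimal Python sketch follows; the 2x2 counts are hypothetical and are not the J48 results from the study.

```python
def cohens_kappa(matrix):
    """matrix[i][j] = number of cases with actual class i and predicted class j."""
    total = sum(sum(row) for row in matrix)
    k = len(matrix)
    # Observed agreement: share of cases on the diagonal
    observed = sum(matrix[i][i] for i in range(k)) / total
    # Chance agreement: product of row and column marginals, summed over classes
    expected = sum(
        (sum(matrix[i]) / total) * (sum(row[i] for row in matrix) / total)
        for i in range(k)
    )
    return (observed - expected) / (1 - expected)


# Hypothetical counts, for illustration only
matrix = [[40, 10],
          [5, 45]]
print(round(cohens_kappa(matrix), 3))
```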

Predicting IT Employability Using Data Mining Techniques
• Piad, Keno, Menchita F. Dumlao, et al. (2016)
• Knowledge discovery in databases (KDD)
• CRISP-DM (Cross-Industry Standard Process for Data Mining)
• Naive Bayes
• Decision Tree
• Ensemble

• pre-processing
• data sets : training and testing data sets
• training datasets: used to generate model
• testing datasets: used to determine the acceptability of the model.
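
Splitting the data into training and testing sets can be sketched as follows in Python. The 25% test fraction, the fixed seed, and the stand-in records are assumptions made for illustration; the slides do not state how the split was performed.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (training, testing)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 records
training, testing = train_test_split(rows)
print(len(training), "training rows,", len(testing), "testing rows")
```

The model is generated from `training` only; `testing` is held back so that the accuracy estimate is based on records the model has not seen.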

• Apriori algorithm - determines
associated attributes that frequently
occur together in the data sets
• decision tree and naive Bayes
algorithms - used to design the
predictive model
• predictive model = equation or rule
sets for prediction
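
The frequent-itemset part of the Apriori algorithm can be sketched in plain Python as below. The records, the attribute names (`cgpa`, `internship`, `employed`), and the minimum support of 0.5 are hypothetical; the sketch covers only the search for frequently co-occurring attribute values, not rule generation.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical records, each a set of attribute=value items
records = [
    {"cgpa=high", "internship=yes", "employed=yes"},
    {"cgpa=high", "internship=yes", "employed=yes"},
    {"cgpa=low", "internship=no", "employed=no"},
    {"cgpa=high", "internship=no", "employed=yes"},
]
frequent = apriori(records, min_support=0.5)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```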

• Rule set or equation learning instances of the testing sets
• WEKA and SPSS
• graduate tracer
• student’s biographic profile
• cumulative grade point average (CGPA)
• 685 instances (tuples), SY 2011-2015
• training and testing sets of data

Algorithm             Accuracy Result (%)   Error Estimation Rate (%)
Naive Bayes           75.33                 24.47
J48                   74.95                 25.05
SimpleCart            73.01                 26.99
Logistic regression   78.4                  22.60
CHAID                 76.3                  23.70
Accuracy Result in Predicting IT Graduate Employability
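
Assuming the error estimation rate is the complement of accuracy (100 minus the accuracy percentage), the arithmetic can be sketched in Python as below. The figures are the ones reported in the table; the function name `error_rate` is illustrative. Rows whose reported pair does not sum to 100 are flagged rather than corrected, since the slides do not say which figure is the intended one.

```python
# (algorithm, accuracy %, error rate %) as reported in the table
reported = [
    ("Naive Bayes", 75.33, 24.47),
    ("J48", 74.95, 25.05),
    ("SimpleCart", 73.01, 26.99),
    ("Logistic regression", 78.4, 22.60),
    ("CHAID", 76.3, 23.70),
]


def error_rate(accuracy_pct):
    """Error rate as the complement of accuracy, both in percent."""
    return round(100 - accuracy_pct, 2)


# Rows where the reported error rate differs from 100 - accuracy
mismatched = [name for name, acc, err in reported
              if abs(error_rate(acc) - err) >= 0.005]

for name, acc, err in reported:
    flag = "  <- reported pair does not sum to 100" if name in mismatched else ""
    print(f"{name:<20} {acc:>6.2f} {error_rate(acc):>6.2f}{flag}")

# Algorithm with the highest reported accuracy
best = max(reported, key=lambda row: row[1])[0]
print("highest reported accuracy:", best)
```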

• Logistic regression measures the relationship between a categorical
dependent variable and one or more independent variables by estimating
probabilities using a logistic function
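
The logistic function maps any real-valued score to a probability between 0 and 1, which is then thresholded to obtain a class. A minimal Python sketch follows; the weights, bias, feature values, and the 0.5 cut-off are hypothetical and are not coefficients from the study.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical coefficients and one hypothetical record
weights = [1.2, -0.8]
bias = -0.5
p = predict_probability([1.0, 0.5], weights, bias)
label = "employed" if p >= 0.5 else "not employed"
print(round(p, 3), label)
```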

Algorithm          Accuracy Result (%)   Error Estimation Rate (%)
CHAID              70.1                  29.9
QUEST              40                    60
CRT                70.2                  29.8
Exhaustive CHAID   70.1                  29.9
ID3                67                    33
J48                70                    30
Accuracy Result in Predicting IT Specific Profession

• CRT: Classification and Regression Trees.
• CRT splits the data into segments that are as
homogeneous as possible with respect to the
dependent variable.
• A node in which all cases have the same value
for the dependent variable is a homogeneous,
"pure" node.

• The CRT growing method attempts to maximize within-node homogeneity.
• The extent to which a node does not represent a homogeneous subset of cases
is an indication of impurity.
• A terminal node in which all cases have the same value for the dependent
variable is a homogeneous node that requires no further splitting because it is
“pure.”
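
Node impurity can be made concrete with the Gini index, one common impurity measure for classification trees; the slides do not say which measure was used, so Gini is an assumption here. In this Python sketch a pure node scores 0 and a 50/50 node scores 0.5. The labels and helper names are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical nodes
pure = ["related"] * 4
mixed = ["related", "related", "not related", "not related"]

print("pure node: ", gini(pure))
print("mixed node:", gini(mixed))
print("split:     ", split_impurity(pure, mixed))
```

A tree-growing method of this kind would prefer the split with the lowest weighted impurity, and would stop splitting a node once its impurity reaches 0.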

Observed Value (Target)   Predicted Not Related   Predicted Related   Percentage Correct
Related                   22                      48                  68.5
Not Related               72                      28                  72
Average Percentage                                                    70.5
Classification Table of Logistic Regression in Testing Data (N=170)
Results of Testing the Accuracy of Logistic Regression in Predicting Employability
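
The percentages in the classification table can be recomputed from its counts. The Python sketch below assumes that the first count in each row is "predicted Not Related" and the second is "predicted Related", which is the reading under which the counts add up to N=170. The function name `percent_correct` is illustrative.

```python
# Rows: observed class; columns: predicted class (counts from the table)
table = {
    "Related": {"Not Related": 22, "Related": 48},
    "Not Related": {"Not Related": 72, "Related": 28},
}


def percent_correct(observed):
    """Share of one observed class that was predicted correctly, in percent."""
    row = table[observed]
    return 100 * row[observed] / sum(row.values())


total = sum(sum(row.values()) for row in table.values())
correct = sum(table[c][c] for c in table)
overall = 100 * correct / total

for c in table:
    print(f"{c:<12} {percent_correct(c):.2f}%")
print(f"overall      {overall:.2f}%  (N={total})")
```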

IT Classification             IT Specific Career   Correct Classification (%)   Error Rate (%)
1 (IT Software)               34                   23 (67.64)                   11 (32.35)
2 (IT Network/Sys/DB Admin)   25                   16 (64.00)                   9 (36.00)
3 (Other IT-related field)    11                   5 (45.45)                    6 (54.54)
Classification Table of CRT in Testing Data (N=70)
Results of Testing the Accuracy of CRT in Predicting Specific IT Field/Job to be Employed
DATA SCIENCE AND MACHINE LEARNING TOOLS
FOR PEOPLE WHO DON’T KNOW PROGRAMMING
• RapidMiner - https://rapidminer.com/
• BigML - https://bigml.com/
• Google Cloud AutoML - https://cloud.google.com/automl/
• Paxata - https://www.paxata.com/
• Trifacta - https://www.trifacta.com/
• MLBase - http://mlbase.org/
• Auto-WEKA - http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
• Driverless AI - https://www.h2o.ai/driverless-ai/

• Azure Machine Learning Studio - https://studio.azureml.net/
• MLJar - https://mljar.com/
• Amazon Lex - https://aws.amazon.com/lex/
• IBM Watson Studio - https://www.ibm.com/cloud/watson-studio
• Automatic Statistician - https://www.automaticstatistician.com/index/
• KNIME - https://www.knime.com/
• FeatureLab - http://www.featurelab.co/
• MarketSwitch - http://www.experian.com/decision-analytics/marketswitch-optimization.html
• Logical Glue - http://www.logicalglue.com/
• Pure Predictive - http://www.purepredictive.com/

DO YOU THINK DATA
SCIENCE CAN DEVELOP
YOUR RESEARCH SKILLS?
AND HELP YOU DEVELOP
AMAZING RESEARCH?
