INTEGRATING DATA SCIENCE AND
DATA ANALYTICS IN VARIOUS
RESEARCH THRUSTS OF THE
UNIVERSITY
Menchita F. Dumlao, Ph.D.
DATA SCIENCE IS FOR BIG DATA
MACHINE LEARNING
• "computer programs that automatically improve with experience."
• interdisciplinary in nature
• employs techniques from the fields of computer science, statistics, and
artificial intelligence, among others.
• algorithms which facilitate automatic improvement from experience
• machine learning is a central aspect of data science.
• pattern recognition
• Machine learning has a complex relationship with data mining.
DATA SCIENCE: DAY TO DAY
WHAT DOES
FACEBOOK DO TO
YOUR DATA?
• learning what consumers prefer
• emotional contagion study
• Cookies in your browser predict who you
are
• Social plugins ("like", "subscribe" or
"recommend" buttons)
• information that Facebook sells to advertisers
we’ve agreed to a huge
amount of data being turned
over and signing off on the
social network’s seemingly
limitless ability to do with it
whatever it wants
FACEBOOK DATA SCIENCE
crawled or scraped data will be valuable and
constructive for commercial, scientific, and many
other fields of prediction and analysis
FACEBOOK’S DATA PRIVACY
POLICY:
• …in addition to helping people see and find things that you do and share,
we may use the information we receive about you… for internal operations,
including troubleshooting, data analysis, testing, research and service
improvement.
OCTOPARSE
Octoparse is a powerful web scraper that
can scrape both static and dynamic
websites with AJAX, JavaScript, cookies,
etc.
https://www.octoparse.com/blog/facebook-data-mining/
VISUAL SCRAPER
• Visual Scraper is another great free
web scraper with simple point-and-
click interface
• collect data from the web
• export the extracted data as CSV,
XML, JSON or SQL files.
• scrape data from up to 50,000 web
pages for only one
user.
http://www.visualscraper.com/
FACEBOOK DATA SCIENCE USING R
• R is a programming language and software
environment used for data mining and for
analyzing big data.
• Data science in FB
using R.pdf
• The Rfacebook package provides an interface
to the Facebook API. For mining Facebook
using R, it provides functions that allow R
to access Facebook’s API to get information
about posts, comments, likes, groups that
mention specific keywords, and much more.
But: it is not much different from
what we, especially statisticians,
have been doing for many years
Much more data is digitally available than
was before
Inexpensive computing + Cloud + Easy-to-
use programming frameworks = Much
easier to analyze it
Often: large-scale data + simple
algorithms > small data + complex
algorithms
Changes how you do analysis
dramatically
• Causation --> Correlation: the goal of
analysis is often to figure out what
caused what, but causation is very hard
to figure out
  • What causes breast cancer and other diseases
• Data science looks for correlations that indicate what makes things happen:
  • When an earthquake will come
  • Why students fail or pass the board exam
  • Who gets a job after graduation, and why
• Uses data understanding and computer science algorithms
"Datafication":
•Process of converting abstract
things into concrete data e.g.,
what you like represented as a
stream of your likes;
•your "sitting posture" captured
using 100's of sensors placed in
a car seat
• Google Flu Trends
• Early warning of flu outbreaks by analyzing search
queries
• Up to 1 or 2 weeks ahead of CDC
• Analyzed 50M search queries to see which of them fit
the physician visits for flu
• 45 search terms used to create a single model
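The query-selection idea on this slide can be sketched as ranking candidate search terms by how closely their weekly counts track physician visits for flu. This is only a minimal sketch, not Google's actual method: the term names, the weekly numbers, and the helper names `pearson` and `rank_terms` are all made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly data (illustration only, not real figures).
flu_visits = [10, 12, 20, 35, 30, 18]
query_counts = {
    "flu symptoms": [100, 130, 210, 340, 310, 170],
    "fever remedy": [80, 95, 150, 260, 240, 140],
    "movie times": [500, 480, 510, 490, 505, 495],
}

def rank_terms(counts_by_term, target, top_k):
    """Return the top_k terms whose counts correlate best with the target series."""
    scored = [(pearson(counts, target), term) for term, counts in counts_by_term.items()]
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_k]]

best_terms = rank_terms(query_counts, flu_visits, top_k=2)
print(best_terms)
```

The selected terms would then feed a single predictive model; that modelling step is left out here.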
DATA SCIENCE PROJECTS:
ALGORITHMS, SIMULATION
AND APPLICATIONS
DATA SCIENCE PROJECTS
•Determining Rice Bug Epidemic Using Decision
Trees
•Prediction Model for Students’ Performance in
Java Programming with Course-content
Recommendation System
•Predicting IT Employability Using Data Mining
Techniques
DETERMINING RICE BUG EPIDEMIC USING
DECISION TREES
• Roland Calderon, Menchita Dumlao et al. (2016)
• data mining techniques in agriculture for predicting future trends such as
bug epidemic.
• Insect Epidemiology Data Mining (IEDM).
• IEDM - Discrete Mathematics and Theoretical Computer Science (DIMACS)
that aims to provide an opportunity to develop and test problem instances
and other methods of testing and comparing performance of algorithms
Data Science Projects:
• uses a decision tree
• classification and prediction
• represents rules
• CRISP-DM methodology
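A decision tree picks, at each node, the attribute that best separates the classes. One common criterion is information gain, used by ID3-style learners. The slides do not state which split criterion the study used, so the following is only a sketch, and the records and attribute names in it are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical records (attribute names and values are for illustration only).
rows = [
    {"stage": "vegetative", "temp": "high", "status": "outbreak"},
    {"stage": "vegetative", "temp": "low", "status": "outbreak"},
    {"stage": "ripening", "temp": "high", "status": "outbreak"},
    {"stage": "ripening", "temp": "low", "status": "infested"},
    {"stage": "resting", "temp": "high", "status": "infested"},
    {"stage": "resting", "temp": "low", "status": "infested"},
]

gains = {a: information_gain(rows, a, "status") for a in ("stage", "temp")}
best_attribute = max(gains, key=gains.get)
print(best_attribute, gains)
```

The attribute with the largest gain becomes the root split, and each path from root to leaf reads as one classification rule.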
• Rice Field Insect Light Trap (RFILT) mass traps both the sexes
of insect pests
• insect distribution, abundance, flight patterns, timing of the
application of pesticide
• forecasting precision of a predictive model: confusion
matrix
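A confusion matrix counts how often each actual class was predicted as each class. A minimal sketch follows, with hypothetical labels; the function name `confusion_matrix` is an illustrative choice and not part of any tool named in the slides.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels, for illustration only.
actual = ["outbreak", "outbreak", "infested", "infested", "outbreak"]
predicted = ["outbreak", "infested", "infested", "infested", "outbreak"]

matrix = confusion_matrix(actual, predicted)
# Accuracy is the share of pairs where the prediction matches the actual label.
correct = sum(count for (a, p), count in matrix.items() if a == p)
accuracy = correct / len(actual)
print(dict(matrix), accuracy)
```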
• Lunar Cycle level is the best predictor of epidemic status
• followed by Vegetative level
• At the Vegetative stage level, 100% of cases resulted in outbreak status
• For the Ripening stage, the next best predictor is temperature.
• Over 82% of bugs occurred in the outbreak status if the temperature is less than or
equal to 32 to 38
• 97.3% if the temperature is greater than 32.
• For the Reproduction and Resting stages, 52.7% of bugs occurred in the infested
status, and this is also considered a terminal node.
Prediction Model for Students’ Performance in Java
Programming with Course-content Recommendation
System
• Evale, Digna, Menchita F. Dumlao, et al. (2016)
• Comparative analysis among different data mining algorithms for attribute
selection and classification
• a two-phase study which aimed to predict the students’ performance in
Java Programming and to generate recommendations
• Knowledge Discovery in Databases (KDD)
• Logistic Regression and Correlation-based Feature Selection were used for
finding significant predictors
• Classifiers such as CHAID, Exhaustive CHAID, CRT, QUEST, J48, BayesNet,
NaïveBayes and JRip were implemented
• J48 has the highest prediction accuracy.
• For the second phase, evolutionary prototyping was implemented
• Ruby on Rails: a web-based examination module that determines the
students’ index of learning style and assesses their prior knowledge in Java
• A course-content recommendation presenting the learners’ strengths and
weaknesses in the subject with suggested method of learning style will be
automatically generated by the system.
• KDD: selection, pre-processing, transformation, mining, and interpretation.
• Selection: possible attributes are collected for the data set.
• Pre-processing: filtering and removing irrelevant data.
• Transformation: determining the most suitable data mining technique to
provide the best prediction algorithm.
• Mining: discovering the patterns captured through classification rules,
regression models, or decision trees.
• Evaluation or interpretation: visualizing the patterns extracted from the models.
• Waikato Environment for Knowledge Analysis (WEKA) data mining tool and
IBM Statistical Package for the Social Science (SPSS).
• There were 8 attributes namely gender, age, course, section, schedule and 3
academic performance for programming languages.
• Attribute selection was done using Standard Regression Analysis, Forward
and Backward Conditional Regression, Likelihood Ratio, and Wald
• WEKA was also used to conduct pre-processing through filtering by
AttributeSelection
• Summary of Attribute Selection Result
• 2 insignificant attributes out of the eight original attributes
• With a critical p value of .05 (significant predictors should have a smaller
p value),
• Binary Logistic Regression (SPSS)
• section and course were highly insignificant, with p values of .747 and .221
respectively.
• Pre-processing using attribute selection (SPSS and WEKA)
• course and section were automatically removed (highly insignificant)
• CfsSubsetEval: used to further verify the significance of the attribute gender
• BestFirst method: gender was found significant, with a merit of best subset of
0.239 (0 to 1, incorrectly classified instances)
• 76.1% of instances correctly classified
• GreedyStepwise search method (through cross-validation)
• course and section were not found in any of the ten folds, while gender
appeared in 7 out of 10 folds (70%).
• significant predictors: age, gender, schedule, grade in Programming 1,
grade in Programming 2, and grade in Programming 3.
Summary of Accuracy of Different Algorithms tested
• J48 is the best algorithm
• J48 has highest accuracy in making predictions
• Also has the highest Cohen’s Kappa value which means that the prediction
is strongly reliable with 64% to 81% reliability
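Cohen's Kappa measures agreement between predicted and actual classes after correcting for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement. A minimal sketch follows; the 2x2 matrix in it is hypothetical and is not the study's data.

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)

# Hypothetical 2x2 confusion matrix (e.g. pass/fail), for illustration only.
matrix = [[40, 10],
          [5, 45]]
kappa = cohens_kappa(matrix)
print(round(kappa, 2))
```

A kappa of 0 means agreement no better than chance, and 1 means perfect agreement.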
Predicting IT Employability Using Data Mining Techniques
• Piad, Keno, Menchita F. Dumlao, et al. (2016)
• Knowledge discovery in databases (KDD)
• CRISP-DM (Cross-Industry Standard Process for Data Mining)
• Naive Bayes
• Decision Tree
• Ensemble
• pre-processing
• data sets : training and testing data sets
• training datasets: used to generate model
• testing datasets: used to determine the acceptability of the model.
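The training/testing division can be sketched as a seeded shuffle followed by a cut. The 25% test fraction, the seed, and the stand-in record IDs below are assumptions for illustration; the slides do not state how the study partitioned its data.

```python
import random

def train_test_split(records, test_fraction, seed):
    """Shuffle a copy of the records and split it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # stand-in record IDs (hypothetical)
train, test = train_test_split(records, test_fraction=0.25, seed=42)
print(len(train), len(test))
```

The model is fitted on `train` only, so that accuracy measured on `test` reflects records the model has not seen.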
• Apriori algorithm: determines associated
attributes that frequently occur together
in the data sets
• Decision tree and naive Bayes
algorithms: used to design the
predictive model
• Predictive model = equation or rule
sets for prediction
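The Apriori idea is a level-wise search: count single attributes, keep those meeting a minimum support, then build larger candidate sets only from sets already found frequent. The following is a minimal sketch of that search rather than the tool the study used; the graduate records and the support threshold are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search: return {itemset: support count} for frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent sets into candidates one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent

# Hypothetical graduate records (attribute values are for illustration only).
transactions = [
    {"high_cgpa", "employed"},
    {"high_cgpa", "employed", "it_job"},
    {"low_cgpa", "employed"},
    {"high_cgpa", "it_job", "employed"},
    {"low_cgpa"},
]
result = frequent_itemsets(transactions, min_support=3)
print(result)
```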
Rule set or equation learning instances of the testing sets
WEKA AND SPSS
graduate tracer
student’s biographic profile
cumulative grade point average (CGPA)
685 instances (tuples) SY 2011-2015
training and testing sets of
data.
Accuracy Result in Predicting IT Graduate Employability
Algorithm             Accuracy Result (%)   Error Estimation Rate (%)
Naive Bayes           75.33                 24.47
J48                   74.95                 25.05
SimpleCart            73.01                 26.99
Logistic regression   78.4                  22.60
Chaid                 76.3                  23.70
• Logistic regression measures the relationship between the categorical
dependent variable and one or more independent variables by estimating
probabilities using a logistic function
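The logistic function maps a weighted sum of predictors to a probability between 0 and 1, which is then thresholded to give a class. A minimal sketch follows; the coefficients, the single record, and the 0.5 threshold are hypothetical and are not the fitted model from the study.

```python
from math import exp

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_probability(features, weights, intercept):
    """Probability of the positive class for one record."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical coefficients and one record, for illustration only.
weights = [0.8, -0.5]
intercept = -1.0
probability = predict_probability([3.0, 1.0], weights, intercept)
label = "employed" if probability >= 0.5 else "not employed"
print(round(probability, 3), label)
```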
Accuracy Result in Predicting IT Specific Profession
Algorithm          Accuracy Result (%)   Error Estimation Rate (%)
Chaid              70.1                  29.9
Quest              40                    60
CRT                70.2                  29.8
Exhaustive Chaid   70.1                  29.9
ID3                67                    33
J48                70                    30
• CRT: Classification and Regression Trees.
• CRT splits the data into segments that are as
homogeneous as possible with respect to the dependent variable.
• A terminal node in which all cases have the same value for the
dependent variable is a homogeneous,
"pure" node.
• The CRT growing method attempts to maximize within-node homogeneity.
• The extent to which a node does not represent a homogeneous subset of cases
is an indication of impurity.
• A terminal node in which all cases have the same value for the dependent
variable is a homogeneous node that requires no further splitting because it is
“pure.”
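Node impurity can be made concrete with the Gini index, a common impurity measure for classification trees. The slides do not say which measure the study used, so Gini is an assumption here, and the node contents are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for more mixed nodes."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

# Hypothetical node contents, for illustration only.
pure_node = ["software"] * 4
mixed_node = ["software", "network", "software", "network"]
split_impurity = weighted_gini(["software", "software"], ["network", "network"])
print(gini(pure_node), gini(mixed_node), split_impurity)
```

A split is preferred when its weighted impurity is lower than the parent node's. Here the mixed node separates into two pure children, so no further splitting is needed.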
Classification Table of Logistic Regression in Testing Data (N=170)
Results of Testing the Accuracy of Logistic Regression in Predicting Employability
Observed Value (Target)   Predicted: Not Related   Predicted: Related   Percentage Correct
Related                   22                       48                   68.5
Not Related               72                       28                   72
Average Percentage                                                      70.5
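The percentages in the classification table above follow from its counts: each row's correct predictions divided by the row total. A short sketch of that arithmetic, using the four counts from the table; the variable and function names are illustrative.

```python
# Counts from the classification table (outer key: observed, inner key: predicted).
table = {
    "Related": {"Not Related": 22, "Related": 48},
    "Not Related": {"Not Related": 72, "Related": 28},
}

def percent_correct(table):
    """Percentage of each observed class that was predicted correctly."""
    return {
        observed: 100 * row[observed] / sum(row.values())
        for observed, row in table.items()
    }

per_class = percent_correct(table)
total = sum(sum(row.values()) for row in table.values())
overall = 100 * sum(table[c][c] for c in table) / total
print(per_class, total, round(overall, 2))
```

The counts sum to the stated N=170, and the computed values (about 68.57, 72.0, and 70.59 overall) are close to the 68.5, 72, and 70.5 shown on the slide.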
Classification Table of CRT in Testing Data (N=70)
Results of Testing the Accuracy of CRT in Predicting Specific IT Field/Job to be Employed
IT Classification             IT Specific Career   Correct Classification   Error Rate
1 (IT Software)               34                   23 (67.64)               11 (32.35)
2 (IT Network/Sys/DB Admin)   25                   16 (64.00)               9 (36.00)
3 (other IT related field)    11                   5 (45.45)                6 (54.54)
DATA SCIENCE AND MACHINE LEARNING TOOLS
FOR PEOPLE WHO DON’T KNOW PROGRAMMING
• RapidMiner - https://rapidminer.com/
• BigML - https://bigml.com/
• Google Cloud AutoML - https://cloud.google.com/automl/
• Paxata - https://www.paxata.com/
• Trifacta - https://www.trifacta.com/
• MLBase - http://mlbase.org/
• Auto-WEKA - http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
• Driverless AI - https://www.h2o.ai/driverless-ai/
• Azure Machine Learning Studio - https://studio.azureml.net/
• MLJar - https://mljar.com/
• Amazon Lex - https://aws.amazon.com/lex/
• IBM Watson Studio - https://www.ibm.com/cloud/watson-studio
• Automatic Statistician - https://www.automaticstatistician.com/index/
• KNIME - https://www.knime.com/
• FeatureLab - http://www.featurelab.co/
• MarketSwitch - http://www.experian.com/decision-analytics/marketswitch-optimization.html
• Logical Glue - http://www.logicalglue.com/
• Pure Predictive - http://www.purepredictive.com/
DO YOU THINK DATA
SCIENCE CAN DEVELOP
YOUR RESEARCH SKILLS?
AND HELP YOU DEVELOP
AMAZING RESEARCH?

SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numberssuginr1
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 

Último (20)

7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 

Lec 1 integrating data science and data analytics in various research thrust

  • 1. INTEGRATING DATA SCIENCE AND DATA ANALYTICS IN VARIOUS RESEARCH THRUSTS OF THE UNIVERSITY Menchita F. Dumlao, Ph.D.
  • 2. DATA SCIENCE IS FOR BIG DATA
  • 10. MACHINE LEARNING • "computer programs that automatically improve with experience." • interdisciplinary in nature • employs techniques from the fields of computer science, statistics, and artificial intelligence, among others • algorithms which facilitate automatic improvement from • machine learning is a central aspect of data science • pattern recognition • Machine learning has a complex relationship with data mining.
  • 14. DATA SCIENCE: DAY TO DAY
  • 15. WHAT DOES FACEBOOK DO TO YOUR DATA? • learning what consumers prefer • emotional contagion study • Cookies on your browser predict who you are • Social plugins ("like", "subscribe" or "recommend" buttons) • information that Facebook sells to advertisers
  • 16. we’ve agreed to a huge amount of data being turned over and signing off on the social network’s seemingly limitless ability to do with it whatever it wants
  • 17. FACEBOOK DATA SCIENCE crawled or scraped data will be valuable and constructive for commercial, scientific, and many other fields of prediction and analysis
  • 18. FACEBOOK’S DATA PRIVACY POLICY: • …in addition to helping people see and find things that you do and share, we may use the information we receive about you… for internal operations, including troubleshooting, data analysis, testing, research and service improvement.
  • 19. OCTOPARSE Octoparse is a powerful web scraper that can scrape both static and dynamic websites with AJAX, JavaScript, cookies, etc.
  • 21. VISUAL SCRAPER • Visual Scraper is another great free web scraper with a simple point-and-click interface • collect data from the web • export the extracted data as CSV, XML, JSON or SQL files • scrape data from up to 50,000 web pages for only one user.
  • 23. FACEBOOK DATA SCIENCE USING R • R is a language and software environment for statistical computing, used for data mining and for analyzing big data. • Data science in FB using R.pdf
  • 24. • The Rfacebook package provides an interface to the Facebook API. For mining Facebook using R, the Rfacebook package provides functions that allow R to access Facebook’s API to get information about posts, comments, likes, groups that mention specific keywords, and much more.
  • 28. But: it is not much different from what we, especially statisticians, have been doing for many years
  • 29. • Much more data is digitally available than before • Inexpensive computing + Cloud + Easy-to-use programming frameworks = much easier to analyze it • Often: large-scale data + simple algorithms > small data + complex algorithms • Changes how you do analysis dramatically
  • 30. • Causation --> Correlation • The goal of analysis is often to figure out what caused what • Causation is very hard to figure out, e.g. what causes breast cancer and other diseases • Data science correlates what causes things to happen: when an earthquake will come; why students fail or pass the board exam; who gets a job after graduation and why • Using data understanding and computer science algorithms
  • 31. "Datafication": • Process of converting abstract things into concrete data, e.g., what you like represented as a stream of your likes • your "sitting posture" captured using hundreds of sensors placed in a car seat
  • 32. • Google Flu Trends • Early warning of flu outbreaks by analyzing search queries • Up to 1 or 2 weeks ahead of CDC • Analyzed 50M search queries to see which of them fit the physician visits for flu • 45 search terms used to create a single model
  • 33. DATA SCIENCE PROJECTS: ALGORITHMS, SIMULATION AND APPLICATIONS
  • 34. DATA SCIENCE PROJECTS •Determining Rice Bug Epidemic Using Decision Trees •Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System •Predicting IT Employability Using Data Mining Techniques
  • 35. DETERMINING RICE BUG EPIDEMIC USING DECISION TREES • Roland Calderon, Menchita Dumlao et al. (2016) • data mining techniques in agriculture for predicting future trends such as bug epidemic • Insect Epidemiology Data Mining (IEDM) • IEDM - Discrete Mathematics and Theoretical Computer Science (DIMACS) that aims to provide an opportunity to develop and test problem instances and other methods of testing and comparing performance of algorithms Data Science Projects:
  • 36. DETERMINING RICE BUG EPIDEMIC USING DECISION TREES • uses a decision tree • classification and prediction • represents rules • CRISP-DM methodology Data Science Projects:
  • 37. • Rice Field Insect Light Trap (RFILT) mass traps both the sexes of insect pests • insect distribution, abundance, flight patterns, timing of the application of pesticide DETERMINING RICE BUG EPIDEMIC USING DECISION TREES Data Science Projects:
  • 38. • forecasting precision of a predictive model: confusion matrix Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 39. Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 40. Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 41. Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 42. • Lunar Cycle level is the best predictor of epidemic status • followed by Vegetative level • In Vegetative stage level, 100% resulted in outbreak status Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 43. • For the Ripening stage, the next best predictor is temperature. • Over 82% of bugs occurred in the outbreak status if the temperature is less than or equal to 32 to 38 • 97.3% if the temperature is greater than 32 • For the Reproduction and Resting stage, 52.7% of bugs occurred in the infested status, and this is also considered a terminal node. Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
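The confusion matrix mentioned for this project can be sketched in a few lines. The example below is a minimal illustration in plain Python, not the study's own code: the label lists are hypothetical, and only the status names "outbreak" and "infested" come from the slides.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Share of cases where the predicted label equals the actual label."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total


# Hypothetical labels for illustration only; not data from the study.
actual = ["outbreak", "outbreak", "infested", "infested", "outbreak", "infested"]
predicted = ["outbreak", "infested", "infested", "infested", "outbreak", "outbreak"]

matrix = confusion_matrix(actual, predicted)
for (a, p), n in sorted(matrix.items()):
    print(f"actual={a:9s} predicted={p:9s} count={n}")
print(f"accuracy = {accuracy(matrix):.3f}")
```

The diagonal cells (actual equals predicted) are the correct forecasts; the off-diagonal cells show which status the model confuses with which.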
  • 44. • Evale, Digna, Menchita F. Dumlao, et al. (2016) • Comparative analysis among different data mining algorithms for attribute selection and classification • a two-phase study which aimed to predict the students’ performance in Java Programming and be able to generate recommendations Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 45. • Knowledge Discovery in Database (KDD) • Logistic Regression and Correlation-based Feature Selection were used for finding significant predictors • Classifiers such as CHAID, Exhaustive CHAID, CRT, QUEST, J48, BayesNet, NaïveBayes and JRip were implemented Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 46. • J48 has the highest percentage of prediction. • For the second phase, evolutionary prototyping was implemented • Ruby on Rails: a web-based examination module that will determine the students’ index of learning style and assess their prior knowledge in Java Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 47. • A course-content recommendation presenting the learners’ strengths and weaknesses in the subject with suggested method of learning style will be automatically generated by the system. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 48. • KDD: selection, pre-processing, transformation, mining and interpretation. • Selection - possible attributes are collected for the data set • Pre-processing - filtering and removing irrelevant data • Transformation - determining the most suited data mining technique to provide the best prediction algorithm • Mining - discovering the pattern captured through classification rules, regression models or decision trees • Evaluation or interpretation - the process of visualizing what was extracted from the models. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 49. • Waikato Environment for Knowledge Analysis (WEKA) data mining tool and IBM Statistical Package for the Social Science (SPSS). • There were 8 attributes namely gender, age, course, section, schedule and 3 academic performance for programming languages. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 50. • Attribute selection was done using Standard Regression Analysis, Forward and Backward Conditional Regression, Likelihood Ratio, and WALD • WEKA was also used to conduct pre-processing thru filtering by AttributeSelection Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 51. • Summary of Attribute Selection Result Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 52. • 2 insignificant attributes out of eight original attributes • With a critical p value of .05 (significant predictors should have a p value smaller than the critical value) • Binary Logistic Regression (SPSS) • section and course are highly insignificant, with p values of .747 and .221 respectively. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 53. • Pre-processing using attribute selection (SPSS and WEKA) • course and section were automatically removed (highly insignificant) Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 54. • CfsSubsetEvaluation - to further verify the significance of attribute gender • BestFirst method - gender was found significant with 0.239 value of merit of best subset (0 to 1, incorrectly classified instances) • 76.1% of correctly classified instances Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 55. • GreedyStepWise search method (through Cross Validation) • course and section are not found in any of the ten folds, while gender appeared in 7 out of 10 folds (70%) • significant predictors: age, gender, schedule, grade in Programming 1, grade in Programming 2, and grade in Programming 3. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 56. Summary of Accuracy of Different Algorithms tested Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 57. • J48 is the best algorithm • J48 has highest accuracy in making predictions • Also has the highest Cohen’s Kappa value which means that the prediction is strongly reliable with 64% to 81% reliability Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
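Cohen's Kappa, used above to judge how reliable the J48 predictions are, compares the observed agreement between actual and predicted classes with the agreement expected by chance. The sketch below is an illustration only: the "pass"/"fail" labels and all counts are hypothetical and are not results from the study.

```python
def cohens_kappa(matrix, labels):
    """Cohen's kappa from a nested dict: matrix[actual][predicted] = count."""
    total = sum(matrix[a][p] for a in labels for p in labels)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(matrix[label][label] for label in labels) / total
    # Chance agreement: for each label, row share times column share.
    expected = sum(
        (sum(matrix[label][p] for p in labels) / total)
        * (sum(matrix[a][label] for a in labels) / total)
        for label in labels
    )
    return (observed - expected) / (1 - expected)


# Hypothetical counts for illustration only.
labels = ["pass", "fail"]
matrix = {
    "pass": {"pass": 40, "fail": 10},
    "fail": {"pass": 5, "fail": 45},
}

kappa = cohens_kappa(matrix, labels)
print(f"kappa = {kappa:.2f}")
```

A kappa of 0 means the classifier agrees with the actual classes no more often than chance would; a kappa of 1 means perfect agreement.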
  • 58. • Piad, Keno, Menchita F. Dumlao, et al. (2016) • Knowledge discovery of databases (KDD) • CRISP-DM (Cross-Industry Standard Process for Data Mining) • Naive Bayes • Decision Tree • Ensemble Data Science Projects: Predicting IT Employability Using Data Mining Techniques
  • 59. • pre-processing • data sets : training and testing data sets • training datasets: used to generate model • testing datasets: used to determine the acceptability of the model. Data Science Projects: Predicting IT Employability Using Data Mining Techniques
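The split into training and testing data sets described above can be sketched as follows. This is a minimal illustration under stated assumptions: the 25% test fraction, the fixed seed, and the placeholder records are all hypothetical and are not taken from the study.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and cut it into (training, testing) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records; a real data set would hold one row per graduate.
records = list(range(100))
train, test = train_test_split(records)
print(len(train), "training rows,", len(test), "testing rows")
```

The model is generated from the training rows only; the testing rows are held back so that the reported accuracy reflects cases the model has not seen.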
  • 60. • Apriori algorithm - determines associated attributes that frequently occur in the data sets • decision tree and naive Bayes algorithms - used to design the predictive model • predictive model = equation or rule sets for prediction Data Science Projects: Predicting IT Employability Using Data Mining Techniques
  • 61. Data Science Projects: Predicting IT Employability Using Data Mining Techniques [Diagram: data sources (graduate tracer, student’s biographic profile, cumulative grade point average (CGPA)); 685 instances (tuples), SY 2011-2015; training and testing sets of data; WEKA and SPSS; learning; rule set or equation; instances of the testing sets]
  • 62. Data Science Projects: Predicting IT Employability Using Data Mining Techniques. Accuracy Result in Predicting IT Graduate Employability
      Algorithm           | Accuracy Result (%) | Error Estimation Rate (%)
      Naive Bayes         | 75.33               | 24.47
      J48                 | 74.95               | 25.05
      SimpleCart          | 73.01               | 26.99
      Logistic regression | 78.4                | 22.60
      Chaid               | 76.3                | 23.70
  • 63. • Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function Data Science Projects: Predicting IT Employability Using Data Mining Techniques
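As a rough illustration of the logistic function behind logistic regression, the sketch below turns a weighted sum of predictors into a probability between 0 and 1. The feature values, weights, and bias are hypothetical placeholders, not coefficients fitted in the study.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one case."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical weights and features for illustration only.
weights = [0.5, -0.25]
features = [1.0, 2.0]
probability = predict_probability(features, weights, bias=0.0)
print(f"probability = {probability:.2f}")
```

A case is usually assigned to the positive class when this probability passes a chosen cutoff, commonly 0.5.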
  • 64. Data Science Projects: Predicting IT Employability Using Data Mining Techniques. Accuracy Result in Predicting IT Specific Profession
      Algorithm        | Accuracy Result (%) | Error Estimation Rate (%)
      Chaid            | 70.1                | 29.9
      Quest            | 40                  | 60
      CRT              | 70.2                | 29.8
      Exhaustive Chaid | 70.1                | 29.9
      ID3              | 67                  | 33
      J48              | 70                  | 30
  • 65. • Classification and Regression Trees (CRT) • CRT splits the data into segments that are as homogeneous as possible with respect to the dependent variable • a node in which all cases have the same value for the dependent variable is a homogeneous, "pure" node Data Science Projects: Predicting IT Employability Using Data Mining Techniques
  • 66. • The CRT growing method: maximize within-node homogeneity • the extent to which a node does not represent a homogeneous subset of cases is an indication of impurity • a terminal node in which all cases have the same value for the dependent variable is a homogeneous node that requires no further splitting because it is “pure.” Data Science Projects: Predicting IT Employability Using Data Mining Techniques
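Node impurity can be made concrete with the Gini index, a measure commonly used by CART-style trees. The slides do not say which impurity measure was used, so Gini here is an assumption, and the label lists are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 0.0 when all cases share one class."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical nodes for illustration only.
pure_node = ["Related", "Related", "Related", "Related"]
mixed_node = ["Related", "Related", "Not Related", "Not Related"]

print(gini_impurity(pure_node))   # pure node, no further split needed
print(gini_impurity(mixed_node))  # evenly mixed two-class node
```

When growing the tree, each candidate split is scored by how much it lowers the impurity of the resulting child nodes, and splitting stops at nodes that are already pure.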
  • 67. Data Science Projects: Predicting IT Employability Using Data Mining Techniques. Results of Testing the Accuracy of Logistic Regression in Predicting Employability. Classification Table of Logistic Regression in Testing Data (N=170)
      Observed Value      | Predicted: Not Related | Predicted: Related | Percentage Correct
      Target: Related     | 22                     | 48                 | 68.5
      Target: Not Related | 72                     | 28                 | 72
      Average Percentage  |                        |                    | 70.5
  • 68. Data Science Projects: Predicting IT Employability Using Data Mining Techniques. Results of Testing the Accuracy of CRT in Predicting Specific IT Field/Job to be Employed. Classification Table of CRT in Testing Data (N=70)
      IT Classifications          | IT Specific Career | Correct Classification (%) | Error Rate (%)
      1 (IT Software)             | 34                 | 23 (67.64)                 | 11 (32.35)
      2 (IT Network/Sys/DB Admin) | 25                 | 16 (64.00)                 | 9 (36.00)
      3 (other IT related field)  | 11                 | 5 (45.45)                  | 6 (54.54)
  • 69. • RapidMiner - https://rapidminer.com/ • BigML - https://bigml.com/ • Google Cloud AutoML - https://cloud.google.com/automl/ • Paxata - https://www.paxata.com/ • Trifacta -https://www.trifacta.com/ • MLBase - http://mlbase.org/ • Auto-WEKA -http://www.cs.ubc.ca/labs/beta/Projects/autoweka/ • Driverless AI - https://www.h2o.ai/driverless-ai/ DATA SCIENCE AND MACHINE LEARNING TOOLS FOR PEOPLE WHO DON’T KNOW PROGRAMMING
  • 70. • https://studio.azureml.net/ - https://studio.azureml.net/ • MLJar - https://mljar.com/ • Amazon Lex - https://aws.amazon.com/lex/ • IBM Watson Studio - https://www.ibm.com/cloud/watson-studio • Automatic Statistician - https://www.automaticstatistician.com/index/ • KNIME - https://www.knime.com/ • FeatureLab - http://www.featurelab.co/ • MarketSwitch - http://www.experian.com/decision-analytics/marketswitch-optimization.html • Logical Glue - http://www.logicalglue.com/ • Pure Predictive - http://www.purepredictive.com/ DATA SCIENCE AND MACHINE LEARNING TOOLS FOR PEOPLE WHO DON’T KNOW PROGRAMMING
  • 71. DO YOU THINK DATA SCIENCE CAN DEVELOP YOUR RESEARCH SKILLS? AND HELP YOU DEVELOP AMAZING RESEARCH?