Predictive Modeling: Research Tasks




                            Nilitis, LLC. © 2012
1. Netflix Database
http://cms.uhd.edu/faculty/chenp/class/4319/project/netflixfiles.html

Netflix, Inc.: American provider of on-demand
Internet streaming media and flat-rate
DVD-by-mail rental


Training data set:
100,480,507 ratings
480,189 users
17,770 movies
Data set entry:
<user (ID), movie (ID), date of grade (yyyy-mm-dd), grade(1-5)>


The BellKor Solution:
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
The Big Chaos Solution:
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf
The Pragmatic Theory Solution:
http://www.netflixprize.com/assets/GrandPrize2009_BPC_PragmaticTheory.pdf


 User-based collaborative filtering
 - Look for users who share the same rating patterns
 - Use the ratings from those users to calculate a prediction

 Item-based collaborative filtering
 - Build an item-item matrix determining relationships between
          pairs of items
 - Using the matrix, and the data on the current user, infer his
          taste
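
The item-based variant above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the user and movie IDs are made up, similarity is plain cosine over co-rating users, and the prediction is a similarity-weighted average of the user's own grades. It is not the method of any of the prize-winning solutions.

```python
from math import sqrt

# Toy ratings: user -> {movie: grade on the 1-5 scale}. All IDs are made up.
ratings = {
    "u1": {"m1": 5, "m2": 4, "m3": 1},
    "u2": {"m1": 4, "m2": 5, "m3": 2},
    "u3": {"m1": 1, "m2": 2, "m3": 5},
    "u4": {"m1": 5, "m3": 1},  # u4 has not rated m2 yet
}

def item_vectors(ratings):
    """Invert user -> movie -> grade into movie -> user -> grade."""
    items = {}
    for user, graded in ratings.items():
        for movie, grade in graded.items():
            items.setdefault(movie, {})[user] = grade
    return items

def cosine(a, b):
    """Cosine similarity computed over the users who rated both movies."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict(ratings, user, movie):
    """Similarity-weighted average of the grades the user already gave."""
    items = item_vectors(ratings)
    target = items.get(movie, {})
    num = den = 0.0
    for other, grade in ratings[user].items():
        if other == movie:
            continue
        sim = cosine(target, items[other])
        num += sim * grade
        den += abs(sim)
    return num / den if den else None

prediction = predict(ratings, "u4", "m2")
```

On real data the item-item similarities would be precomputed once and stored, rather than recomputed on every call as done here.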



A note from the donor regarding the Netflix data:
"Thank you for your interest in the Netflix Prize dataset. The dataset is no
longer available."

Robust De-anonymization of Large Sparse Datasets
http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf



2. EEG Database Data Set
   http://archive.ics.uci.edu/ml/datasets/EEG+Database

This data comes from a large study examining EEG
correlates of genetic predisposition to alcoholism.

64 electrodes were placed on each subject's scalp
and sampled at 256 Hz for 1 second.

There were two groups of subjects: alcoholic and
control.

Each subject was exposed to either a single
stimulus (S1) or to two stimuli (S1 and S2).

122 subjects; each subject completed 120 trials
in which different stimuli were shown.




EEG / ERP data available for free public download
http://sccn.ucsd.edu/~arno/fam2data/publicly_available_EEG_data.html

[Figure: example plots of a control subject and an alcoholic subject (panels: Control, Alcoholic)]




http://www.ingber.com/ - webpage of Lester Ingber
Open question: use Ingber’s Canonical Momentum Indicator, something else, or the raw data?

3. Berlin Database of Emotional Speech
 http://database.syntheticspeech.de/

6 basic emotions: anger, joy,
sadness, fear, disgust and boredom
+ neutral speech

Ten professional native German
actors (5 female and 5 male)
simulated these emotions,
producing 10 utterances (5 short
and 5 longer sentences)

The intended emotion was recognized by at least
80% of the listeners




Voice Emotion Recognition:

            Audio Stream -> Feature Extraction -> Classifier -> Emotion




Feature Extraction: “openEAR”
http://sourceforge.net/projects/openart/?source=dlp

Take settings from the openEAR “emobase” config files and related articles;
possibly add a feature selection step (state of the art:
sequential feature selection)
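
As a rough illustration of sequential (forward) feature selection, the sketch below greedily adds whichever feature most improves a score and stops when nothing helps. The feature names and the scoring function are hypothetical placeholders, not actual openEAR feature names; in practice the score would be something like cross-validated classifier accuracy.

```python
def sequential_forward_selection(features, score, max_features):
    """Greedily add the feature that most improves score(subset).

    `score` is a caller-supplied function taking a list of feature names.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining and len(selected) < max_features:
        # Score every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Hypothetical scoring: three features help, every other one hurts slightly.
USEFUL = {"pitch_mean": 0.30, "energy_std": 0.20, "mfcc_1": 0.10}

def toy_score(subset):
    return sum(USEFUL.get(name, -0.05) for name in subset)

chosen, final_score = sequential_forward_selection(
    ["zcr", "pitch_mean", "mfcc_1", "energy_std", "noise"],
    toy_score,
    max_features=4,
)
```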

Classifier: the state of the art is an SVM with a polynomial or RBF kernel
(LIBSVM is included in the openEAR package)
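
For reference, the two kernels can be written out directly. The sketch follows the usual LIBSVM parameterization (gamma, coef0, degree); the default values below are arbitrary choices for illustration, not recommended settings.

```python
from math import exp

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=0.0):
    """(gamma * <x, y> + coef0) ** degree"""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)
```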



4. Wikipedia page-to-page link database
 http://haselgrove.id.au/wikipedia.htm

Total pages: 5,716,808
Total links: 130,160,392



Google PageRank technology:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.5427


 - 85% likelihood of choosing a random link from the page
 - 15% likelihood of jumping to a page chosen at random from the entire web
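
The 85% / 15% split can be turned into a small power-iteration sketch. The four-page link graph is a made-up toy, and treating pages without outgoing links as linking to every page is one common convention assumed here, not something the slide specifies.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> rank."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # 15% of the time the surfer jumps to a page chosen uniformly at random.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # 85% of the time the surfer follows a random outgoing link.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
```

In the toy graph, page D has no incoming links, so its rank settles at the random-jump share of 0.15 / 4.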




5. Detecting Malicious URLs
 http://sysnet.ucsd.edu/projects/url/


 about 2.4 million URLs
 3.2 million features


Estimating covariance matrix for
high-dimensional data

Linear implementation of SVM
(LIBLINEAR)
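
To show why a linear model suits this kind of data, here is a minimal sketch of a linear SVM trained on sparse feature dictionaries. It uses Pegasos-style stochastic subgradient descent on the hinge loss, which is a stand-in and not the solver LIBLINEAR uses. It assumes each URL activates only a small fraction of the features, and the feature indices and samples are made up.

```python
def train_linear_svm(samples, epochs=20, lam=0.01):
    """samples: list of (features, label); features is {index: value},
    label is +1 (malicious) or -1 (benign). No bias term is learned."""
    weights = {}
    step = 0
    for _ in range(epochs):
        for features, label in samples:
            step += 1
            eta = 1.0 / (lam * step)
            score = sum(weights.get(i, 0.0) * v for i, v in features.items())
            # Regularization: shrink every stored weight by (1 - eta * lam).
            shrink = 1.0 - eta * lam
            for i in weights:
                weights[i] *= shrink
            # Hinge loss: update only when the margin is violated.
            if label * score < 1.0:
                for i, v in features.items():
                    weights[i] = weights.get(i, 0.0) + eta * label * v
    return weights

def predict(weights, features):
    score = sum(weights.get(i, 0.0) * v for i, v in features.items())
    return 1 if score >= 0 else -1

# Hypothetical sparse samples: only non-zero features are stored.
training = [
    ({0: 1.0, 1: 1.0}, +1),
    ({2: 1.0, 3: 1.0}, -1),
    ({0: 1.0}, +1),
    ({2: 1.0}, -1),
]
model = train_linear_svm(training)
```

Storing only the non-zero entries keeps memory proportional to the features actually seen, rather than to the full 3.2 million dimensions.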




6. Pseudo Periodic Synthetic Time Series Data Set
   http://archive.ics.uci.edu/ml/datasets/Pseudo+Periodic+Synthetic+Time+Series




       + Branch and Bound evaluation




An Indexing Scheme for Fast Similarity Search in Large Time Series Databases
http://www.cs.rutgers.edu/~pazzani/Publications/ssdb99.pdf
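
One simple way to prune with a bound in similarity search is early abandoning: stop accumulating a distance as soon as it exceeds the best match found so far. The sketch below illustrates that idea only; it is not the indexing scheme from the paper above, and the series are made-up toy values.

```python
def early_abandon_sq_dist(query, candidate, best_so_far):
    """Squared Euclidean distance, or None once it reaches best_so_far."""
    total = 0.0
    for q, c in zip(query, candidate):
        total += (q - c) ** 2
        if total >= best_so_far:
            return None  # cannot beat the current best, stop early
    return total

def nearest_series(query, database):
    """Return (index, squared distance) of the closest series."""
    best_index, best = None, float("inf")
    for index, series in enumerate(database):
        distance = early_abandon_sq_dist(query, series, best)
        if distance is not None:
            best_index, best = index, distance
    return best_index, best

database = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 5.0],
    [9.0, 9.0, 9.0, 9.0],
]
match_index, match_distance = nearest_series([1.0, 2.0, 3.0, 4.2], database)
```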

Other Datasets
Individual household electric power consumption Data Set
         http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption
Bank Marketing Data Set
         http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Solar Flare Data Set
          http://archive.ics.uci.edu/ml/datasets/Solar+Flare
Forest Fires Data Set
         http://archive.ics.uci.edu/ml/datasets/Forest+Fires
Arrhythmia Data Set
         http://archive.ics.uci.edu/ml/datasets/Arrhythmia
Communities and Crime Data Set
         http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized
Census Income Data Set
         http://archive.ics.uci.edu/ml/datasets/Census+Income




