Predicting the “Next Big Thing” in Science
ADRIAN MLADENIĆ GROBELNIK
ADRIAN.GROBELNIK@GMAIL.COM
BRITISH INTERNATIONAL SCHOOL LJUBLJANA
LJUBLJANA, SLOVENIA
#scichallenge2017
What is this research project about?
• The aim is to build a C++ program that predicts which scientific topics will
become important in the future
• To predict the future of science, I used Machine Learning algorithms
to learn how science behaved in the past, and then used the resulting model
to predict future trends in science
• To analyse how science evolved in the past, I used data from the
recently released “Microsoft Academic Graph”, which includes 125 million
scientific articles from the year 1800 to the present
Research Hypothesis
• My research hypothesis is that the scientific topics which will
become important in the future already exist in today’s scientific
articles
• …they are just not visible yet,
• …but it is possible to identify them with Machine Learning
• The task is to find early indicators suggesting which scientific topics
in today’s literature are likely to become important in the future
Context: How does science evolve?
• The main element of science is an invention
• Inventions always happen at the beginning of a scientific process
• After an invention happens, there is a period of scientific
exploration, to prove the invention is useful
• Some inventions prove themselves, and some do not
• If an invention proves itself, new products and new research build on
ideas from the invention
• …less useful inventions usually get forgotten
Context: How to detect scientific inventions and concepts?
• Scientists are typically strict and consistent when naming things
• In the same way, inventions and other scientific concepts get names which
are then used in scientific articles
• In this project I used the names from the titles of scientific articles
to track how particular scientific topics evolve through time
• We can spot when a scientific topic appears for the first time, we can count
how frequently it appears, and we can spot when it stops being used
• …this is my basis for predicting the “next big thing” in science
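The tracking idea above can be sketched in a few lines of C++. This is only an illustration, not the project's actual code: the names `Timeline`, `tokenize`, `addTitle`, `firstYear`, `lastYear` and `totalCount` are hypothetical, the tokenizer is deliberately naive (lowercase, split on whitespace, punctuation kept), and the 5-word limit is taken from the experiment description.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Occurrences of one phrase per year. std::map keeps the years sorted,
// so the first key is the year of the first appearance.
using Timeline = std::map<int, int>;

// Naive tokenizer (assumption): lowercase the title and split on whitespace.
std::vector<std::string> tokenize(const std::string& title) {
    std::vector<std::string> words;
    std::istringstream in(title);
    std::string w;
    while (in >> w) {
        for (char& c : w)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        words.push_back(w);
    }
    return words;
}

// Count every phrase of 1..maxWords consecutive words from one title.
void addTitle(std::map<std::string, Timeline>& index,
              const std::string& title, int year, std::size_t maxWords = 5) {
    assert(maxWords >= 1);
    const std::vector<std::string> words = tokenize(title);
    for (std::size_t start = 0; start < words.size(); ++start) {
        std::string phrase;
        for (std::size_t len = 1;
             len <= maxWords && start + len <= words.size(); ++len) {
            if (len > 1) phrase += ' ';
            phrase += words[start + len - 1];
            ++index[phrase][year];
        }
    }
}

int firstYear(const Timeline& t) { assert(!t.empty()); return t.begin()->first; }
int lastYear(const Timeline& t)  { assert(!t.empty()); return t.rbegin()->first; }

int totalCount(const Timeline& t) {
    int n = 0;
    for (const auto& yearCount : t) n += yearCount.second;
    return n;
}
```

After feeding all titles through `addTitle`, a phrase would be kept as a candidate topic only if `totalCount` reaches the chosen threshold; the first and last year then give the span over which the topic was in use.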
What data do we have available?
• There are many databases of scientific articles in the world, but only some are
open and available for research
• The biggest open database of scientific articles is “Microsoft Academic Graph”,
which was released for research use in 2016
• The database size is 130 Gigabytes
• It includes references to 125 million scientific articles from the year 1800 to the present
from all areas of science
• Each scientific article in the database is described by: (a) title, (b) authors, (c) their
institutions, (d) the journal or conference where it was published, and (e) the year of publication
• Data available from: https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
The task to be solved
• The core task in this project is to use the data from over 200 years of
science to extract the early signs of a scientific topic
becoming successful
• With Machine Learning algorithms I trained a statistical model to
distinguish scientific topics which became successful from those which didn’t
• I apply the trained model to the current data (after 2010) to
predict which topics will be hot and relevant in the near future (in
the early 2020s)
Description of the experiment (1/2)
• From 125 million article titles I extracted 2.5 million candidate topics
• …each topic is described by a phrase of 1 to 5 words
• …the phrase must appear at least 100 times in the database of article titles
• Each topic is represented by a set of features (attributes) describing the
first 10 years after its appearance
• …features include the frequency and trend (slope from linear regression) of the
topic’s appearances within institutions, journals and conferences
• …each topic is described by approx. 55,000 features, represented in a feature
vector
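As a rough illustration of the two kinds of features named above, the following sketch computes a frequency and a trend from the yearly counts of one topic within one source (for example one institution or one journal). The names `SourceFeatures`, `trendSlope` and `makeFeatures` are hypothetical, and the exact feature definitions used in the project may differ; the slope here is the ordinary least-squares slope of the counts against the year offset 0, 1, 2, …

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two features for one (topic, source) pair.
struct SourceFeatures {
    double frequency;  // total number of appearances in the window
    double slope;      // least-squares trend, in appearances per year
};

// Least-squares slope of counts[i] against x = i (years since first appearance).
double trendSlope(const std::vector<double>& counts) {
    const std::size_t n = counts.size();
    assert(n >= 2);  // a slope needs at least two points
    const double meanX = (static_cast<double>(n) - 1.0) / 2.0;
    double meanY = 0.0;
    for (double c : counts) meanY += c;
    meanY /= static_cast<double>(n);

    double sxy = 0.0, sxx = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = static_cast<double>(i) - meanX;
        sxy += dx * (counts[i] - meanY);
        sxx += dx * dx;
    }
    return sxy / sxx;
}

// counts holds one value per year, e.g. the first 10 years of a topic.
SourceFeatures makeFeatures(const std::vector<double>& counts) {
    double total = 0.0;
    for (double c : counts) total += c;
    return SourceFeatures{total, trendSlope(counts)};
}
```

A positive slope means the topic is appearing more often in that source year after year; repeating this for every institution, journal and conference would give one long, mostly empty feature vector per topic.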
Description of the experiment (2/2)
• Each topic is classified either as:
• Positive, if it became popular in the past (its use increased by a factor of 2 after the first 10 years
from the topic’s first appearance), or as
• Negative, if the topic didn’t attract much attention
• We split the topics into a training set (70%) and a test set (30%)
• …where the training set is used to train the model and the test set is used to evaluate it
• For machine learning I used the Perceptron algorithm, which is relatively easy to
implement (https://en.wikipedia.org/wiki/Perceptron)
• …I used an improved version of the Perceptron (MaxMargin)
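To make the learning step concrete, here is a minimal sketch of a perceptron with a margin on sparse feature vectors. It is an assumption that this resembles the “MaxMargin” variant mentioned above; the slides do not describe the exact update rule. The names `SparseVec`, `Example` and `Perceptron`, and the `margin` and `rate` parameters, are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sparse feature vector: (feature index, value) pairs for non-zero features.
using SparseVec = std::vector<std::pair<std::size_t, double>>;

struct Example {
    SparseVec x;
    int label;  // +1 = topic became popular, -1 = it did not
};

struct Perceptron {
    std::vector<double> w;
    double bias = 0.0;

    explicit Perceptron(std::size_t numFeatures) : w(numFeatures, 0.0) {}

    // Linear score: bias + sum of weight * value over the non-zero features.
    double score(const SparseVec& x) const {
        double s = bias;
        for (const auto& f : x) {
            assert(f.first < w.size());
            s += w[f.first] * f.second;
        }
        return s;
    }

    int predict(const SparseVec& x) const { return score(x) >= 0.0 ? +1 : -1; }

    // One pass over the data. The weights are updated whenever an example
    // is misclassified or classified correctly but within the margin.
    // Returns the number of updates made in this pass.
    std::size_t trainEpoch(const std::vector<Example>& data,
                           double margin, double rate) {
        std::size_t updates = 0;
        for (const Example& ex : data) {
            if (ex.label * score(ex.x) <= margin) {
                for (const auto& f : ex.x) w[f.first] += rate * ex.label * f.second;
                bias += rate * ex.label;
                ++updates;
            }
        }
        return updates;
    }
};
```

Training would call `trainEpoch` repeatedly on the 70% training set until the number of updates stops falling, and then call `predict` on the 30% test set. With a margin of 0 this reduces to the classic perceptron.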
Key statistical results
• The statistical model, trained with the MaxMargin Perceptron
algorithm, produced the following results on the test data:
• Precision: 74%
• Recall: 72%
• F1 (the harmonic mean of both): 73%
• …this means that about 74% of the topics the model predicts to be
successful really became successful, and that the model finds about
72% of all successful topics
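For reference, the three measures can be computed from the counts of true positives, false positives and false negatives on the test set. The sketch below uses the standard definitions; the names `Metrics`, `evaluate` and `f1FromPR` are hypothetical, and successful topics are assumed to be the positive class.

```cpp
#include <cassert>

struct Metrics {
    double precision;  // share of predicted-successful topics that were successful
    double recall;     // share of successful topics that the model found
    double f1;         // harmonic mean of precision and recall
};

// F1 from precision and recall; defined as 0 when both are 0.
double f1FromPR(double precision, double recall) {
    const double sum = precision + recall;
    return sum > 0.0 ? 2.0 * precision * recall / sum : 0.0;
}

// tp: true positives, fp: false positives, fn: false negatives.
Metrics evaluate(long tp, long fp, long fn) {
    assert(tp >= 0 && fp >= 0 && fn >= 0);
    Metrics m{0.0, 0.0, 0.0};
    if (tp + fp > 0) m.precision = static_cast<double>(tp) / static_cast<double>(tp + fp);
    if (tp + fn > 0) m.recall = static_cast<double>(tp) / static_cast<double>(tp + fn);
    m.f1 = f1FromPR(m.precision, m.recall);
    return m;
}
```

Plugging in the reported values, `f1FromPR(0.74, 0.72)` gives roughly 0.73, which matches the F1 figure on this slide.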
Key descriptive results
• Looking at the resulting statistical model we can see:
• If a scientific topic gets increasingly used by important research
institutions (universities and research institutes)
• …and is getting published by important journals and conferences
• …within 10 years from the invention (when the initial mention is
spotted)
• …then we can expect increased use of the topic (by a factor
of two or more) by science and industry in the next 5 years
Examples of best topics and features
• Example Best Topics (as predicted by the model):
• Collisions, efficient, proton proton collisions, higgs boson, system, quark,
particles, hadron, mobile augmented reality, variable quantum,
advanced network, molecular dynamics simulations
• Example Best Features (as identified by the Perceptron training):
• CERN, Journal of Proteomics & Bioinformatics, Industrial Research Limited,
Circulation-cardiovascular Imaging, Molecular BioSystems, Metamaterials,
Atw-international Journal for Nuclear Power
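One plausible way to obtain a list of “best features” from a trained linear model is to rank the features by their learned weight. The sketch below shows that idea only; whether the project ranked features exactly this way is an assumption, and the function name `topFeatures` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using NamedWeight = std::pair<std::string, double>;

// Return the k features with the largest weights, highest first.
// names[i] is the human-readable name of feature i (e.g. an institution
// or a journal), weights[i] is its weight in the trained model.
std::vector<NamedWeight> topFeatures(const std::vector<std::string>& names,
                                     const std::vector<double>& weights,
                                     std::size_t k) {
    assert(names.size() == weights.size());
    std::vector<NamedWeight> ranked;
    ranked.reserve(names.size());
    for (std::size_t i = 0; i < names.size(); ++i)
        ranked.emplace_back(names[i], weights[i]);

    k = std::min(k, ranked.size());
    std::partial_sort(ranked.begin(),
                      ranked.begin() + static_cast<std::ptrdiff_t>(k),
                      ranked.end(),
                      [](const NamedWeight& a, const NamedWeight& b) {
                          return a.second > b.second;
                      });
    ranked.resize(k);
    return ranked;
}
```

A large positive weight means that early activity in that institution, journal or conference pushes the model towards predicting that a topic will become popular.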
Summary
• In this research project I analysed 125 million articles from “Microsoft
Academic Graph” from over 200 years of science
• I made a program in C++ to process 130 Gigabytes of data and to
build a machine learning model to predict which scientific topics will
become important in the future
• On the test data, the resulting model reaches an F1 score of 73%
(precision 74%, recall 72%) when predicting which scientific topics became
important in the history of science
• C++ code and detailed results are available from: https://goo.gl/8luSwz
