We built a system to predict which scientific topics will become important in the future. It uses Machine Learning algorithms to learn how science behaved in the past and applies the resulting model to predict future trends in science.
Predicting the “Next Big Thing” in Science - #scichallenge2017
1. Predicting the “Next Big Thing” in Science
ADRIAN MLADENIĆ GROBELNIK
ADRIAN.GROBELNIK@GMAIL.COM
BRITISH INTERNATIONAL SCHOOL LJUBLJANA
LJUBLJANA, SLOVENIA
#scichallenge2017
2. What is this research project about?
The aim is to make a C++ program for predicting which scientific topics will
become important in the future
To predict the future of science, I have used Machine Learning algorithms
to learn how science behaved in the past, and to use the resulting model
to predict future trends in science
To analyse how science evolved in the past, I used the data from the
recently released “Microsoft Academic Graph” which includes 125 million
scientific articles from the year 1800 to the present
3. Research Hypothesis
My research hypothesis is that the scientific topics which will
become important in the future already exist in today’s scientific
articles
…they are just not visible yet,
…but it is possible to identify them with Machine Learning
The task is to find early indicators suggesting which scientific topics
in today’s literature will likely become important in the future
4. Context: How does science evolve?
The main element of science is an invention
Inventions always happen at the beginning of a scientific process
After an invention happens, there is a period of scientific
exploration, to prove the invention is useful
Some inventions prove themselves, and some do not
If an invention proves itself, new products and further research follow,
building on ideas from the invention
…less useful inventions usually get forgotten
5. Context: How to detect scientific inventions and concepts?
Scientists are typically strict and consistent when naming things
In the same way, inventions and other scientific concepts get names which
are then used in scientific articles
In this project I have used the names from the titles of scientific articles
to track how particular scientific topics evolve through time
We can spot when a scientific topic appears for the first time, we can count
how frequently it appears, and we can spot when it stops being used
…this is my basis for predicting the “next big thing” in science
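The tracking idea above can be sketched in C++ roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the `Article` record and the `trackTopic` function are hypothetical names, and matching is a plain case-sensitive substring search on the title.

```cpp
#include <map>
#include <string>
#include <vector>

// One article title with its publication year (hypothetical record layout).
struct Article {
    int year;
    std::string title;
};

// Yearly usage of one topic: year -> number of titles containing the phrase.
using Timeline = std::map<int, int>;

// Count, per year, the titles that contain `phrase`.
// Simplification: case-sensitive substring match, no tokenization.
Timeline trackTopic(const std::vector<Article>& articles, const std::string& phrase) {
    Timeline timeline;
    for (const Article& a : articles) {
        if (a.title.find(phrase) != std::string::npos) {
            ++timeline[a.year];
        }
    }
    return timeline;
}
```

Because `std::map` keeps its keys sorted, the first year a topic appears is `timeline.begin()->first` and the last year it is used is `timeline.rbegin()->first` (for a non-empty timeline).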
6. What data do we have available?
There are many databases of scientific articles in the world, but only some are
open and available for research.
The biggest open database of scientific articles is “Microsoft Academic Graph”
which was released for research use in 2016
The database size is 130 Gigabytes
It includes references to 125 million scientific articles from the year 1800 to the present
from all areas of science
Each scientific article in the database is described by: (a) title, (b) authors and their (c)
institutions, (d) journal/conference where it was published, and (e) the year of publication
Data available from: https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
7. The task to be solved
The core task in this project is to use the data from over 200 years of
science to extract the early signs of a scientific topic
becoming successful
With Machine Learning algorithms I trained a statistical model to
distinguish scientific topics which became successful from those which didn’t
I apply the trained model to the current data (after 2010) to
predict which topics will be hot and relevant in the near future (in
the early 2020s)
8. Description of the experiment (1/2)
From 125 million article titles I extracted 2.5 million candidate topics
…each topic is described by a phrase of 1 to 5 words
…the phrase must appear at least 100 times in the database of article titles
Each topic is represented by a set of features (attributes) describing the
first 10 years after its appearance
…features include the frequency and the trend (slope from linear regression) of the
topic’s appearances within institutions, journals and conferences
…each topic is described by approx. 55,000 features, represented in a feature
vector
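The two steps on this slide, candidate extraction and the trend feature, might look roughly like the sketch below. The function names `candidateTopics` and `trendSlope` are assumptions made for illustration. Titles are split on whitespace only, and all counts are held in memory, which would need a more careful design for 125 million titles.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Count every phrase of 1..maxWords consecutive words over all titles and
// keep the phrases that occur at least minCount times.
// Simplification: no lowercasing and no punctuation handling.
std::unordered_map<std::string, int> candidateTopics(
        const std::vector<std::string>& titles, std::size_t maxWords, int minCount) {
    std::unordered_map<std::string, int> counts;
    for (const std::string& title : titles) {
        std::istringstream in(title);
        std::vector<std::string> words;
        for (std::string w; in >> w;) words.push_back(w);
        for (std::size_t start = 0; start < words.size(); ++start) {
            std::string phrase;
            for (std::size_t len = 1;
                 len <= maxWords && start + len <= words.size(); ++len) {
                if (len > 1) phrase += ' ';
                phrase += words[start + len - 1];
                ++counts[phrase];
            }
        }
    }
    std::unordered_map<std::string, int> kept;
    for (const auto& entry : counts) {
        if (entry.second >= minCount) kept.insert(entry);
    }
    return kept;
}

// Least-squares slope of yearly counts y[0..n-1] against x = 0..n-1.
// Returns 0 when fewer than two points are given.
double trendSlope(const std::vector<double>& y) {
    const std::size_t n = y.size();
    if (n < 2) return 0.0;
    const double meanX = static_cast<double>(n - 1) / 2.0;
    double meanY = 0.0;
    for (double v : y) meanY += v;
    meanY /= static_cast<double>(n);
    double cov = 0.0;
    double var = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = static_cast<double>(i) - meanX;
        cov += dx * (y[i] - meanY);
        var += dx * dx;
    }
    return cov / var;
}
```

With the thresholds stated above, the call would be `candidateTopics(titles, 5, 100)`. A separate `trendSlope` value could then be computed from a topic's yearly counts within each institution, journal or conference over its first 10 years.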
9. Description of the experiment (2/2)
Each topic is classified either as:
Positive, if it became popular in the past (its use increased by a factor of 2 after the first 10 years
from the topic’s first appearance), or as
Negative, if the topic didn’t attract much attention
I split the topics into a training set (70%) and a test set (30%)
…where the training set is used to train the model and the test set is used to evaluate it
For machine learning I used the Perceptron algorithm which is relatively easy to
implement (https://en.wikipedia.org/wiki/Perceptron)
…I used an improved version of the Perceptron (MaxMargin)
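As a rough illustration of the learning step, here is a small perceptron with a margin over sparse feature vectors. The slides do not spell out the MaxMargin variant, so the update rule below (update whenever the signed score is not above a chosen margin) is an assumption, as are the names `SparseVec`, `Example` and `Perceptron`.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sparse feature vector: (feature index, value) pairs.
using SparseVec = std::vector<std::pair<std::size_t, double>>;

struct Example {
    SparseVec features;
    int label;  // +1 = positive topic, -1 = negative topic
};

struct Perceptron {
    std::vector<double> w;
    double bias = 0.0;

    explicit Perceptron(std::size_t numFeatures) : w(numFeatures, 0.0) {}

    // Linear score: bias + sum of weight * value over the non-zero features.
    double score(const SparseVec& x) const {
        double s = bias;
        for (const auto& f : x) s += w[f.first] * f.second;
        return s;
    }

    int predict(const SparseVec& x) const { return score(x) > 0.0 ? +1 : -1; }

    // Assumed margin rule: update on every example whose signed score
    // label * score is not above `margin`; margin = 0 gives the classic
    // perceptron update.
    void train(const std::vector<Example>& data, int epochs, double margin, double rate) {
        for (int e = 0; e < epochs; ++e) {
            for (const Example& ex : data) {
                if (ex.label * score(ex.features) <= margin) {
                    for (const auto& f : ex.features) {
                        w[f.first] += rate * ex.label * f.second;
                    }
                    bias += rate * ex.label;
                }
            }
        }
    }
};
```

A sparse representation fits this setting because each topic has about 55,000 possible features but presumably only a small share of them is non-zero for any one topic.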
10. Key statistical results
The statistical model, trained with the MaxMargin Perceptron
algorithm, produced the following results on the test data:
Precision: 74%
Recall: 72%
F1 (the harmonic mean of both): 73%
…this means that approx. 74% of the topics the model predicts as
successful really were successful, and that the model finds approx. 72% of
all successful topics
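For reference, the three measures can be computed from the counts of true positives, false positives and false negatives on the test set. The sketch below uses the standard definitions; the names `Metrics`, `f1Score` and `evaluate` are illustrative and not taken from the project's code.

```cpp
// Precision, recall and F1 for the positive class.
struct Metrics {
    double precision;
    double recall;
    double f1;
};

// F1 is the harmonic mean of precision and recall.
double f1Score(double precision, double recall) {
    const double sum = precision + recall;
    return sum > 0.0 ? 2.0 * precision * recall / sum : 0.0;
}

// tp: predicted positive and truly positive
// fp: predicted positive but truly negative
// fn: predicted negative but truly positive
Metrics evaluate(long tp, long fp, long fn) {
    Metrics m;
    m.precision = (tp + fp) > 0 ? static_cast<double>(tp) / (tp + fp) : 0.0;
    m.recall = (tp + fn) > 0 ? static_cast<double>(tp) / (tp + fn) : 0.0;
    m.f1 = f1Score(m.precision, m.recall);
    return m;
}
```

Plugging in the reported values, `f1Score(0.74, 0.72)` gives roughly 0.73, which is consistent with the F1 figure on this slide.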
11. Key descriptive results
Looking at the resulting statistical model we can see:
If a scientific topic gets increasingly used by important research
institutions (universities and research institutes)
…and is getting published by important journals and conferences
…within 10 years from the invention (when the initial mention is
spotted)
…then, we can expect the increased use of the topic (by a factor
of two or more) by science and industry in the next 5 years
12. Examples of best topics and features
Example Best Topics (as predicted by the model):
Collisions, efficient, proton proton collisions, higgs boson, system, quark,
particles, hadron, mobile augmented reality, variable quantum,
advanced network, molecular dynamics simulations
Example Best Features (as identified by the Perceptron training):
CERN, Journal of Proteomics & Bioinformatics, Industrial Research Limited,
Circulation-cardiovascular Imaging, Molecular BioSystems, Metamaterials,
Atw-international Journal for Nuclear Power
13. Summary
In this research project I analyzed 125 million articles from “Microsoft
Academic Graph” from over 200 years of science
I made a program in C++ to process 130 Gigabytes of data and to
build a machine learning model to predict which scientific topics will
become important in the future
The resulting model reaches an F1 score of 73% when predicting which
scientific topics became important in the history of science
C++ code and detailed results are available from: https://goo.gl/8luSwz