Open data and the impact of research
1.
2. Overview
• Open science: opportunities and problems
• Machine learning and linked data, plus “event data”
• Data for Impact
• Techniques in the project and beyond…
• Text analysis: ‘topic modelling’ and ‘word embedding’
• (‘named entity recognition’)
• Social media data: Twitter as a source of conversations.
3. Open science for whom?
[Figure: Open science as an umbrella term. Labels: open to society; open to non-scientists; open to science; open to innovation. Axis annotations: “<- participation?” and “Less demarcation ->”.]
Nolin, J. & Nelhans, G. (n.d.). Input and output legitimacy within the shift to Open science.
To be submitted to Science and Public Policy.
4. Merton’s norm system of science
• Communalism
• Universalism
• Disinterestedness
• Organized Scepticism
Merton, R. K. 1973 (1942). The Normative Structure of Science. In The sociology of science:
theoretical and empirical investigations. Chicago: The University of Chicago Press. 267-278.
5. CUDOS vs. PLACE
CUDOS:
• Communalism
• Universalism
• Disinterestedness
• Originality
• Scepticism
PLACE:
• Proprietary
• Local
• Authoritarian
• Commissioned
• Expert
Ziman, J. (1984). An Introduction to science studies. Cambridge: Cambridge University Press.
Mitroff, I. (1974). Norms and Counter-Norms in a Select Group of the Apollo Moon Scientists:
A Case Study of the Ambivalence of Scientists. American Sociological Review 39, 579-595.
6. Open Linked Data (OLD)
Resource Description Framework Schema (RDF), which enables the creation of ontologies (W3C, n.d.)
• relational - heterogeneous data can be connected.
• explainable - every step of the classificatory schema is transparent.
• global - it is based on World Wide Web Consortium (W3C) standards.
Wikidata: Can link ORCID, CrossrefID, Google ScholarID
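The "relational" and "explainable" properties above can be illustrated with a minimal sketch of subject-predicate-object triples, written in plain Python rather than with an RDF library. All identifiers, predicate names and values below are made up for illustration; they are not taken from Wikidata or from the project.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object). Every identifier and
# predicate name here is invented for illustration only.
TRIPLES = [
    ("ex:researcher1", "ex:orcid", "0000-0000-0000-0001"),
    ("ex:researcher1", "ex:googleScholarId", "AAAAAAAAAAAJ"),
    ("ex:paper1", "ex:doi", "10.9999/example.1"),
    ("ex:paper1", "ex:author", "ex:researcher1"),
    ("ex:guideline1", "ex:cites", "ex:paper1"),
]


def build_index(triples):
    """Index triples as subject -> predicate -> list of objects."""
    index = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        index[subject][predicate].append(obj)
    return index


def follow(index, start, path):
    """Follow a chain of predicates from `start`; return all objects reached."""
    frontier = [start]
    for predicate in path:
        frontier = [obj for node in frontier for obj in index[node][predicate]]
    return frontier


index = build_index(TRIPLES)
# Heterogeneous records (guideline, paper, person) are connected, and the
# path itself documents every step that was taken to reach the answer.
orcids = follow(index, "ex:guideline1", ["ex:cites", "ex:author", "ex:orcid"])
print(orcids)
```

The list of predicates passed to `follow` is the whole explanation of how the result was obtained, which is the sense in which each step stays transparent.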
7. Science Citation Index
• Introduced in Science 1955 by
Eugene Garfield
• The citation as a construction
• Citation ≠ reference
Garfield, E. 1955. Citation Indexes for Science: A New Dimension in Documentation through Association of
Ideas. Science 122 (3159):108-111
[Figure: timeline with the labels “Time”, “Citation” and “Reference”.]
8. Google Search Index
• Introduced online 1997
• HTML links as references
• PageRank, ranking web pages based on links.
• Recursive citation impact indicators date back to Pinski and Narin (1976).
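The idea of a recursive, link-based indicator can be sketched as a small power iteration in the style of PageRank. This is a minimal illustration, not Google's implementation: the damping factor of 0.85 is the conventionally quoted value, and the three-page graph is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping node -> list of link targets."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node passes its rank on, split evenly over its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a node without out-links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: pages "a" and "b" both link to "c"; "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

A link from a highly ranked page counts for more than a link from a poorly ranked one, which is what makes the indicator recursive.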
11.
Data source               Event type
Cambia Lens               Citations in Patents
Crossref Metadata         Links to DataCite registered content
DataCite Metadata         Links to Crossref registered content
F1000Prime                Recommendations of research publications
Hypothes.is               Annotations in Hypothes.is
Newsfeed                  Discussed in blogs and media
Reddit                    Discussed on Reddit
Reddit Links              Discussed on sites linked to in subreddits
Stack Exchange Network    Discussed on StackExchange sites
Twitter                   Mentions in tweets
Wikipedia                 References on Wikipedia pages
Wordpress.com             Discussed on Wordpress.com sites
13. Data4Impact: the basics
• Call: CO-CREATION-08-2016-2017: Better integration of evidence on
the impact of research and innovation in policy making
• Expected impacts:
• Improved monitoring of R&I activities: new indicators for assessing research
and innovation performance, including the impact of research and innovation
policies
• Prove value to society: determining the societal impact of research and
innovation funding in order to better justify research and innovation spending
Data4Impact addresses key challenges and expected impacts of CO-CREATION-08-2016-2017 through a
data-driven approach
Data4Impact has received funding from the European Union’s Horizon 2020 research and innovation programme
under grant agreement No 770531.
14. Where? Start with an individual
Individual level
Who participated in the programme?
Who were members of the extended team?
Organisation/team level
Research teams in universities & research centres;
Small companies and large enterprises
Project/programme level
Data aggregated at project or programme level
Analytical dimensions
Within researchers themselves; between researchers;
between researchers and organisations; between
organisations; between projects; between programmes
Key questions:
- Whom exactly did the programme attract?
- What happened during and after the projects?
- What was the impact?
15. How? Build a Knowledge Graph, Integrate Data
Source: Ontotext
17. Key achievements
Data4Impact offers unique coverage of data sources, with an aim to link them through specific entities
Data4Impact covers all key stages of the R&I lifecycle in the health domain, i.e. basic research ->
translational & applied research -> innovation & uptake on the market -> clinical practice & public health
New indicators and lines of thinking on academic impact were investigated
• The funder and society perspective: funding timely and relevant research? Do the ‘right thing’ by
funding rare topics?
• If a funder enters an area where few others invest, does this imply stronger impact?
• How does this interact with the researcher/organization perspective?
Data4Impact was first/one of the first to track data to medium- and long-term economic and
societal/health impacts, i.e. link previous project activities to events that happened recently
18. FP7/H2020 Projects – DATA
CORDIS
- Call document
- Project description
- Final or periodic project reports (project summary)
Scholarly publications deriving from each project
Patents
Results in Brief – Expected Impact
Automatic extraction of pertinent info from associated documents (NLP), and metadata
19.
20. D4I Topic Modelling: Academic Impact
Publications
• > 5 million publications
• H2020 and FP7 projects
• 20% of sample from 40+ funders of D4I
[Diagram labels: Project Reports; Deep Learning; NLP; Expert; 469 Topics; 10 major categories]
21. Academic impact: Topic modelling
• completely bottom-up approach
• very little domain knowledge needed (most important sources for documents)
• granularity
• each document is associated with a list of topics (and a weight for each) -> fully flexible indicators
• keywords
• each topic is associated with keywords -> topic similarity
• removes programmatic structure
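The two derived quantities named above, indicators built from document-topic weights and similarity between topics based on their keywords, can be sketched as follows. The documents, weights and keyword sets are hypothetical, and Jaccard overlap is only one possible similarity measure; the slides do not say which one the project used.

```python
from collections import defaultdict

# Hypothetical document-topic weights (each document's weights sum to 1).
DOCUMENTS = [
    {"year": 2015, "topics": {"t1": 0.7, "t2": 0.3}},
    {"year": 2015, "topics": {"t2": 0.6, "t3": 0.4}},
    {"year": 2016, "topics": {"t1": 0.5, "t3": 0.5}},
]

# Hypothetical top keywords per topic.
KEYWORDS = {
    "t1": {"gut", "microbiota", "fecal"},
    "t2": {"gut", "intestinal", "fecal"},
    "t3": {"soil", "organic", "carbon"},
}


def topic_weight_by_year(documents):
    """Aggregate document-topic weights into a year -> topic -> total weight table."""
    totals = defaultdict(lambda: defaultdict(float))
    for doc in documents:
        for topic, weight in doc["topics"].items():
            totals[doc["year"]][topic] += weight
    return {year: dict(topics) for year, topics in totals.items()}


def keyword_similarity(topic_a, topic_b, keywords):
    """Jaccard overlap of two topics' keyword sets (an assumed measure)."""
    a, b = keywords[topic_a], keywords[topic_b]
    return len(a & b) / len(a | b)


by_year = topic_weight_by_year(DOCUMENTS)
print(by_year[2015])
print(keyword_similarity("t1", "t2", KEYWORDS))
```

Because the weights live at document level, the same table can be regrouped by funder, programme or country instead of by year, which is what makes the indicators flexible.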
22.
topic_id  tidx  Word 1       Word 2       Word 3       Word 4      Word 5
51        6.50  abundance    microbial    community    diversity   samples
112       5.15  gut          microbiota   fecal        intestinal  burn
52        4.26  soil         content      organic      water       carbon
34        4.10  exosomes     derived      cea          exosome     extracellu…
391       4.05  participan…  emotional    self         negative    positive
454       3.58  surface      nanopartic…  film         materials   polymer
242       3.52  energy       spectra      peak         field       band
416       3.48  surface      force        layer        structure   angle
95        3.13  strategy     decision     video        game        strategies
170       3.05  stability    stable       balance      fall        falls
219       3.00  protocol     scheme       transmissi…  communic…   energy
76        2.96  species      long         specimens    dorsal      lateral
107       2.90  seed         rice         seeds        plant       yield
48        2.60  plants       plant        root         leaves      leaf
23. RplotTopicTrends10pct
Topic id  Word 1, Word 2, Word 3
7         exercise, training, post
9         cats, domestic, cat
10        mir, mirnas, mirna
25        china, chinese, shanghai
32        climate, area, areas
34        exosomes, derived, cea
38        sea, depth, marine
48        plants, plant, root
51        abundance, microbial, community
52        soil, content, organic
65        included, review, meta
76        species, long, specimens
82        movement, position, motion
83        stimulation, electrode, electrodes
84        antenna, radio, sad
95        strategy, decision, video
107       seed, rice, seeds
112       gut, microbiota, fecal
131       autophagy, atg, autophagic
170       stability, stable, balance
Temporality
24. Academic Impact: Safe Bets
Safe bet: a topic with a strong presence every year (weight more than a st.dev. above the average)
Topic
Antibiotic resistant infections
Cardiac (ventricular) remodelling
Community-based health promotion strategies
Health literacy in primary health care
Malaria and leishmaniasis
Organic chemistry synthesis
25. Academic Impact: Emerging Topics
Emerging: a topic with low presence before 2015 that is now growing “much” faster than the average.
Topic
Antibiotic resistant bacteria
Chagas disease
Chemometric analysis of volatile compounds
Complementary and alternative medicine
Fluorescein isothiocyanate (FITC)
Hormonal disorders
Immortalised cell lines
Pulmonary hypertension
Sleep apnea
T-cell mediated inflammatory skin diseases
Teratology
26. Academic Impact: Hibernating Giants
Hibernating Giant: a topic that used to be strong up to [2011-2013] and is now consistently at low levels
Topic
Enhancer-Binding Protein Complexes
Hydrogen bonds and coordination geometry
Hydrogen bonds and cyclohexane conformation
Minority health and health care disparities
Molecular dynamics and protein function
Regulation of protein function
Regulatory T cell function and immune system
Use of Arabidopsis thaliana as a plant model
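The three categories on slides 24 to 26 can be read as simple rules over yearly topic weights. The sketch below is one possible reading, not the project's actual procedure, and every threshold in it is an assumption: "strong" means more than one population standard deviation above that year's average across topics, "low" means below that year's average, "much faster" means a growth rate more than one standard deviation above the average growth rate, and the cut-off years are 2013 and 2015. The data are invented.

```python
from statistics import mean, pstdev


def yearly_stats(weights):
    """Per year: (mean, population st.dev.) of the weights across all topics."""
    years = sorted({year for series in weights.values() for year in series})
    return {
        year: (
            mean([series[year] for series in weights.values()]),
            pstdev([series[year] for series in weights.values()]),
        )
        for year in years
    }


def classify_topics(weights, strong_until=2013, emerging_from=2015):
    """Label each topic; all thresholds are assumptions, see the lead-in."""
    stats = yearly_stats(weights)
    years = sorted(stats)
    last = years[-1]

    def strong(topic, year):
        avg, sd = stats[year]
        return weights[topic][year] > avg + sd

    def low(topic, year):
        return weights[topic][year] < stats[year][0]

    # Average yearly growth from the cut-off year to the last year.
    slopes = {
        topic: (series[last] - series[emerging_from]) / (last - emerging_from)
        for topic, series in weights.items()
    }
    slope_cut = mean(list(slopes.values())) + pstdev(list(slopes.values()))

    early = [y for y in years if y <= strong_until]
    late = [y for y in years if y > strong_until]
    before = [y for y in years if y < emerging_from]

    labels = {}
    for topic in weights:
        if all(strong(topic, y) for y in years):
            labels[topic] = "safe bet"
        elif all(strong(topic, y) for y in early) and all(low(topic, y) for y in late):
            labels[topic] = "hibernating giant"
        elif all(low(topic, y) for y in before) and slopes[topic] > slope_cut:
            labels[topic] = "emerging"
        else:
            labels[topic] = "other"
    return labels


YEARS = range(2012, 2018)
WEIGHTS = {  # hypothetical yearly topic weights
    "steady": {y: 10 for y in YEARS},
    "rising": {2012: 1, 2013: 1, 2014: 1, 2015: 1, 2016: 5, 2017: 9},
    "fading": {2012: 10, 2013: 10, 2014: 1, 2015: 1, 2016: 1, 2017: 1},
    "flat_a": {y: 2 for y in YEARS},
    "flat_b": {y: 2 for y in YEARS},
}
labels = classify_topics(WEIGHTS)
print(labels)
```

Changing any of the assumed thresholds will move topics between categories, so the cut-offs should be treated as parameters to report alongside the results.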
27. Machine learning and OLD
• While AI and ML have several advantages in certain empirical
domains of classification where pattern recognition is at the forefront,
these methods are less useful for connecting heterogeneous types of
data.
• Black-boxed
• By combining linked data with machine learning exercises we can(?)
get the best out of each approach.
28. Clinical guidelines
• Clinical guidelines, systematic reviews and treatment
recommendation documents provide traces of clinical and
professional practice
• Proprietary data from Minso Solutions AB, which maintains a
database, Clinical Impact (CI:TM)
• (Later, working with WHO, Cochrane, NICE data, also available in
PubMed)
• The coverage is nearly complete at the government level for
Sweden, Denmark, Norway, Germany (at the S3 level), and the
UK (NICE and SIGN guidelines), as well as good coverage of WHO
guideline documents and Cochrane Systematic Reviews.
• In total, 855 clinical guidelines contained 3,684 (2,073
fractional) references that were matched to 1,781 publications
found in the D4I database.
29. Funder (EC breakdown)
Funder_type               Number (full)  Number (fract.)
EC_funder (FP7/H2020)     115            78.2
European nat’l funders    1,859          1,317.9
International funders     1,710          676.9
Total sum                 3,684          2,073.0

EC Funder                 Number (full)  Number (fract.)
EC_FP7-CORE               74             49.9
EC_FP7-EXTENDED           28             18.2
EC_H2020-EXTENDED         1              0.1
EC_other                  12             10.0
Total sum                 115            78.2

”Top 20” Eur/Int
Funder_full                               Funder_country  Number (full)  Number (fract.)
National Institutes of Health             US              1,645          624.6
Medical Research Council                  UK              585            452.4
Wellcome Trust                            UK              555            416.9
NHMRC – Nat’l Health and Medical Res.     AU              156            85.5
Cancer Research UK                        UK              122            85.6
RCUK - Research Councils UK               UK              85             37.7
Chief Scientist Office                    UK              82             66.5
EC_FP7-CORE                               EU              74             49.9
British Heart Foundation                  UK              69             34.6
Swiss National Science Foundation         CH              64             41.9
Arthritis Research UK                     UK              29             27.5
World Health Organization                 Int.            29             27.3
EC_FP7-EXTENDED                           EU              28             18.2
AKA - Academy of Finland                  FIN             27             9.8
Biotechnology and Biological Sci. R.C.    UK              15             9.6
EC_other                                  EU              12             10.0
NWO - Netherlands Org. for Sci. Res.      NE              12             6.6
Austrian Science Fund FWF                 AT              11             9.1
ARC - Australian Research Council         AU              10             5.1
Other (N=26 funders)                      -               74             54
Sum                                       -               3,684          2,073
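The difference between the "full" and "fract." columns can be illustrated with a short sketch. The slides do not spell out the counting scheme, so this assumes the common bibliometric convention: under full counting every funder acknowledged by a cited publication gets one credit, while under fractional counting the single credit is split equally among that publication's funders. The funder lists below are invented.

```python
from collections import defaultdict


def count_by_funder(references):
    """Return (full, fractional) counts per funder.

    `references` is a list with one entry per cited publication, each entry
    being the list of funders acknowledged by that publication.
    """
    full = defaultdict(int)
    fractional = defaultdict(float)
    for funders in references:
        unique = set(funders)
        for funder in unique:
            full[funder] += 1                        # one credit per funder
            fractional[funder] += 1.0 / len(unique)  # one credit shared by all
    return dict(full), dict(fractional)


# Hypothetical example: four references with one to three funders each.
REFERENCES = [
    ["Funder A", "Funder B"],
    ["Funder A"],
    ["Funder A", "Funder B", "Funder C"],
    ["Funder B"],
]
full, fractional = count_by_funder(REFERENCES)
print(full)
print(fractional)
```

Under this convention the fractional counts add up to the number of references, while the full counts add up to the number of funder-reference pairs, which is why the fractional total is always the smaller of the two.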
30. Topical analysis of reference contexts
congue risus feugiat ref264 tincidunt lorem nullam
In the generated topic model, each word is associated with a
probability distribution over topics.
For each reference, a symmetric context window of size k is
used as a pseudo-document, and the most probable topic is
calculated for that context window.
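The context-window step can be sketched as follows. This is a simplified illustration, not the project's inference procedure: the scoring (summing per-word topic probabilities and normalising) is an assumed stand-in for proper topic inference, the word-topic probabilities are invented, and the marker pattern `ref` or `aref` followed by digits is inferred from the examples on the slides.

```python
import re

# Reference markers as they appear on the slides, e.g. "ref264" or "aref15080825".
REFERENCE_MARKER = re.compile(r"a?ref\d+")


def reference_positions(tokens):
    """Indices of tokens that are reference markers."""
    return [i for i, token in enumerate(tokens) if REFERENCE_MARKER.fullmatch(token)]


def context_window(tokens, position, k):
    """The k tokens before and the k tokens after `position`, marker excluded."""
    return tokens[max(0, position - k):position] + tokens[position + 1:position + 1 + k]


def most_probable_topic(window, word_topics):
    """Sum per-word topic probabilities over the window and return (topic, share).

    Assumption: a simple additive score, normalised to a share, stands in for
    the topic model's own inference on the pseudo-document.
    """
    totals = {}
    for word in window:
        for topic, probability in word_topics.get(word, {}).items():
            totals[topic] = totals.get(topic, 0.0) + probability
    if not totals:
        return None, 0.0
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


# Hypothetical word-topic probabilities for the placeholder sentence.
WORD_TOPICS = {
    "risus": {1: 0.9, 2: 0.1},
    "feugiat": {1: 0.2, 2: 0.8},
    "tincidunt": {1: 0.7, 2: 0.3},
    "lorem": {1: 0.6, 2: 0.4},
}
tokens = "congue risus feugiat ref264 tincidunt lorem nullam".split()
position = reference_positions(tokens)[0]
window = context_window(tokens, position, k=2)
topic, share = most_probable_topic(window, WORD_TOPICS)
print(window, topic, share)
```

The window size k controls the trade-off between capturing enough context to identify a topic and drifting into text that belongs to a neighbouring reference.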
31. Asthma, a chronic respiratory condition affecting 300 million people globally ( aref15080825 ), causes inflammation of the lungs as well as structural and functional remodelling of the airways. It is characterised by recurrent attacks of breathlessness and wheezing with varying degrees of frequency and severity, which is caused by swelling of the bronchial tubes resulting in airflow limitation (WHO 2011). Although the causes of asthma are not completely understood, risk factors are known to include inhaling asthma triggers such as allergens, tobacco smoke and chemical irritants. Asthma is incurable and the prevalence is increasing, particularly in children and young adults ( aref22157151 ), however appropriate management can control the disorder and enable people to enjoy a high quality of life (WHO 2011).
https://doi.org/10.1002/14651858.CD001116.pub4
asthma a chronic respiratory condition affecting million people globally aref causes inflammation of the lungs as well as structural and functional remodelling of the airways
Topic 346 (0.8149): asthma, copd, allergic, airway, disease, fev, ige, respiratory, lung, symptoms
Topic 78 (0.0689): pressure, lung, pulmonary, respiratory, gas, lungs, ventilation, volume, breathing, alveolar
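The slide does not state how the source sentence was turned into the normalised context string. One set of steps that reproduces the string shown above is lowercasing, dropping digits, replacing punctuation with spaces and collapsing whitespace. The sketch below assumes exactly those steps; the real pipeline may differ, for example in how it treats hyphens or non-ASCII letters.

```python
import re


def normalise(text):
    """Lowercase, drop digits, replace non-letters with spaces, collapse spaces."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)         # "300" and the digits of "aref15080825" go
    text = re.sub(r"[^a-z\s]", " ", text)   # punctuation becomes a space
    return " ".join(text.split())


sentence = (
    "Asthma, a chronic respiratory condition affecting 300 million people "
    "globally ( aref15080825 ), causes inflammation of the lungs as well as "
    "structural and functional remodelling of the airways."
)
normalised = normalise(sentence)
print(normalised)
```

Note that stripping digits reduces every reference marker to the bare token "aref", so the marker positions have to be recorded before this step if they are needed later.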
39.
[Pie chart: Academic 27%; Academically trained 11%; Other Professional 23%; Media 38%; Policy/decision maker 1%]
9,647 plain text biographies from Twitter profiles
classified using a rule-based method: 30% matched as professionals:
Class                           Keyword example
Science student                 student, studying,
Graduated                       MS, MA, graduate
University faculty              lectur, prof., professor
Other scientist                 technician, lab manager, -ologist
Education and outreach          curator, teacher, librarian
Applied science organization    nonprofit, philantropy
Other professional              recruiter, entrepreneur, manager
Media professional              journalis, publisher
Policy/decision maker           congressman, senator, parliament
Ekström, B. (2019): Developing a rule-based method for identifying researchers on Twitter: The case of vaccine discussions
Poster accepted to ISSI, 17th International Society of Scientometrics and Informetrics Conference, Rome, 2-5 September.
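A rule-based classification of profile biographies along these lines can be sketched with a few regular expressions. This is an illustration only, not the method from the poster: it uses a subset of the classes and keyword stems from the table, and the matching details (word-prefix matching, first matching rule wins, the order of the rules) are assumptions.

```python
import re

# Subset of the classes and keyword stems from the table. Stems are matched
# at the start of a word; "-ologist" is matched as a word ending. The rule
# order is an assumption: "lab manager" must be tested before "manager".
RULES = [
    ("University faculty", [r"\blectur", r"\bprof\.", r"\bprofessor"]),
    ("Science student", [r"\bstudent", r"\bstudying"]),
    ("Other scientist", [r"\btechnician", r"\blab manager", r"ologist\b"]),
    ("Media professional", [r"\bjournalis", r"\bpublisher"]),
    ("Policy/decision maker", [r"\bcongressman", r"\bsenator", r"\bparliament"]),
    ("Other professional", [r"\brecruiter", r"\bentrepreneur", r"\bmanager"]),
]
COMPILED = [
    (label, [re.compile(pattern, re.IGNORECASE) for pattern in patterns])
    for label, patterns in RULES
]


def classify_bio(bio):
    """Return the first class whose keyword patterns match the biography."""
    for label, patterns in COMPILED:
        if any(pattern.search(bio) for pattern in patterns):
            return label
    return "Unmatched"


for bio in ["Professor of immunology", "Science journalist", "Dog lover"]:
    print(bio, "->", classify_bio(bio))
```

Short abbreviations such as "MS" and "MA" are left out here because they would need case-sensitive whole-word matching to avoid spurious hits inside ordinary words.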
40. How can we use Twitter-bio personas?
- Retweet data
41. How can we use Twitter-bio personas?
Conversation data
?
42.
43. Input / Throughput
• EU contributions and Total cost of projects
• Project publication sources:
• PubMed
• Other
47.
Thank you for your attention!
Visit our website:
www.data4impact.eu
Follow us on Twitter and SlideShare:
@Data4Impact
48. Ongoing work
• Eklund, J. (2018). The importance of scientific references in their contexts. Poster presented at the 23rd Nordic Workshop on
Bibliometrics and Research Policy, Borås, 7-9 November.
• Eklund, J., Lorentzen, D.G., & Nelhans, G. (2019). MESH classification of clinical guidelines using conceptual embeddings of
references. In Proceedings of ISSI, the 17th International Conference on Scientometrics & Informetrics, Sept. 2-5, 2019, Rome.
2189-2198.
• Eklund, J., & Nelhans, G. (2017). Topic modelling approaches to aggregated citation data. Presented at the 22nd International
Conference on Science and Technology Indicators, Paris, September 6-8, 2017.
• Lorentzen, D.G. (2017). Is it all about politics? : A hashtag analysis of the activities of the Swedish political Twitter elite. Human IT,
13(3), 115–155. Retrieved from http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-13119
• Lorentzen, D.G. (2018). Discussing research on Twitter : Measuring the conversational impact of scientific publications. Presented at
the 23rd Nordic Workshop on Bibliometrics and Research Policy, Borås, 8-9 November, 2018. Retrieved from
http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-15858
• Lorentzen, D.G., Eklund, J., Ekström, B., & Nelhans, G. (2019). Mapping scientific issues and controversies on Twitter: a method for
investigating conversations mentioning research. In Proceedings of ISSI, the 17th International Conference on Scientometrics &
Informetrics, Sept. 2-5, 2019, Rome. 2189-2198.
• Nelhans, G., and Eklund, J. (2019). MESH classification of clinical guidelines using conceptual embeddings of references. Oral
presentation, accepted at 24th Nordic Workshop on Bibliometrics and Research Policy, Reykjavik, 7-9 November 2019
• Nelhans, G. and Lorentzen, D. (2016). Twitter conversation patterns related to research papers. Information Research, 21(2), Article
SM2.
• Data for impact web page: http://www.data4impact.eu