What if there were a platform where literature, conference abstracts, patents, clinical trials, news, grants and other sources were fully integrated? What if the data were harmonized, enriched with standardized concepts and ready for analysis? After building our patent analytics platform, we didn’t stop dreaming: we built a big data analytics platform by semantically integrating text-rich scientific sources. In my presentation I will talk about what we built and why we built it. And, of course, I will also address the challenges and hurdles along the way. Was it worth it, and what comes next? Let’s talk about it!
2. Agenda
What platform did we build?
What does it look like?
Why did we build it?
Architecture and data enrichment
Challenges
Plans for the future
2 /// AI-SDV 2022 // Integrated Data Platform at Bayer
3. Part 1: What platform did we build?
4. Our platform semantically integrates terabytes of external scientific textual data to support insight generation along the R&D value chain
5. Big data platform
This platform is…
• A semantically integrated and harmonized big data hub containing major external, text-rich, life-science-related data sources
• Enriched with FAIR meta-data generated by extracting the key information (e.g., molecular targets, medical conditions, active ingredients, technologies) using NLP
• An analysis-ready platform for end users (GUI access) and data scientists (API access)
6. The users
Scientific end users
Data scientists
Developers of digital products
7. The users
End-user GUIs: more power & precision for scientific search
Users: project leaders, R&D scientists, tech scouts & co.
Tasks: find relevant information, alerts, analysis, filter & review
Expert APIs: provide structured data for insight generation
Users: data scientists, computational scientists, information professionals, bioinformaticians
Tasks: generate insights, find new targets & treatments, support pipeline decisions, build predictive models
8. Part 2: What does it look like?
9. Example: Liver cancer
Google-like search interface
10. Example: Liver cancer
Interactive analysis and filtering
11. Example: Liver cancer
Result overview
12. Example: Liver cancer
Record view
13. Part 3: Why did we build it?
14. Big Data Platform
6 reasons why building it made and makes sense
Richness of data sources
Flexibility
Costs
Scalability
FAIR meta-data
Full transparency and control
15. Big Data Platform: Why did we build it?
Reason 1: Breadth and richness of data sources
[Figure: scientific sources in our platform vs. platforms limited to publicly available data]
16. Big Data Platform: Why did we build it?
Reason 2: Maximum flexibility to analyze the data and to integrate it into our Bayer data ecosystem
Existing platforms often come with limited or pre-defined analysis options and limited integrability
17. Big Data Platform: Why did we build it?
Reason 3: Full scalability
Our platform is built on a scalable cloud infrastructure for big data analysis and allows you to analyze millions of records in one go.
18. Big Data Platform: Why did we build it?
Reason 4: Costs
This platform allowed us to save money and reduce complexity by replacing various proprietary legacy platforms
19. Big Data Platform: Why did we build it?
Reason 5: One terminology across the entire content, with the option to adjust it to our needs
Individual sources and platforms typically have their own standards and terminologies
One terminology for the entire platform
20. Big Data Platform: Why did we build it?
Reason 6: Comprehensiveness and quality of meta-data
Since we built on 20 years of thesauri and NLP algorithms optimized for Bayer’s needs, our terminologies cover the real-life use of science much better than established terminologies
[Figure: MeSH]
21. Big Data Platform: Why did we build it?
Reason 6: Comprehensiveness and quality of meta-data
[Figure: proprietary disease thesaurus]
22. Part 4: Architecture & data enrichment
23. Platform architecture
DATA
Conference Abstracts, Literature Abstracts, Literature Fulltexts, Patents, Patent Chemistry, Clinical Trials, Pipeline Information, Market Reports, Company Websites, Industry News, Research Grants, Tech Transfer Offers
PROCESS
Automated Data Acquisition (Kafka Technology)
Data Engineering: Normalization, Deduplication, Classification, etc. (Kafka Streams)
Semantic Enrichment: Targets, Organisms, Sequences, Drugs, Active Ingredients, Companies/Organizations, Analytics, etc.
Index, Search, and API Services (Elastic)
DELIVER
APIs & Data Science
End User Products: Cross-search GUI, Advanced literature GUI, Advanced patent GUI
System/Application Integrations: other proprietary platforms and workflows use this platform as source
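The layering of the architecture (acquire, process, deliver) can be pictured as a chain of small stages. The following is only a minimal in-memory sketch in plain Python: it does not use Kafka or Elastic, and the record fields, the toy lexicon and the concept ID are all hypothetical stand-ins chosen for illustration, not the platform's actual schema.

```python
from typing import Iterable, Iterator

Record = dict  # hypothetical record shape: {"source": ..., "title": ...}

def normalize(records: Iterable[Record]) -> Iterator[Record]:
    """Data engineering step: clean up the source name and collapse whitespace."""
    for rec in records:
        yield {
            **rec,
            "source": rec["source"].strip().lower(),
            "title": " ".join(rec["title"].split()),
        }

def enrich(records: Iterable[Record]) -> Iterator[Record]:
    """Semantic enrichment step: tag records using a toy lexicon."""
    lexicon = {"liver cancer": "DISEASE:liver_cancer"}  # hypothetical concept ID
    for rec in records:
        text = rec["title"].lower()
        concepts = sorted(cid for term, cid in lexicon.items() if term in text)
        yield {**rec, "concepts": concepts}

def build_index(records: Iterable[Record]) -> dict[str, list[Record]]:
    """Delivery step: a toy inverted index from concept ID to records."""
    index: dict[str, list[Record]] = {}
    for rec in records:
        for cid in rec["concepts"]:
            index.setdefault(cid, []).append(rec)
    return index

def run_pipeline(raw: Iterable[Record]) -> dict[str, list[Record]]:
    """Chain the stages; each stage streams records to the next one."""
    return build_index(enrich(normalize(raw)))

raw = [
    {"source": " Patents ", "title": "Treatment of  liver cancer"},
    {"source": "Clinical Trials", "title": "Phase II trial in Liver Cancer"},
    {"source": "Industry News", "title": "Quarterly results"},
]
index = run_pipeline(raw)
```

Because every stage consumes and yields a stream of records, stages can be added or swapped without touching their neighbours, which is roughly the property a streaming backbone provides at scale.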
24. Big Data Platform: Semantic data integration at large
Resolve all flavours of heterogeneity to make textual data FAIR
Technical heterogeneity: data formats (JSON vs. XML), communication protocols (REST vs. ODBC), query languages (SQL vs. SPARQL)
Data model heterogeneity: relational vs. semi-structured, tuples vs. graphs, …
Structural heterogeneity: same facts expressed in different schemata; missing or additional attributes
Syntactic heterogeneity: different presentation of the same fact (Unicode or ASCII, EUR or €, …)
Semantic heterogeneity:
➢ Same concepts are named differently (e.g., lung cancer, pulmonary carcinoma, neoplasm of the lung, …)
➢ Different concepts are named the same (e.g., GSK)
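Syntactic and semantic heterogeneity can be illustrated with a small lookup sketch. The synonym table uses the lung cancer example from the slide. The two readings listed for "GSK" (a company and a protein kinase) are an assumption added here for illustration, since the slide only names the acronym. Function and table names are hypothetical.

```python
import unicodedata

# Same concept, different names (terms taken from the slide).
SYNONYMS = {
    "lung cancer": "lung cancer",
    "pulmonary carcinoma": "lung cancer",
    "neoplasm of the lung": "lung cancer",
}

# Same name, different concepts (the expansions are an illustrative assumption).
AMBIGUOUS = {
    "gsk": ["GlaxoSmithKline (company)", "glycogen synthase kinase (protein)"],
}

# Syntactic variants of the same fact.
CURRENCY = {"€": "EUR"}

def normalize_syntax(text: str) -> str:
    """Unify Unicode forms, currency symbols and whitespace."""
    text = unicodedata.normalize("NFKC", text)
    for symbol, code in CURRENCY.items():
        text = text.replace(symbol, code)
    return " ".join(text.split())

def resolve(term: str) -> list[str]:
    """Return candidate concepts: one if unambiguous, several if ambiguous."""
    key = normalize_syntax(term).lower()
    if key in AMBIGUOUS:
        return list(AMBIGUOUS[key])
    if key in SYNONYMS:
        return [SYNONYMS[key]]
    return []
```

A call such as `resolve("Pulmonary carcinoma")` yields a single concept, while `resolve("GSK")` yields two candidates that need surrounding context to be told apart.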
25. Part 5: Challenges
26. Challenges: Data ingestion
Heterogeneous formats
Heterogeneous update schedules: hourly, daily, weekly, monthly
27. Challenges: Data ingestion
Changes in record structure
Changes in volume over time
28. Challenges: Data ingestion
De-duplication
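One common way to approach de-duplication is to derive a comparison key per record and keep the first record seen for each key. The sketch below is a generic illustration, not the platform's method: the field names (`doi`, `title`, `year`) and the sample values are made up.

```python
import re

def dedup_key(record: dict) -> tuple:
    """Prefer a stable identifier; fall back to normalized title plus year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    # Lower-case the title and collapse punctuation and whitespace.
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    return ("title", title, record.get("year"))

def deduplicate(records: list) -> list:
    """Keep the first record for every key, preserving input order."""
    seen = {}
    for rec in records:
        seen.setdefault(dedup_key(rec), rec)
    return list(seen.values())

records = [
    {"doi": "10.1000/ABC", "title": "Liver cancer study", "year": 2021},
    {"doi": "10.1000/abc ", "title": "Liver Cancer Study.", "year": 2021},
    {"title": "A new target!", "year": 2020},
    {"title": "a New Target", "year": 2020},
    {"title": "A new target", "year": 2022},
]
unique = deduplicate(records)
```

Exact keys like these miss harder cases, for example the same document arriving once with and once without an identifier, which is part of why de-duplication across many sources remains a challenge.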
29. Challenges: Semantic enrichment
Lack of a universally accepted identifier for an entity class
Human gene: NCBI Gene ID
Chemical compound: INN name, IUPAC, CAS-Nr, PubChem CID, Canonical SMILES
Disease: MeSH ID, UMLS ID, Snomed ID, NCIT ID, Orphanet ID, Mondo ID, ICD-10 ID, MedDRA ID, DO ID, …..
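When an entity class has many competing identifier schemes, one option is to map every external identifier to a single internal concept ID and translate through it. The sketch below assumes that approach. The class name, the internal ID and all external identifiers are placeholders, not real codes.

```python
from typing import Optional

class CrossReference:
    """Maps (scheme, external ID) pairs to one internal concept ID."""

    def __init__(self) -> None:
        self._to_internal: dict = {}
        self._by_internal: dict = {}

    def register(self, internal_id: str, scheme: str, external_id: str) -> None:
        key = (scheme.lower(), external_id)
        existing = self._to_internal.get(key)
        if existing is not None and existing != internal_id:
            # The same external ID must not point at two internal concepts.
            raise ValueError(f"{key} is already mapped to {existing}")
        self._to_internal[key] = internal_id
        self._by_internal.setdefault(internal_id, {})[scheme.lower()] = external_id

    def resolve(self, scheme: str, external_id: str) -> Optional[str]:
        """Return the internal concept ID, or None if the ID is unknown."""
        return self._to_internal.get((scheme.lower(), external_id))

    def translate(self, scheme: str, external_id: str, target: str) -> Optional[str]:
        """Translate an ID from one scheme to another via the internal ID."""
        internal = self.resolve(scheme, external_id)
        if internal is None:
            return None
        return self._by_internal[internal].get(target.lower())

xref = CrossReference()
xref.register("DISEASE-0001", "MeSH", "mesh-placeholder-1")    # placeholder IDs
xref.register("DISEASE-0001", "ICD-10", "icd-placeholder-1")
```

With these registrations, `xref.translate("MeSH", "mesh-placeholder-1", "ICD-10")` returns the ICD-10 placeholder. In practice such mappings are rarely one-to-one, so a real table would have to handle broader and narrower matches as well.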
30. Challenges: Semantic enrichment
Identifying different entities requires different technologies:
➢ Terminology-based NLP (e.g., disease names)
➢ ML-based NLP (e.g., for ambiguous acronyms such as cell lines and gene acronyms)
➢ Rule/pattern-based extraction (e.g., IUPAC chemical names, gene mutations)
“A lamp-snp assay detecting c580y mutation in pfkelch13 gene from clinically dried blood spot samples”
➢ Image/graph processing (e.g., image2mol)
C1=CC=C(C(=C1)CC(=O)[O-])NC2=C(C=CC=C2Cl)Cl.[Na+]
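Rule/pattern-based extraction can be illustrated with the mutation in the quoted title. The sketch below uses a single regular expression for amino-acid substitutions written in one-letter code (reference residue, position, new residue). It is an assumed, simplified pattern, not the platform's rule set, and it will also match look-alike tokens such as a standalone "A1C", so real rules need context checks.

```python
import re

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Reference residue, position, substituted residue, as a standalone token.
SUBSTITUTION = re.compile(
    rf"\b([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])\b",
    re.IGNORECASE,
)

def extract_mutations(text: str) -> list:
    """Return substitutions found in the text, normalized to upper case."""
    return [
        f"{ref.upper()}{pos}{alt.upper()}"
        for ref, pos, alt in SUBSTITUTION.findall(text)
    ]

title = (
    "A lamp-snp assay detecting c580y mutation in pfkelch13 gene "
    "from clinically dried blood spot samples"
)
mutations = extract_mutations(title)  # ["C580Y"]
```

The lower-case "c580y" in the title is normalized to "C580Y", while "pfkelch13" is left alone because its trailing digits are not enclosed by two residue letters as a separate token.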
31. Part 6: Status quo & plans for the future
32. Are we now living in a fairytale where everything is perfect?
33. Are we now living in a fairytale where everything is perfect?
There is still a lot to do…
➢ Terminology is constantly evolving (new companies, new technologies, etc.)
➢ Development of scalable algorithms for complex entities
➢ Finding the most relevant information in the ocean of data
➢ Advanced visualization and analytics
➢ Further standardization
➢ …..
34. What can you do to help us in our endeavour?
Vendors / Publishers / Database producers
• Data quality
• FAIRification
• Using generally available standards & IDs
• Consistency
• Collecting scattered data
• Harmonization
35. Big Data Platform: Plans for the future
SOURCES (e.g., drug labels, guidelines)
USABILITY
THESAURI
AUTOMATIZATION (e.g., alerting)
CHEMISTRY features
ANALYSES