A light look at the world of Big Data for the lay person: a couple of examples, and what we do at MedChemica to speed up drug discovery. First presented at Macclesfield SciBar, and then at Knutsford SciBar.
1. MedChemica
BigData
‘What is that ALL about?’
Al Dossetter
al.dossetter@medchemica.com
MedChemica Limited
Macclesfield Sci Bar
25th April 2016
2. Big Data – ‘What is that all about?’
• Introduction to Big Data
• Examples from History
• Big Data and science
• MedChemica – advancing drug design
through actionable knowledge
3. About Us
Passionate about generating better decisions from data
Dr Andrew G. Leach
Technical Director
Liverpool John Moores
12 years experience
Applied computational
and medicinal chemistry
Dr Ed Griffen
Technical Director
21 years experience
Medicinal chemistry and
large scale statistical
analysis methods
Dr Al Dossetter
Managing Director
17 years Medicinal chemistry and
extensive cloud computing
experience
Dr Ali Griffen
Business Analyst
PhD Fungal Vascular wilt disease
21 years experience Team leader
bioscientist and biological data
curation
Dr Shane Montague
Lead Data Scientist
PhD Computer Science
13 years experience Data
science and information
security
Dr Jia Wu
Consultant Data Scientist
PhD Machine Learning
12 years experience in data
mining and machine learning.
Projects in finance, energy and
criminology.
4. Best Definition of Big Data
• Any analysis of a data set that is too large to
do by hand
– Requires computational techniques
– Requires statistical techniques
• Yields
– Knowledge
- Knowledge that can be counterintuitive
It got ‘Big’ because:
- the internet made a lot of data available very
quickly (often for free)
It got interesting because:
- Knowledge yields real benefits to the bottom line
- Reduced costs or increased sales
You the consumer benefit….
- Cheaper goods, available on-line
- Flights on time, trains on time, deliveries on time
5. Big Data
“The Revolution that will
change the world we live in”
• Principles of Big Data
– Use ALL of the Data
• however noisy
– Analyse in an unbiased way
– “DO WHAT” it tells you
• Do Not Worry About “WHY”
– KEEP everything
• ‘you never know what question you
want to ask’
6. The 4 Vs
• Picture from Google or someone
• What does it mean?
• Mostly it is about using lots of computers
Most issues are sorted out by more CPUs, more drive space, and better stats
7. It's actually been around quite a while…
• It was genius to break the codes
• It was further genius to collate the data and reduce it so
that analysts could use it in a timely manner (volume /
velocity / veracity)
• ….saved many, many lives on both sides
9. What do Nappies and Beer have in common?
• Analysis of shopping habits found these two things were bought together
• Put them close together in the store and sell more
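The shopping-habit analysis above can be sketched in a few lines. This is a minimal illustration, not the retailer's actual method: the baskets are invented, and the function names `pair_counts` and `lift` are hypothetical. It counts how often two items share a basket, then compares that with what chance alone would predict (a measure usually called lift).

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one fixed order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def lift(baskets, a, b):
    """Lift above 1 means a and b are bought together more often than chance predicts."""
    n = len(baskets)
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    p_ab = sum(a in basket and b in basket for basket in baskets) / n
    return p_ab / (p_a * p_b)


# Made-up baskets, purely for illustration
BASKETS = [
    ["nappies", "beer", "crisps"],
    ["nappies", "beer"],
    ["bread", "milk"],
    ["nappies", "beer", "milk"],
    ["bread", "crisps"],
    ["milk", "bread", "beer"],
]

if __name__ == "__main__":
    top_pair, count = pair_counts(BASKETS).most_common(1)[0]
    print(top_pair, count, round(lift(BASKETS, "nappies", "beer"), 2))
```

In this toy data nappies and beer share three of the six baskets, giving a lift of 1.5. A real analysis would run the same counting over millions of receipts.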
10. UPS delivery service
• Fitted sensors to all delivery
trucks and gathered data
• Analysed data to detect
early engine issues BEFORE
breakdown
• Therefore FIX early and
keep the van on the road
• The customer benefits
because:
• Deliveries on-time
• Even larger dataset – high
degree of prediction on
delivery times
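As a rough sketch of how sensor data might reveal a problem before a breakdown, the example below flags any reading that sits far outside the recent normal range. It is an assumption-laden illustration, not UPS's method: the temperatures are invented, and the function name `early_warnings`, the window of 5 readings, and the threshold of 3 standard deviations are all hypothetical choices.

```python
from statistics import mean, stdev


def early_warnings(readings, window=5, threshold=3.0):
    """Return the indices of readings that lie more than `threshold` standard
    deviations from the mean of the previous `window` readings."""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat histories, where the spread is zero
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


# Hypothetical engine temperatures in degrees C; the values are invented
TEMPERATURES = [90.1, 89.8, 90.3, 90.0, 89.9, 90.2, 90.1, 96.5, 90.0, 90.2]

if __name__ == "__main__":
    print(early_warnings(TEMPERATURES))
```

Here only the sudden jump to 96.5 is flagged. In practice many sensors would be combined, and the flags checked against the vehicles that later broke down.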
11. Jet Engines – reliable service
• Sensors on jet engines – monitored in flight
• Similar to UPS
• Therefore FIX early and keep the planes in the air
• The customer benefits because:
• Flights on time and reliable
12. Google translate
The Unreasonable Effectiveness of Data
“Because of a huge shared cognitive and cultural
context, linguistic expression can be highly ambiguous
and still often be understood correctly.”
• https://en.wikipedia.org/wiki/File:Google_Translate_Icon.png
• https://en.wikipedia.org/wiki/Google_Translate
• https://www.youtube.com/watch?v=yvDCzhbjYWs
• University of British Columbia Distinguished Lecture Series - Sept 23rd 2011
Groups or pairs of words are associated together on
websites around the internet
Statistical analysis of the frequency of pairing
Therefore this word (or group) probably translates into
this word
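The pairing idea can be illustrated with a toy example. This is a sketch under stated assumptions and not how Google Translate is actually built: the four sentence pairs are invented, the function names are hypothetical, and the Dice score used to rank candidates is a simple stand-in for the statistical analysis. For each word it asks which foreign word turns up alongside it most consistently.

```python
from collections import Counter, defaultdict


def build_counts(sentence_pairs):
    """Count, per sentence pair, which words occur and which occur together."""
    source_counts, target_counts = Counter(), Counter()
    pair_counts = defaultdict(Counter)
    for source, target in sentence_pairs:
        source_words = set(source.lower().split())
        target_words = set(target.lower().split())
        source_counts.update(source_words)
        target_counts.update(target_words)
        for s in source_words:
            pair_counts[s].update(target_words)
    return source_counts, target_counts, pair_counts


def probable_translation(word, source_counts, target_counts, pair_counts):
    """Pick the target word with the highest Dice score for `word`.
    Returns None if the word was never seen."""
    best_word, best_score = None, 0.0
    for candidate, together in pair_counts[word].items():
        # Dice score: 1.0 when the two words always appear together
        score = 2 * together / (source_counts[word] + target_counts[candidate])
        if score > best_score:
            best_word, best_score = candidate, score
    return best_word


# Invented English-French sentence pairs, purely for illustration
SENTENCE_PAIRS = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
    ("a dog", "un chien"),
]

if __name__ == "__main__":
    counts = build_counts(SENTENCE_PAIRS)
    for word in ["cat", "dog", "the", "a"]:
        print(word, "->", probable_translation(word, *counts))
```

Dividing by how common each word is matters: without it, frequent words such as "le" would look like a good translation for almost everything.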
13. What about science?
We need to be accurate (don’t we?)
• Large Hadron Collider shows how we can gather a lot
of data very accurately
• A large amount of data is needed to reduce the errors – very, very
big data
14. The Life Science industry has woken up to Big Data
• Human Genome
• Biological systems
• Kinome
• Metabolomics
• Proteomics
• 3D structural information (CDC /
Protein Data Bank)
• Literature and Patents (GVK Bio,
ChEMBL, Pubmed, PubChem)
• Reaction informatics – what works,
what doesn’t
• Document management
• Regulatory submissions
Huge Opportunity in this area
15. What about life sciences?
• Harder and harder to discover drugs.
• They have to work
• They have to be safe
• People want them cheaply
• A description of the drug research and
development process
16.
Company | Ticker | Number of drugs approved | R&D Spending Per Drug ($Mil) | Total R&D Spending 1997-2011 ($Mil)
AstraZeneca | AZN | 5 | 11,790.93 | 58,955
GlaxoSmithKline | GSK | 10 | 8,170.81 | 81,708
Sanofi | SNY | 8 | 7,909.26 | 63,274
Pfizer Inc. | PFE | 14 | 7,727.03 | 108,178
Roche Holding AG | RHHBY | 11 | 7,803.77 | 85,841
Johnson & Johnson | JNJ | 15 | 5,885.65 | 88,285
Eli Lilly & Co. | LLY | 11 | 4,577.04 | 50,347
Abbott Laboratories | ABT | 8 | 4,496.21 | 35,970
Merck & Co Inc | MRK | 16 | 4,209.99 | 67,360
Bristol-Myers Squibb Co. | BMY | 11 | 4,152.26 | 45,675
Novartis AG | NVS | 21 | 3,983.13 | 83,646
Amgen Inc. | AMGN | 9 | 3,692.14 | 33,229
Sources: InnoThink Center For Research In Biomedical Innovation;
Thomson Reuters Fundamentals via FactSet Research Systems
The Truly Staggering Cost Of Inventing New Drugs
Matthew Herper - Forbes
Drug failures later in development are mainly due to EFFICACY and SAFETY
17.
18. Actual spending – taken together, lead optimisation (LO) projects are the biggest spend
Paul, S. M. et al How to improve R&D productivity: the pharmaceutical
industry’s grand challenge, Nat. Rev. Drug Discovery 2010, 9, 203
Snapshot of a medium-sized
company's R&D spend in one
year: $1.7 billion
For a period, large pharma set targets at each stage of the process (an
attrition model): unsuccessful and very wasteful
Better chemistry
Reduce the number
of projects
Chemistry influences success and speed
Methods that really work, new formulations
19. What Causes Attrition in Development?
• PK: 7%
• Lack of efficacy in man: 46%
• Adverse effects in man: 17%
• Animal toxicity: 16%
• Commercial reasons: 7%
• Miscellaneous: 7%
Many compounds fail in development through inadequate
pharmacokinetics / bioavailability and unacceptable
toxicological profiles in addition to lack of efficacy in man
21. Better medicinal chemistry by sharing knowledge not data & structures
[Diagram: Roche, Genentech, AZ, Pharma 4 and Pharma 5 each have their own data, a rule finder and a database; MedChemica holds the Grand Rule Database; each company has the Grand Rule Database for its own exploitation.]
>500 million pairs from companies + 12 million from public data
24. What about clinical safety?
SAFE DRUGS
• ‘Potency’
• Do not sacrifice
• The better it is the lower the dose
• Improved testing in-vivo with fewer animals
• Clinical linkage to protein target
• Can test In-Vivo
• Anti SAR e.g. hERG, Nav1.5, 5-HT2a…
• Analysis of In-Vivo data
• Pfizer – rat data
• <0.2mg/Kg Dose
• Metabolism & Pharmacokinetics
• Better design so dose is lower
Grand Rule Database
Hughes et al, Bioorg Med Chem Lett. 2008, 18(17), 4872
26. The ‘Internet of Things (IoT)’
A greater diversity of devices connected to the internet, with data
flowing to and from them
For example Smart Watches
Lifestyle device – marketed on fitness / wellness
Like UPS vans and RR jet engines, can we detect illness
pre-symptomatically?
27. Big Data – ‘What is that all about?’
• Introduction to Big Data
– Big enough to need a computer / advanced stats
• Examples from History
– Bletchley Park, UPS, Beer and Nappies….
• Big Data and science
– Hadron collider….
• MedChemica – Advancing drug design
through actionable knowledge
– Allows sharing of knowledge to accelerate and
reduce costs of finding new, safe medicines