Open PHACTS April 2017 Science webinar Workflow tools
1. Workflow tools for Life Science
Research
Apr 2017
nick@openphactsfoundation.org
2. This webinar is being
recorded and will be uploaded
to Slideshare etc afterwards
@Open_PHACTS
LinkedIn Group
RSS & Newsletter
3. Agenda
Introduction to Common Workflow Language (CWL) - Michael Crusoe
Accessing Open PHACTS with KNIME nodes to support Life Science business questions - James Lumley, Eli Lilly & Company
Pipeline Pilot workflows with Open PHACTS examples - Jean-Marc Neefs, Janssen
Panel discussion on where next with workflows and supporting Life Science research
4. Our speakers & panel
Michael Crusoe, Common Workflow Language co-founder
James Lumley, Informatics, Eli Lilly & Company
Jean-Marc Neefs, Janssen
Panel:
– Michael Crusoe, James Lumley, Jean-Marc Neefs
– Derek Marren, Eli Lilly
– Daniela Digles, University of Vienna
– Andrei Caracoti, Biovia
5. Workflow Examples
The Application of the Open Pharmacological Concepts Triple Store (Open PHACTS) to Support Drug Discovery Research
PLoS ONE 2014 DOI: 10.1371/journal.pone.0115460
Drug discovery FAQs: workflows for answering multidomain drug discovery questions
Drug Discovery Today 2015 DOI: 10.1016/j.drudis.2014.11.006
Open PHACTS computational protocols for in silico target validation of cellular phenotypic screens: knowing the knowns
Med. Chem. Commun. 2016 DOI: 10.1039/c6md00065g
Selectivity profiling of BCRP versus P-gp inhibition: from automated collection of polypharmacology data to multi-label learning
J Cheminform 2016 DOI: 10.1186/s13321-016-0121-y
7. https://goo.gl/Aujxza
Why use a workflow management system?
Features can include:
● separation of concerns: focus on the science being
done first; then optimize execution later
● automatic job execution: start a complicated
analysis involving many pieces with a single command
● scaling (across nodes, clusters, and possibly
continents)
● automatically generated graphical user interfaces
(example: Galaxy)
● How was this file made? (automatic provenance
tracking)
12. Why have a standard?
● Standards create a surface for collaboration that promotes innovation
● Researchers frequently dip in and out of different systems, but interoperability is not a basic feature
● Funders, journals, and other sources of
incentives prefer standards over proprietary or
single-source approaches
13. Common Workflow Language v1.0
● Common format for bioinformatics (and more!) tool
& workflow execution
● Community based standards effort, not a specific
software package; Very extensible
● Defined with a schema, specification, & test
suite
● Designed for shared-nothing clusters, academic
clusters, cloud environments, and local execution
● Supports the use of containers (e.g. Docker) and
shared research computing clusters with locally
installed software
15. Why use the Common Workflow Language?
Develop your pipeline on your local computer (optionally with containers)
Execute on your research cluster or in the cloud
Deliver to users via workbenches like Arvados, Rabix, and Toil. Support in Galaxy, Apache Taverna, AWE, and Funnel (GCP) is in alpha stage.
16. CWL Design principles
● Low barrier to entry for implementers
● Support tooling such as generators, GUIs, converters
● Allow extensions, but must be well marked
● Be part of linked data ecosystem
● Be pragmatic
17. Linked Data & CWL
● Hyperlinks are common currency
● Bring your own RDF ontologies for metadata
● Supports SPARQL to query
Example: can use the EDAM ontology (ELIXIR-DK) to
specify file formats and reason about them:
“FASTQ Sanger” encoding is a type of FASTQ file
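The format reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not how any CWL implementation does it: the `PARENTS` table is a hand-written stand-in for the subclass links an ontology such as EDAM would supply, and the labels in it are illustrative names rather than real EDAM identifiers.

```python
# Hand-written stand-in for ontology subclass links (child -> parent).
# These labels are illustrative; a real tool would load them from EDAM.
PARENTS = {
    "FASTQ Sanger": "FASTQ",
    "FASTQ Illumina": "FASTQ",
    "FASTQ": "Sequence record format",
}


def is_a(fmt, expected):
    """Return True if fmt equals expected or is a (transitive) subtype of it."""
    current = fmt
    while current is not None:
        if current == expected:
            return True
        current = PARENTS.get(current)  # None once we run out of parents
    return False


# A tool that accepts any FASTQ file also accepts a FASTQ Sanger file,
# but a tool that needs FASTQ Sanger cannot take an arbitrary FASTQ file.
accepts_sanger = is_a("FASTQ Sanger", "FASTQ")
accepts_generic = is_a("FASTQ", "FASTQ Sanger")
print(accepts_sanger, accepts_generic)
```

The check walks up the parent chain, so a more specific format satisfies a request for a more general one but not the reverse.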
18. Use Cases for the CWL standards
Publication reproducibility, reusability
Workflow creation & improvement across institutions
and continents
Contests & challenges
Analysis on non-public data sets, possibly using GA4GH
job & workflow submission API
19. Early Adopters
(US) National Cancer Institute Cloud Pilots (Seven
Bridges Genomics, Institute for Systems Biology)
Cincinnati Children’s Hospital Medical Research Center
(Andrey Kartashov & Artem Barski)
bcbio: Validated, scalable, community developed
variant calling, RNA-seq and small RNA analysis (docs,
BOSC 2016 talk: video, slides) (Brad Chapman et al.)
Duke University, Center for Genomic and Computational
Biology: GENOMICS OF GENE REGULATION project (BOSC
2016 talk: video, slides, poster)(Dan Leehr et al.)
NCI DREAM SMC-RNA Challenge (Kyle Ellrott et al.)
Presentation
20. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Sample Real World CWL Workflow
Courtesy US NIH NCI Genomic Data Commons, visualization from
https://view.commonwl.org/workflows/github.com/NCI-GDC/gdc-dnaseq-cwl/tree/master/workflows/dnaseq/transform.cwl
21. Announcing: v1.0!
http://www.commonwl.org/v1.0/
Authors:
Peter Amstutz, Arvados Project, Curoverse
Michael R. Crusoe, Common Workflow Language project
Nebojša Tijanić, Seven Bridges Genomics
Contributors:
Brad Chapman, Harvard Chan School of Public Health
John Chilton, Galaxy Project, Pennsylvania State University
Michael Heuer, UC Berkeley AMPLab
Andrey Kartashov, Cincinnati Children's Hospital
Dan Leehr, Duke University
Hervé Ménager, Institut Pasteur
Maya Nedeljkovich, Seven Bridges Genomics
Matt Scales, Institute of Cancer Research, London
Stian Soiland-Reyes, University of Manchester
Luka Stojanovic, Seven Bridges Genomics
22. How did we do it?
Initial group started at BOSC Codefest 2014
Moved to open mailing list and extended onto GitHub &
then Gitter chat
Frequent (twice a month or more) video chats to work
through design issues with summaries emailed
Some participants doing CWL community work during
their day jobs, some on “nights & weekends”.
In October 2015 Seven Bridges sponsored one of the
co-founders (M. Crusoe) to work full time on the
project
23. Community Based Standards development
Different model than traditional nation-based or
regulatory approach
We adopted the Open-Stand.org Modern Paradigm for
Standards: Cooperation, Adherence to Principles (Due
process, Broad consensus, Transparency, Balance,
Openness), Collective Empowerment, (Free)
Availability, Voluntary Adoption
24. Challenges
Giving a standard to a community that is “free as in
puppies”: How does the community participate? How will
maintenance be funded?
CWL isn’t the only effort that has these needs; can we
join with related efforts?
25. A Grand Opportunity
if:
properly funded and embraced by the wider community
then:
the researchobject.org standards + CWL could fulfill the huge need for an executable and complete description of how computationally derived research results were made
26. What’s next for the Common Workflow Language?
Public charity to own the standard
Tooling improvements
More implementations (Galaxy, Taverna, Kepler, Xenon,
…?)
Integration with researchobject.org standards for
attribution, provenance, and metadata guidance.
28. Michael R. Crusoe, who is this guy?
Phoenix, Arizona (Sonoran Desert), USA
Studied at Arizona State University: Computer Science;
time in industry as a developer & system administrator
(Google, others); returned to academia to study
Microbiology.
Introduced to bioinformatics via Anolis (lizard)
genome assembly and analysis (Kenro Kusumi, Arizona
State University)
Returned to software engineering as a Research
Software Engineer for k-h-mer project (C. Titus Brown,
Michigan State University, then U. of California,
Davis)
29. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Example: samtools-sort.cwl
# File type & metadata
class: CommandLineTool
cwlVersion: v1.0
doc: Sort by chromosomal coordinates
# Input parameters
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572  # BAM binary alignment format
    inputBinding:
      position: 1
# Output parameters
outputs:
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572
# Executable
baseCommand: [samtools, sort]
# Runtime environment
hints:
  DockerRequirement:
    dockerPull: quay.io/cancercollaboratory/dockstore-tool-samtools-sort
# Linked data support
$namespaces: { edam: "http://edamontology.org/" }
$schemas: [ "http://edamontology.org/EDAM_1.15.owl" ]
30. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
File type & metadata
class: CommandLineTool
cwlVersion: v1.0
doc: Sort by chromosomal coordinates
● Identify as a CommandLineTool object
● Core spec includes simple comments
● Metadata about tool extensible to arbitrary RDF
vocabularies, e.g.
○ Biotools & EDAM
○ Dublin Core Terms (DCT)
○ Description of a Project (DOAP)
● GA4GH Tool Registry project will develop best
practices for metadata & attribution
31. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Runtime Environment
hints:
  DockerRequirement:
    dockerPull: quay.io/[...]samtools-sort
● Define the execution environment of the tool
● “requirements” must be fulfilled or an error
● “hints” are soft requirements (express preference but not an error if not satisfied)
● Also used to enable optional CWL features
○ Mechanism for defining extensions
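The difference between “requirements” and “hints” can be sketched as follows. This is a simplified illustration under stated assumptions, not the logic of any particular CWL runner: `check_environment` and the `supported` set are hypothetical names introduced here.

```python
import warnings


def check_environment(requirements, hints, supported):
    """Raise if a requirement is unsupported; only warn for an unmet hint."""
    for name in requirements:
        if name not in supported:
            # An unmet requirement is an error: the tool cannot run.
            raise RuntimeError(f"unsupported requirement: {name}")
    unmet = [name for name in hints if name not in supported]
    for name in unmet:
        # An unmet hint is only a lost preference: execution continues.
        warnings.warn(f"hint not satisfied, continuing: {name}")
    return unmet


# Hypothetical runner that can scatter but has no Docker available.
supported = {"ScatterFeatureRequirement"}
unmet = check_environment(
    requirements=["ScatterFeatureRequirement"],
    hints=["DockerRequirement"],
    supported=supported,
)
print(unmet)
```

Moving `DockerRequirement` from `hints` to `requirements` in this sketch would turn the warning into an error on the same runner.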
32. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Input parameters
● Specify name & type of input parameters
○ Based on the Apache Avro type system
○ null, boolean, int, string, float, array, record
○ File formats can be IANA Media/MIME types, or from domain-specific ontologies, like EDAM for bioinformatics
● “inputBinding”: describes how to turn a parameter value into an actual command line argument
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572  # BAM binary format
    inputBinding:
      position: 1
34. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Command Line Building
● Associate input values with parameters
● Apply input bindings to generate strings
● Sort by “position”
● Prefix “base command”
Tool description
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572
    inputBinding:
      position: 1
baseCommand: [samtools, sort]
Input object
aligned_sequences:
  class: File
  location: example.bam
  format: http://edamontology.org/format_2572
Resulting command line
["samtools", "sort", "example.bam"]
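The four steps above can be sketched in Python. This is a simplified illustration, assuming only positional bindings and using a File's `location` as its path; real CWL bindings also cover prefixes, separators, and more. The function name `build_command` is introduced here for illustration.

```python
def build_command(base_command, inputs, input_object):
    """Build an argv list from input bindings, ordered by 'position'."""
    bound = []
    for name, spec in inputs.items():
        binding = spec.get("inputBinding")
        if binding is None or name not in input_object:
            continue  # parameter is not placed on the command line
        value = input_object[name]
        # File values are objects; this sketch uses 'location' as the path.
        text = value["location"] if isinstance(value, dict) else str(value)
        bound.append((binding.get("position", 0), text))
    bound.sort(key=lambda item: item[0])  # stable sort keeps ties in order
    # Prefix the base command to the sorted argument strings.
    return list(base_command) + [text for _, text in bound]


inputs = {
    "aligned_sequences": {
        "type": "File",
        "format": "edam:format_2572",
        "inputBinding": {"position": 1},
    }
}
input_object = {
    "aligned_sequences": {
        "class": "File",
        "location": "example.bam",
        "format": "http://edamontology.org/format_2572",
    }
}
argv = build_command(["samtools", "sort"], inputs, input_object)
print(argv)  # prints ['samtools', 'sort', 'example.bam']
```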
35. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Output parameters
outputs:
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572
● Specify name & type of output parameters
● In this example, capture the STDOUT stream from “samtools sort” and tag it as being BAM formatted.
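Capturing standard output and tagging it with a format can be sketched as below. This is a minimal illustration rather than a CWL runner: `run_and_capture` is a hypothetical helper, and a small Python one-liner stands in for `samtools sort` so the sketch runs without samtools installed.

```python
import os
import subprocess
import sys
import tempfile


def run_and_capture(argv, out_path, fmt):
    """Run argv, write its standard output to out_path, and describe the file."""
    with open(out_path, "wb") as handle:
        subprocess.run(argv, stdout=handle, check=True)
    # The returned object mirrors the File description used for inputs.
    return {"class": "File", "location": out_path, "format": fmt}


out_path = os.path.join(tempfile.mkdtemp(), "sorted.bam")
result = run_and_capture(
    [sys.executable, "-c", "print('sorted records')"],  # stand-in command
    out_path,
    "http://edamontology.org/format_2572",
)
print(result["format"])
```

The format tag is metadata attached by the caller; nothing in this sketch checks that the captured bytes really are BAM.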
36. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Workflows
● Specify data dependencies between steps
● Scatter/gather on steps
● Can nest workflows in steps
● Still working on:
○ Conditionals & looping
38. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Example: grep & count
class: Workflow
cwlVersion: v1.0
inputs:
  pattern: string
  infiles: File[]
outputs:
  outfile:
    type: File
    outputSource: wc/outfile  # Connect output of “wc” to workflow output
requirements:
  - class: ScatterFeatureRequirement
steps:
  grep:
    run: grep.cwl  # Tool to run
    in:
      pattern: pattern
      infile: infiles
    scatter: infile  # Scatter over input array
    out: [outfile]
  wc:
    run: wc.cwl
    in:
      infiles: grep/outfile  # Connect output of “grep” to input of “wc”
    out: [outfile]
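The data flow of this workflow can be mimicked in plain Python to show what scatter and gather mean. This is only an analogy under simplifying assumptions: files are modelled as lists of lines, and the `grep` and `wc` functions are stand-ins for the tools that grep.cwl and wc.cwl would wrap.

```python
def grep(pattern, lines):
    """Stand-in for grep.cwl: keep the lines that contain the pattern."""
    return [line for line in lines if pattern in line]


def wc(list_of_line_lists):
    """Stand-in for wc.cwl: count lines across all gathered inputs."""
    return sum(len(lines) for lines in list_of_line_lists)


def grep_and_count(pattern, infiles):
    # Scatter: run grep once per element of the input array.
    scattered = [grep(pattern, lines) for lines in infiles]
    # Gather: the list of grep outputs becomes the single input of wc.
    return wc(scattered)


infiles = [
    ["BRCA1 variant", "TP53 variant", "no match here"],
    ["variant of unknown significance"],
]
total = grep_and_count("variant", infiles)
print(total)  # prints 3
```

Because each scattered grep call depends only on its own input file, a workflow engine is free to run them in parallel; the list comprehension here runs them one after another.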
39. Accessing the
Open PHACTS Linked Data API
with KNIME
James A. Lumley
Research IT, Eli Lilly
April 2017
40. The KNIME Analytics Platform
Open source platform for data analytics. Over 1000 modules (or nodes) to connect to all major data
sources; support for many data types inc. XML/JSON/Images/Docs/Chemical Formats; Math and Stats
functions, Predictive modelling and machine learning; Tool blending for Python/R/Weka/SQL/Java;
Interactive data views and reporting. “a toolbox for any data scientist”.
https://www.knime.org/knime-analytics-platform
41. OPS-KNIME Nodes
♦ 2016 (VU Amsterdam)*
• Original nodes and workflows by Ronald Siebes, VU Amsterdam
• OPS_Swagger and OPS_JSON nodes used to create and execute the parameterized API calls, as well as transforming the output to a tabular form
♦ Q2 2017 (Eli Lilly)
• Update of Erl Wood KNIME Nodes will add a new OPS node developed internally at Eli Lilly with input from OPS
– KNIME Node: Luke Bullard
– Team input: James Lumley / Derek Marren (Lilly); Daniela Digles / Nick Lynch (OPS); Randy Kerber (d2discovery)
– Workflows: James Lumley
• Single node allows the user to select the call of interest and return both JSON and tabular results
• Focus of development: updating to the new API, improving usability
• Further iterations possible once feedback is received
* http://www.openphactsfoundation.org/wp/wp-content/uploads/2016/02/2016-02-25_Creating-workflows-for-drug-discovery-with-Open-PHACTS-and-KNIME.pdf
42. OPS & Erl Wood Community Nodes
♦ View based on internal beta of Lilly open source Erl Wood nodes due for release Q2 2017
♦ Community Erlwood Nodes
Open PHACTS
♦ Open PHACTS sub-folder
contains single OPS Linked
Data API node that will allow a
configured call/return
43. Configuring the OPS Linked Data API node
♦ Preferences panel allows client/workflow
level control of API URL Endpoint and API
Id/Key, avoiding the need to configure
these in the node
44. Using the OPS Linked Data API node
App Id and App Key fields are
automatically populated if they
are set in the preferences
Drop down ‘Select Method Type’
allows selection of API call
45. Using the OPS Linked Data API node
Input port is optional. Toggle
on input field allows user string
input or selection of input table
column
First output port returns
formatted data table
(corresponding to API param
“_format=tsv”)
46. Using the OPS Linked Data API node
Drop down ‘Select Method
Type’ allows selection of API
call
Logically grouped methods match developer API docs (swagger) at https://dev.openphacts.org/docs/2.1
47. Allows formatted results table or full
JSON/XML return for debug/analysis
First output port returns
formatted data table
(corresponding to API
param “_format=tsv”)
Second output port is
optional and if
requested, will return
JSON or XML response
(via second API call
without _format param)
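The two-port behaviour described above amounts to issuing the same call twice, once with `_format=tsv` and once without. A minimal sketch of the URL construction follows. It makes no network request; the base URL, method path, credentials, and compound URI are placeholders chosen for illustration, and the `app_id` and `app_key` parameter names are assumed from the App Id and App Key fields shown on the node.

```python
from urllib.parse import urlencode


def build_calls(base_url, method_path, params, want_raw=False):
    """Return the URL(s) for one method: tabular first, raw response second."""
    tabular = dict(params, _format="tsv")
    urls = [f"{base_url}{method_path}?{urlencode(tabular)}"]
    if want_raw:
        # Second call omits _format, so the API returns its default response.
        urls.append(f"{base_url}{method_path}?{urlencode(params)}")
    return urls


params = {
    "app_id": "YOUR_APP_ID",    # placeholder credential
    "app_key": "YOUR_APP_KEY",  # placeholder credential
    "uri": "http://example.org/compound/1",  # placeholder concept URI
}
# Placeholder endpoint and path, not a confirmed production URL.
urls = build_calls("https://api.example.org/2.1", "/compound", params, want_raw=True)
for url in urls:
    print(url)
```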
49. User input and example return
Raw Tabular Return:
Pivoted to show Column Names and Values:
50. User input and example return
Optional JSON Output as raw JSON Object
51. User input and example return
Rather than parsing the JSON to
understand the raw output, the node also
has an attached ‘View’ with a hierarchically
formatted tree view of the JSON output:
52. User input and example return
Generic JSON Extraction to
flat table shows additional
data returned from API,
deeper JSON processing
can be done using KNIME
JSON nodes
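Generic extraction of a nested JSON response into a flat table can be sketched as below. This is a plain Python illustration, not the KNIME node's implementation, and the sample response is made up for the example rather than copied from the API.

```python
import json


def flatten(value, path=()):
    """Flatten nested dicts and lists into {dotted.path: scalar} pairs."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, path + (str(key),)))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, path + (str(index),)))
    else:
        flat[".".join(path)] = value  # scalar leaf becomes one table cell
    return flat


# Made-up response used only to exercise the function.
raw = '{"result": {"label": "Aspirin", "matches": [{"weight": 180.16}]}}'
row = flatten(json.loads(raw))
for column, cell in row.items():
    print(column, cell)
```

Each leaf becomes one column named by its path, which loses the nesting but makes every returned value visible at a glance.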
53. JSON/XML Support in KNIME 3.3
Extensive native support for JSON or XML parsing with KNIME 3.3 allows
complete/custom parsing of the return JSON object for full debugging
54. Chemistry Support on input SMI
Input columns of differing
chemical types are
automatically converted to
SMILES via Marvin if the API
param is SMILES based
55. API Timeouts and URL changes
Advanced developers can
change the API timeout value or
edit the API URL on a single
node using the Web Service
panel
56. Summary
1. A new KNIME 3.3 compatible “OpenPHACTS Linked Data API” node will be released in Q2 2017
2. Designed for users, it provides easy configuration of API settings and parameters with easy-to-use tabular data return (via the API _format parameter)
3. Designed for developers, it allows an additional full JSON/XML response that the expert user can view or parse to see the raw response
4. Further example workflows will be released once the node is available
59. List compounds active on target X
Open PHACTS + Pipeline Pilot Workflow:
1. Search target information
• [OPS API call ‘Free Text to Concept’]
2. Get active compounds on that target
• [OPS API call ‘Target Pharmacology: List’]
63. Find compounds against Alzheimer’s targets
Open PHACTS + Pipeline Pilot Workflow:
1. Search for disease
• [OPS API call ‘Free Text to Concept’]
2. Search target information
• [OPS API call ‘Targets for Disease: List’]
3. Get active compounds on that target
• [OPS API call ‘Target Pharmacology: List’]
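The three-step chain above can be sketched outside Pipeline Pilot as follows. This is an illustration of the control flow only: `call_api` is a hypothetical function supplied by the caller, the record shapes (`uri`, `compound`) are assumptions, and the canned data replaces real API responses so that nothing here depends on the network.

```python
def compounds_for_disease(disease_text, call_api):
    """Chain the three steps; call_api(method, value) returns a list of dicts."""
    # 1. Free text -> disease concept (take the top hit).
    disease = call_api("Free Text to Concept", disease_text)[0]
    # 2. Disease URI -> associated targets.
    targets = call_api("Targets for Disease: List", disease["uri"])
    # 3. Each target URI -> pharmacology records for that target.
    compounds = []
    for target in targets:
        compounds.extend(call_api("Target Pharmacology: List", target["uri"]))
    return compounds


# Canned responses standing in for the API; identifiers are invented.
CANNED = {
    ("Free Text to Concept", "Alzheimer"): [{"uri": "disease:1"}],
    ("Targets for Disease: List", "disease:1"): [
        {"uri": "target:A"},
        {"uri": "target:B"},
    ],
    ("Target Pharmacology: List", "target:A"): [{"compound": "cmpd-1"}],
    ("Target Pharmacology: List", "target:B"): [
        {"compound": "cmpd-2"},
        {"compound": "cmpd-3"},
    ],
}


def fake_api(method, value):
    return CANNED[(method, value)]


names = [r["compound"] for r in compounds_for_disease("Alzheimer", fake_api)]
print(names)  # prints ['cmpd-1', 'cmpd-2', 'cmpd-3']
```

Passing the API caller in as a parameter keeps the chaining logic testable with canned data and lets a real HTTP client be swapped in later.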