Describes the API concept in engineering and why it is useful, particularly for working with genomics data. Closes with an analogy between APIs and synthetic biology, and with how evolution allows encapsulation.
2. 2
The API, or how to make your
computational collaborators love you
Uri Laserson | @laserson | laserson@cloudera.com
21 May 2014
3. 3
The API, or how to make your
computational collaborators love you,
and also some perspectives on
engineering biology and immunology
Uri Laserson | @laserson | laserson@cloudera.com
21 May 2014
5. NCBI Sequence Read Archive (SRA)
Today…
1.14 petabytes
One year ago…
609 terabytes
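As a back-of-the-envelope check on those two figures, the sketch below estimates the implied doubling time. It assumes steady exponential growth over the year and decimal units (1 PB = 1000 TB); neither assumption comes from the slide.

```python
import math

# Figures from the slide; decimal units (1 PB = 1000 TB) are an assumption.
size_last_year_tb = 609.0
size_today_tb = 1.14 * 1000

# Fold growth over one year, and the doubling time that rate would imply
# if it stayed constant (constant exponential growth is an assumption).
growth_factor = size_today_tb / size_last_year_tb
doubling_time_years = math.log(2) / math.log(growth_factor)

print(f"growth over one year: {growth_factor:.2f}x")
print(f"implied doubling time: {doubling_time_years * 12:.1f} months")
```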
6. For every “-ome” there’s a “-seq”
Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq
7. Crappy academic code
counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try: counts_dict[chain.junction] += 1
    except KeyError: counts_dict[chain.junction] = 1
for count in counts_dict.itervalues():
    print >>outhandle, np.int_(count)
8. Crappy academic code
counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try: counts_dict[chain.junction] += 1
    except KeyError: counts_dict[chain.junction] = 1
for count in counts_dict.itervalues():
    print >>outhandle, np.int_(count)
vs.
SELECT count(*) FROM antibodies GROUP BY junction
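To make the contrast concrete, here is a minimal sketch that runs both styles on the same data and checks that they agree. The junction strings are made up, the in-memory SQLite table stands in for whatever store actually holds the antibodies, and the query also selects `junction` (the slide's query returns only the counts) so the two results can be compared.

```python
import sqlite3
from collections import Counter

# Hypothetical junction (CDR3) sequences standing in for parsed records.
junctions = ["CARDYW", "CARGYW", "CARDYW", "CAKDYW", "CARDYW", "CARGYW"]

# Imperative version: the counting logic lives in our own code.
loop_counts = Counter(junctions)

# Declarative version: describe the result and let the engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE antibodies (junction TEXT)")
conn.executemany("INSERT INTO antibodies (junction) VALUES (?)",
                 [(j,) for j in junctions])
rows = conn.execute(
    "SELECT junction, count(*) FROM antibodies GROUP BY junction"
).fetchall()
sql_counts = dict(rows)
conn.close()

print(sorted(sql_counts.items()))
```

The query text would not need to change if the table lived in a different SQL engine; only the connection would.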
10. What is an API?
• Application Programming Interface
• Contract (between machines)
• Specifications for:
1. Procedures and methods
2. Data structures/messages
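One way to picture the "contract" is an abstract interface with interchangeable implementations. The sketch below is illustrative only: the class names and the `add`/`size` methods are assumptions chosen for the example, with one array-backed and one linked-list-backed implementation.

```python
from abc import ABC, abstractmethod


class ReadCollection(ABC):
    """The contract: callers rely only on these two methods."""

    @abstractmethod
    def add(self, read: str) -> None: ...

    @abstractmethod
    def size(self) -> int: ...


class ArrayReadCollection(ReadCollection):
    """Backed by a Python list (array-like)."""

    def __init__(self) -> None:
        self._items = []

    def add(self, read: str) -> None:
        self._items.append(read)

    def size(self) -> int:
        return len(self._items)


class LinkedReadCollection(ReadCollection):
    """Backed by a singly linked list of (value, next) pairs."""

    def __init__(self) -> None:
        self._head = None
        self._count = 0

    def add(self, read: str) -> None:
        self._head = (read, self._head)  # prepend a new node
        self._count += 1

    def size(self) -> int:
        return self._count


def load(collection: ReadCollection, reads) -> int:
    """Client code written against the contract, not an implementation."""
    for read in reads:
        collection.add(read)
    return collection.size()


print(load(ArrayReadCollection(), ["ACGT", "TTGA"]))
print(load(LinkedReadCollection(), ["ACGT", "TTGA"]))
```

`load` never learns which implementation it was handed, which is the point: the implementation can change without touching the caller.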
23. IMGT does not have an API
import urllib2
import ClientForm

def __initVQUESTform(self):
    # get form
    request = urllib2.Request(
        'http://imgt.cines.fr/IMGT_vquest/vquest?livret=0&Option=humanIg')
    response = urllib2.urlopen(request)
    forms = ClientForm.ParseResponse(response,
        form_parser_class=ClientForm.XHTMLCompatibleFormParser,
        backwards_compat=False)
    response.close()
    form = forms[0]
    # fill out base part of form - Synthesis view with no extra options - TEXT
    form['l01p01c03'] = ['inline']
    form['l01p01c07'] = ['2. Synthesis']
    form['l01p01c05'] = ['TEXT'] # may need to be 'TEXT'
    form['l01p01c09'] = ['60']
    form['l01p01c35'] = ['F+ORF+ in-frame P']
    form['l01p01c36'] = ['0']
    form['l01p01c40'] = ['1'] # ['1'] for searching with indels
    form['l01p01c25'] = ['default']
    ...
27. Flask/Bottle web server example
from bottle import route  # the bare @route decorator is Bottle's; Flask uses @app.route

@route("/receptor/<id>")
def lookup_receptor(id):
    # get the raw read
    pass

@route("/sample/<sample_id>")
def sample_summary(sample_id):
    # impl for getting sample information; can return:
    # * summary of repertoire information
    #   (num reads, VDJ distribution, etc.)
    # * demographic info
    pass

@route("/sample/<sample_id>/common_junctions")
def common_junctions(sample_id):
    # impl for getting the most common CDR3s
    pass
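To show what one of these endpoints could look like once filled in, here is a minimal sketch of the `common_junctions` route using only the standard library's WSGI support rather than Flask or Bottle. The in-memory `SAMPLES` data, the JSON response shape, and the `call` helper are all assumptions made for illustration.

```python
import json
from collections import Counter
from wsgiref.util import setup_testing_defaults

# Hypothetical in-memory data: sample ID -> junction (CDR3) sequences.
SAMPLES = {
    "s1": ["CARDYW", "CARGYW", "CARDYW", "CAKDYW"],
}


def common_junctions(sample_id, limit=2):
    """Return the most common junctions for one sample as plain dicts."""
    counts = Counter(SAMPLES[sample_id])
    return [{"junction": j, "count": n} for j, n in counts.most_common(limit)]


def app(environ, start_response):
    """WSGI app serving /sample/<sample_id>/common_junctions as JSON."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    if (len(parts) == 3 and parts[0] == "sample"
            and parts[2] == "common_junctions" and parts[1] in SAMPLES):
        status, payload = "200 OK", common_junctions(parts[1])
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call(path):
    """Invoke the app in-process, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


status, data = call("/sample/s1/common_junctions")
print(status, data)
```

To serve it over HTTP, the same `app` object can be passed to `wsgiref.simple_server.make_server`. Clients then depend only on the URL and the JSON shape, not on how the counts are computed.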
28. Genomics ETL has converged on standards
biochemistry → .fastq → short read alignment → .bam → genotype calling → .vcf → analysis
29. VCF
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,spe
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHR POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 rs605 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:
20 1110696 rs604 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.6 GT:GQ:DP:HQ 1|2:21:6:
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP 0|0:54:7:56,
30. What about immune data?
Genomics: biochemistry → .fastq → short read alignment → .bam → genotype calling → .vcf → analysis
Immune data: biochemistry → .fastq → immune receptor alignment → .??? → analysis
31. Multiple models for same types: VDJFasta
sub new {
    my ($class) = @_;
    my $self = {};
    $self->{filename} = "";
    $self->{headers} = [];
    $self->{sequence} = [];
    $self->{germline} = [];
    $self->{nseqs} = 0;
    $self->{mids} = {};
    $self->{accVsegQstart} = {}; # example: 124
    $self->{accVsegQend} = {}; # example: 417
    $self->{accJsegQstart} = {};
    $self->{accJsegQend} = {};
    $self->{accDsegQstart} = {};
38. Binary formats
• Protobuf, Thrift, or Avro
• Flexible data model
• All common primitive types (e.g. int, double, string)
• Support nested types, including arrays and maps
• Efficient binary encoding
• Code generation for many languages (binary compatible)
• Support for schema evolution
• Support IDL for data types and services
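As a rough illustration of "support for schema evolution", the sketch below writes two versions of a hypothetical record schema in Avro's JSON schema syntax and resolves an old record against the new schema. The `ReceptorRead` schema and its fields are invented for the example, and `read_with_schema` is a hand-rolled stand-in for the idea, not the Avro library's own resolution code.

```python
import json

# Version 1 of a hypothetical record schema, written in Avro's JSON syntax.
SCHEMA_V1 = {
    "type": "record",
    "name": "ReceptorRead",
    "fields": [
        {"name": "read_id", "type": "string"},
        {"name": "junction", "type": "string"},
    ],
}

# Version 2 adds an optional field with a default, so old data stays readable.
SCHEMA_V2 = {
    "type": "record",
    "name": "ReceptorRead",
    "fields": SCHEMA_V1["fields"] + [
        {"name": "v_gene", "type": ["null", "string"], "default": None},
    ],
}


def read_with_schema(record, reader_schema):
    """Keep the reader's fields, filling in defaults for fields the
    writer did not know about (illustration only, not the Avro library)."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"missing field with no default: {name}")
    return resolved


old_record = {"read_id": "r1", "junction": "CARDYW"}  # written under v1
upgraded = read_with_schema(old_record, SCHEMA_V2)
print(json.dumps(upgraded))
```

Because the new field carries a default, data written before the change can still be read after it, without rewriting files.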
46. Predict composability of genetic elements
Kosuri, PNAS 110: 14024 (2013)
• 114 promoters x 111 RBS
“…rather than relying on prediction or standardization,
we can screen synthetic libraries for desired behavior.”
66. Conclusions
• The API perspective helps organize and communicate data
• Use sane file formats if possible:
• JSON for lightweight work
• Thrift/Avro for heavyweight serialization/communication
• Decouple data modeling from implementation details
• Biological engineering: what abstractions are available?
• Evolution as nature’s encapsulator
Jake asked for computational tools and for biology; try to give you some of both.
“Industry talk” – devoid of original content
Used to be like you (researcher).
Immune repertoire sequencing with George Church.
Learned data management technology at Cloudera.
Share some insights.
Log scale.
Any assay that can be encoded in DNA is now high-throughput
People working on this data aren’t always aware of the best tools out there.
Custom script; who knows how it’s managed.
Doesn’t take advantage of possible optimizations in the data. Done manually.
Processing custom file format that no one else can do. Custom parser.
No support for automatically splitting the data and parallelizing.
Have to run it on a machine with access to the file system
Declarative description of what I want.
Abstracted away underlying store. Could be:
Table in distributed file system like Hadoop
Distributed In-memory data structure
SQLite file on local disk
MySQL
Postgres
Can be sent to remote cluster.
Multiple possible implementations of size()
Multiple add methods
Array vs. linked list
GOTO SITE
GOTO SITE
Love to see APIs for:
Accessing Ab/TCR sequences
Accessing germline sequences
Accessing immune locus information
Accessing primer sequences
Intersecting primer sequences with sequence databases
Accessing MHC sequences
MHC nomenclature conversion
Immunogenetics ontology definition and service
Immune receptor alignment and numbering
Immune receptor numbering conversion service
Immune receptor phylogeny
Immune receptor structure predictions
Accessing epitope database
In principle, they’ve done the work to support this type of stuff.
But…
Some people have proposed solving this problem
Accessing V-QUEST
Horrible documentation. Required a bit of reverse engineering.
Yelled at me for doing so.
Ideally the community would define the common set of endpoints that a user might expect.
Separately, genomics has converged.
Have to parse FORMAT before you can parse the actual genotype calls
FORMAT/INFO fields customizable
VCF records are dynamically typed. Classification as a SNP, Indel, Mixed, etc. depends on the properties of the alleles in the record.
Entries for particular CHROM must be in a single block. Position must be sorted. Makes it hard to add variants.
Number of rows is bounded by the length of the genome. But the records should scale with the part of the data that actually grows, which is genotype calls. Difficult to add new samples.
Text format. Relatively poor compression. Verbose. Must be parsed. Slower.
Often Gzip-compressed – non-splittable.
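The point about FORMAT can be seen in a small parser. This is a minimal sketch, not a full VCF reader: `parse_vcf_record` is a hypothetical helper, the record is made up, and all values stay strings because proper typing would require reading the `##INFO` and `##FORMAT` header lines first.

```python
def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed columns, INFO, and per-sample calls."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = cols[:9]

    # INFO is a ';'-separated list of key=value pairs or bare flags.
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True

    # FORMAT names the ':'-separated subfields of every sample column,
    # so it has to be read before any genotype call can be interpreted.
    keys = fmt.split(":")
    calls = {name: dict(zip(keys, col.split(":")))
             for name, col in zip(sample_names, cols[9:])}

    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt.split(","), "qual": qual, "filter": flt,
            "info": info_dict, "calls": calls}


# Made-up record for illustration (tab-separated, one sample).
line = "\t".join(["20", "14370", "rs1", "G", "A", "29", "PASS",
                  "NS=3;DP=14;AF=0.5;DB", "GT:GQ:DP", "0|0:48:1"])
record = parse_vcf_record(line, ["SAMPLE1"])
print(record["calls"]["SAMPLE1"])
```

Every consumer of the file ends up reimplementing some version of this, which is the cost of a text format whose record layout is defined inside the file itself.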
VCF already better than the immune situation
One model of an aligned read
One model of an aligned read
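As a sketch of what a single shared model might look like, the dataclass below defines one hypothetical aligned immune receptor read and round-trips it through JSON. The field names, the 0-based coordinate convention, and the example values are assumptions for illustration; the two V-segment offsets simply echo the example comments in the Perl constructor.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AlignedReceptorRead:
    """One hypothetical shared model of an aligned immune receptor read."""
    read_id: str
    sequence: str
    v_gene: Optional[str] = None
    d_gene: Optional[str] = None
    j_gene: Optional[str] = None
    v_query_start: Optional[int] = None  # 0-based offsets (an assumption)
    v_query_end: Optional[int] = None
    junction: Optional[str] = None


read = AlignedReceptorRead(
    read_id="read-0001",
    sequence="N" * 450,   # placeholder, not a real receptor sequence
    v_gene="IGHV3-23",
    j_gene="IGHJ4",
    v_query_start=124,
    v_query_end=417,
)

# JSON is enough for lightweight exchange between tools and languages.
encoded = json.dumps(asdict(read))
decoded = AlignedReceptorRead(**json.loads(encoded))
print(decoded.v_gene, decoded.v_query_start, decoded.v_query_end)
```

With one agreed record type, a parser, an aligner, and an analysis script can each be replaced without the others noticing.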
GOTO SITE
Complain that binary is harder to read/process, but Avro/Thrift make that easy.
Enumeration of failure modes
Generation of diversity for the internal implementation.
Simple input and output signals.
The right kind of diversity matters.
Viruses depend on API compatibility in order to infect
Matrix of possible Ab-Ag interactions.
But not currently possible to get both at the same time.
We chose to get only the antibody information, with little functional information.
GENETIC approach
Alternatively, and cleverly, go for the other half. This way, the functionality is still useful.
Joined a project with Steve Elledge, led by Ben Larman to discover autoantigens.
Tile all human ORFs with peptides.
Synthesize peptides and clone into phage.
Carl June chimeric receptors?
Checkpoint blockade?
Steroids/immunosuppressants?