APIs and Synthetic Biology

The API, or how to make your
computational collaborators love you,
and also some perspectives on
engineering biology and immunology
Uri Laserson | @laserson | laserson@cloudera.com
21 May 2014

NCBI Sequence Read Archive (SRA)
Today: 1.14 petabytes
One year ago: 609 terabytes

For every “-ome” there’s a “-seq”
Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq

Crappy academic code
counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try: counts_dict[chain.junction] += 1
    except KeyError: counts_dict[chain.junction] = 1
for count in counts_dict.itervalues():
    print >>outhandle, np.int_(count)

Crappy academic code
counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try: counts_dict[chain.junction] += 1
    except KeyError: counts_dict[chain.junction] = 1
for count in counts_dict.itervalues():
    print >>outhandle, np.int_(count)
vs.
SELECT count(*) FROM antibodies GROUP BY junction
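The contrast can be made concrete with a small, self-contained sketch. It assumes Python 3 and the standard-library sqlite3 module. The junction strings and the in-memory antibodies table are made-up stand-ins for parsed reads, not data from the talk.

```python
import sqlite3
from collections import Counter

# Hypothetical junction (CDR3) strings standing in for parsed reads.
junctions = ["CARDY", "CARDY", "CAKGG", "CARDY", "CAKGG", "CTTWF"]

# Imperative version, tidied: Counter replaces the try/except bookkeeping.
counts = Counter(junctions)

# Declarative version: load the same rows into a table and let SQL group them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE antibodies (junction TEXT)")
conn.executemany(
    "INSERT INTO antibodies (junction) VALUES (?)",
    [(j,) for j in junctions],
)
rows = conn.execute(
    "SELECT junction, count(*) FROM antibodies GROUP BY junction"
).fetchall()
sql_counts = dict(rows)
conn.close()
```

Both routes give the same counts. The difference is that the SQL version states what is wanted and leaves the how to the database.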

What is an API?
• Application Programming Interface
• Contract (between machines)
• Specifications for:
1. Procedures and methods
2. Data structures/messages
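As a rough illustration of the two halves of that contract, the sketch below separates a message type (the data structure) from an abstract service (the procedures). All names here (ImmuneChainMessage, RepertoireService, InMemoryRepertoire) are hypothetical and chosen only for this example.

```python
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ImmuneChainMessage:
    """Data structure exchanged across the interface (hypothetical fields)."""
    name: str
    v: str
    j: str
    junction: str


class RepertoireService(ABC):
    """Procedures a caller may rely on, independent of the implementation."""

    @abstractmethod
    def add_chain(self, chain: ImmuneChainMessage) -> None:
        """Store one annotated chain."""

    @abstractmethod
    def junction_counts(self) -> Dict[str, int]:
        """Return the number of chains observed per junction."""


class InMemoryRepertoire(RepertoireService):
    """One possible implementation; callers only see RepertoireService."""

    def __init__(self) -> None:
        self._chains: List[ImmuneChainMessage] = []

    def add_chain(self, chain: ImmuneChainMessage) -> None:
        self._chains.append(chain)

    def junction_counts(self) -> Dict[str, int]:
        return dict(Counter(c.junction for c in self._chains))


repertoire = InMemoryRepertoire()
repertoire.add_chain(ImmuneChainMessage("read1", "IGHV3-43*01", "IGHJ4*02", "TGTGCAAAAGAT"))
repertoire.add_chain(ImmuneChainMessage("read2", "IGHV3-43*01", "IGHJ4*02", "TGTGCAAAAGAT"))
repertoire.add_chain(ImmuneChainMessage("read3", "IGHV4-39*02", "IGHJ6*03", "GCGAGGGGCCGA"))
counts = repertoire.junction_counts()
```

A database-backed class could replace InMemoryRepertoire without any change to calling code, as long as it honors the same two methods.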

Java API
public interface List<E> {
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    boolean add(E e);
    void add(int index, E element);
    boolean remove(Object o);
}

Python DB API v2.0 (PEP 249)
http://legacy.python.org/dev/peps/pep-0249/
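PEP 249 specifies the connection and cursor calls that database drivers share. A minimal sketch follows, using the standard-library sqlite3 driver. The antibodies table, its columns, and the fetch_chain helper are invented for illustration.

```python
import sqlite3


def fetch_chain(conn, name):
    """Look up one chain by read name using only PEP 249 calls."""
    cur = conn.cursor()
    try:
        # "?" is sqlite3's placeholder style ("qmark"); other drivers may use another.
        cur.execute("SELECT name, v, j FROM antibodies WHERE name = ?", (name,))
        return cur.fetchone()
    finally:
        cur.close()


# Invented table and rows, held in memory for the demo.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE antibodies (name TEXT, v TEXT, j TEXT)")
cur.executemany(
    "INSERT INTO antibodies (name, v, j) VALUES (?, ?, ?)",
    [("read1", "IGHV3-43*01", "IGHJ4*02"), ("read2", "IGHV4-39*02", "IGHJ6*03")],
)
conn.commit()
cur.close()

hit = fetch_chain(conn, "read2")
miss = fetch_chain(conn, "read9")
conn.close()
```

Apart from the placeholder style, fetch_chain touches nothing specific to SQLite. That is the point of a shared API.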

Why use an API?
• Encapsulation/interfaces/abstraction
• Loose-coupling of components
• Reusable services
• Service-oriented architecture

Linked-In’s Loose Coupling Architecture

Stitching APIs together: IFTTT (If This Then That)
https://ifttt.com/recipes#popular

IMGT “Spec”
http://www.imgt.org/IMGTScientificChart/

IMGT’s API is an FTP site

IMGT does not have an API
import urllib2     # Python 2 standard library
import ClientForm  # third-party HTML form parser

def __initVQUESTform(self):
    # get form
    request = urllib2.Request(
        'http://imgt.cines.fr/IMGT_vquest/vquest?livret=0&Option=humanIg')
    response = urllib2.urlopen(request)
    forms = ClientForm.ParseResponse(response,
        form_parser_class=ClientForm.XHTMLCompatibleFormParser,
        backwards_compat=False)
    response.close()
    form = forms[0]
    # fill out base part of form - Synthesis view with no extra options - TEXT
    form['l01p01c03'] = ['inline']
    form['l01p01c07'] = ['2. Synthesis']
    form['l01p01c05'] = ['TEXT'] # may need to be 'TEXT'
    form['l01p01c09'] = ['60']
    form['l01p01c35'] = ['F+ORF+ in-frame P']
    form['l01p01c36'] = ['0']
    form['l01p01c40'] = ['1'] # ['1'] for searching with indels
    form['l01p01c25'] = ['default']
    ...

Haussler and genomics services

Flask/Bottle web server example
# Bottle-style routing; Flask would use @app.route on an app object
from bottle import route

@route("/receptor/<id>")
def lookup_receptor(id):
    # get the raw read
    raise NotImplementedError

@route("/sample/<sample_id>")
def sample_summary(sample_id):
    # impl for getting sample information; can return:
    # * summary of repertoire information
    #   (num reads, VDJ distribution, etc.)
    # * demographic info
    raise NotImplementedError

@route("/sample/<sample_id>/common_junctions")
def common_junctions(sample_id):
    # impl for getting the most common CDR3s
    raise NotImplementedError
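One way the body of the last route might be filled in is sketched below, with the web framework left out so that only the standard library is needed. The SAMPLES store, the sample ID, and the common_junctions_payload helper are assumptions made for this example. A real handler would read from wherever the repertoire data lives.

```python
import json
from collections import Counter

# Hypothetical in-memory store: sample ID -> junction (CDR3) strings of its reads.
SAMPLES = {
    "donor17517": ["CARDY", "CAKGG", "CARDY", "CTTWF", "CARDY", "CAKGG"],
}


def common_junctions_payload(sample_id, n=2):
    """Build the JSON body for the n most common junctions of one sample."""
    counts = Counter(SAMPLES[sample_id])
    top = [{"junction": j, "count": c} for j, c in counts.most_common(n)]
    return json.dumps({"sample_id": sample_id, "common_junctions": top})


body = common_junctions_payload("donor17517")
```

A route function such as common_junctions(sample_id) could return this string directly. Clients then depend only on the URL and the JSON shape, not on how the counts are computed.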

Genomics ETL has converged on standards
biochemistry => .fastq => short read alignment => .bam => genotype calling => .vcf => analysis

VCF
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,spe
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHR POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 rs605 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:
20 1110696 rs604 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.6 GT:GQ:DP:HQ 1|2:21:6:
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP 0|0:54:7:56,
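Because the format is specified, a reader for it can be short. The sketch below parses the fixed columns of one data line. It assumes tab-delimited input, as the VCF specification requires (the tabs appear as spaces in the listing above). It does not handle missing values such as a "." in QUAL. The function names are illustrative only.

```python
def parse_info(info):
    """Split a VCF INFO column into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_record(line):
    """Parse the eight fixed, tab-delimited columns of one VCF data line."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": float(qual),  # sketch only: a missing QUAL (".") would fail here
        "filter": flt,
        "info": parse_info(info),  # values are kept as strings
    }


# First data line from the example above, cut to its fixed columns.
line = "20\t14370\trs605\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
record = parse_record(line)
```

For real work an established VCF library is the safer choice. The sketch only shows how much a shared specification simplifies the consumer side.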

What about immune data?
biochemistry => .fastq => short read alignment => .bam => genotype calling => .vcf => analysis
immune receptor alignment => .???

Multiple models for same types: VDJFasta
sub new {
    my ($class) = @_;
    my $self = {};
    $self->{filename} = "";
    $self->{headers} = [];
    $self->{sequence} = [];
    $self->{germline} = [];
    $self->{nseqs} = 0;
    $self->{mids} = {};
    $self->{accVsegQstart} = {}; # example: 124
    $self->{accVsegQend} = {}; # example: 417
    $self->{accJsegQstart} = {};
    $self->{accJsegQend} = {};
    $self->{accDsegQstart} = {};

Multiple models for same types: vdj
# assuming SeqRecord is Biopython's class
from Bio.SeqRecord import SeqRecord

class ImmuneChain(SeqRecord):
    def cdr3(self):
        return len(self.junction)
    def num_mutations(self):
        aln = self.letter_annotations['alignment']
        return aln.count('S') + aln.count('I')
    def v(self):
        return (self.__getattribute__('V-REGION')
                    .qualifiers['allele'][0])
    def v_seq(self):
        return (self.__getattribute__('V-REGION')
                    .extract(self.seq.tostring()))

Interoperability/services depend on
being able to communicate data

CSV
9 CCTG_PRCONS=IGHC1_R1_IGM unproductive Homsap IGHV5-51*01 F, or Homsap IGHV5-51*0
12 GGGG_PRCONS=IGHC3_R1_IGA productive Homsap IGHV3-11*01 F Homsap IGHJ1*01 F
13 CTTC_PRCONS=IGHC5_R1_IGG unproductive Homsap IGHV1-2*02 F Homsap IGHJ5*02 F
18 ACTT_PRCONS=IGHC3_R1_IGA productive Homsap IGKV3-15*01 F, or Homsap IGKV3D-15*
20 GGAC_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-61*02 F Homsap IGHJ4*02 F
25 TCGT_PRCONS=IGHC2_R1_IGD productive Homsap IGHV3-23*01 F, or Homsap IGHV3-23*0
26 GGTG_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-34*01 F, or Homsap IGHV4-34*0
28 GTGA_PRCONS=IGHC5_R1_IGG productive Homsap IGHV1-46*01 F, or Homsap IGHV1-46*0
31 ACCC_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-9*01 F, or Homsap IGHV3-9*02
36 GCAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-9*01 F, or Homsap IGHV3-9*02
39 GCAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-7*01 F Homsap IGHJ6*02 F
40 GGGT_PRCONS=IGHC1_R1_IGM productive Homsap IGHV4-34*01 F, or Homsap IGHV4-34*0
42 TAGG_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-39*01 F, or Homsap IGHV4-39*0
47 CAAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-15*01 F, or Homsap IGHV3-15*0
48 AGAA_PRCONS=IGHC5_R1_IGG unproductive Homsap IGHV3-30*04 F, or Homsap IGHV3-30-3
52 GCAG_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-23*01 F, or Homsap IGHV3-23*0
53 AACC_PRCONS=IGHC3_R1_IGA productive Homsap IGHV3-30*02 F Homsap IGHJ4*02 F

XML
<ImmuneChain>
  <c>IGHD</c>
  <barcode>RL014</barcode>
  <j_start_idx>389</j_start_idx>
  <seq>TTGTGGCTATTTTAAA ... CTCGGACT</seq>
  <descr>003699_0091_0140</descr>
  <tag>coding</tag>
  <clone>IGHV3-43_IGHJ4|387</clone>
  <j>IGHJ4*02</j>
  <v_end_idx>314</v_end_idx>
  <v>IGHV3-43*01</v>
  <junction>TGTGCAAAAGATAATCT ... TCTTTGACTACTGG</junction>
  <d>IGHD5-24*01</d>
</ImmuneChain>

JSON
{
  "v": "IGHV4-39*02",
  "seq": "CCTATCCCCCTGTGTGCCTT ... CTCCACCAAG",
  "num_mutations": 43,
  "name": "HG2DXMN01CY8UH",
  "letter_annotations": {
    "alignment": "..............S....S....3333333333333333..
  },
  "junction_nt": "GCGAGGGGCCGATGGGACTTTTATTACATGGACGTC",
  "j": "IGHJ6*03",
  "annotations": {
    "usearch_90_cluster": "6277",
    "experiment_date": "20120119",
    "donor": "17517",
    "sample_type": "memory_B_cells",
    "source": "SeqWright",
    "tags": ["revcomp", "coding"],
    "taxonomy": []
  },
  "d": "IGHD3-10*01",
http://www.json.org/
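The XML and JSON records describe the same kind of object, so moving between them is mostly mechanical. The sketch below uses only the standard library and an abbreviated copy of the XML record, with the sequence fields left out. Treating j_start_idx and v_end_idx as integers is an assumption, and chain_from_xml is a name invented for this example.

```python
import json
import xml.etree.ElementTree as ET

# Abbreviated copy of the XML record shown earlier; sequence fields are omitted.
XML_RECORD = """
<ImmuneChain>
  <c>IGHD</c>
  <barcode>RL014</barcode>
  <j_start_idx>389</j_start_idx>
  <tag>coding</tag>
  <j>IGHJ4*02</j>
  <v_end_idx>314</v_end_idx>
  <v>IGHV3-43*01</v>
  <d>IGHD5-24*01</d>
</ImmuneChain>
"""

INT_FIELDS = {"j_start_idx", "v_end_idx"}  # assumed to be integer offsets


def chain_from_xml(text):
    """Turn one <ImmuneChain> element into a plain dict."""
    root = ET.fromstring(text.strip())
    record = {}
    for child in root:
        value = child.text
        record[child.tag] = int(value) if child.tag in INT_FIELDS else value
    return record


chain = chain_from_xml(XML_RECORD)
as_json = json.dumps(chain, sort_keys=True)
round_tripped = json.loads(as_json)
```

XML carries every value as text, so the integer fields have to be declared out of band (INT_FIELDS here). JSON keeps the number/string distinction in the document itself.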

JSON
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308

Binary formats
• Protobuf, Thrift, or Avro
• Flexible data model
• All common primitive types (e.g. int, double, string)
• Support nested types, including arrays and maps
• Efficient binary encoding
• Code generation for many languages (binary compatible)
• Support for schema evolution
• Support IDL for data types and services
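To give a feel for the "efficient binary encoding" point, the sketch below packs four integer fields with the standard-library struct module and compares the size with the JSON text. This is not the wire format of Protobuf, Thrift, or Avro, which use their own encodings. The field list and layout are assumptions chosen for the demo.

```python
import json
import struct

# Hypothetical schema, agreed on out of band (the role an IDL plays):
# four little-endian 32-bit integers, in this order.
FIELDS = ("donor", "v_end_idx", "j_start_idx", "num_mutations")
LAYOUT = struct.Struct("<4i")

record = {"donor": 17517, "v_end_idx": 314, "j_start_idx": 389, "num_mutations": 43}

# Binary: only the values are stored; names and types live in the schema.
binary = LAYOUT.pack(*(record[name] for name in FIELDS))

# Text: every record repeats the field names and spells numbers as digits.
text = json.dumps(record).encode("utf-8")

decoded = dict(zip(FIELDS, LAYOUT.unpack(binary)))
```

The packed form is 16 bytes regardless of field names, while the JSON form grows with them. The cost is that a reader without the schema cannot interpret the bytes, which is why these systems pair the encoding with an IDL.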

Thrift example: Twitter
service Twitter {
    void ping();
    bool postTweet(1:Tweet tweet);
    TweetSearchResult searchTweets(1:string query);
}
struct Tweet {
    1: required i32 userId;
    2: required string userName;
    3: required string text;
    4: optional Location loc;
    16: optional string language = "english"
}

Thrift example: Immune receptor
cd ~/repos/kiwi
thrift --gen java kiwi-format/src/main/resources/thrift/kiwi.thrift
thrift --gen py:new_style kiwi-format/src/main/resources/thrift/kiwi.thrift
See: https://github.com/laserson/kiwi

Biological parts specifications
• Library of parts with well-characterized input-output characteristics
• In total, similar to API spec
Canton, Nat. Biotech. 26: 787 (2008)

Engineering signaling pathways at inputs/outputs
Lim, Nat. Rev. Mol. Cell Biol. 11: 393 (2010)

Bottom-up genetic circuit design
Brophy, Nature Meth. 11: 508 (2014)

Predict composability of genetic elements
Kosuri, PNAS 110: 14024 (2013)
• 114 promoters x 111 RBS
“…rather than relying on prediction or standardization,
we can screen synthetic libraries for desired behavior.”

ZFN => TALEN => CRISPR/Cas
ZFN: least addressable, most expensive to create
CRISPR/Cas: most addressable, cheapest to create

Addressability for precision nanoscale engineering
Douglas, NAR 37: 5001 (2009)

Addressability for precision nanoscale engineering
Douglas, Nature 459: 414 (2009)

Evolution for encapsulation: an evolved electronic thermometer
http://www.genetic-programming.com/hc/thermometer.html

Lycopene synthesis optimization
Wang, Nature 460: 894 (2009)

Evolutionary encapsulation for signaling pathway engineering
Peisajovich, Science 328: 368 (2010)

Genetic isolation with Re.coli
Lajoie, Science 342: 357 (2013)

So far, we discussed antibody-only data analysis

Larman, Nat. Biotech. 29: 535 (2011)
Ben Larman
Steve Elledge
Agilent OLS array

Phage immunoprecipitation sequencing (PhIP-seq)

PhIP-seq proof-of-principle
[Figure: Patient A Replica 1 vs. Patient A Replica 2, axes in log10(-log10 P-value); labels: SAPK4, NOVA1, TGIF2LX]

‘Immunization without vaccination’

Encapsulation for cancer immunotherapy through TMG processing
Tran, Science 344: 641 (2014)

Conclusions
• The API perspective helps organize and communicate data
• Use sane file formats if possible:
  • JSON for lightweight work
  • Thrift/Avro for heavyweight serialization/communication
• Decouple data modeling from implementation details
• Biological engineering: what abstractions are available?
• Evolution as nature’s encapsulator
