Awesome Big Data Algorithms




                 http://xkcd.com/1185/
Awesome Big Data
   Algorithms
                C. Titus Brown
                ctb@msu.edu
   Asst Professor, Michigan State University
(Microbiology, Computer Science, and BEACON)
Welcome!
• More of a computational scientist than a
  computer scientist; will be using simulations
  to demo & explore algorithm behavior.


• Send me questions/comments @ctitusbrown,
  or ctb@msu.edu.
“Features”
• I will be using Python rather than C++, because Python
  is easier to read.


• I will be using IPython Notebook to demo.


• I apologize in advance for not covering your favorite
  data structure or algorithm.
Outline
• The basic idea
• Three examples
   – Skip lists (a fast key/value store)
   – HyperLogLog counting (counting distinct elements)
   – Bloom filters and CountMin Sketches

• Folding, spindling, and mutilating DNA sequence
• References and further reading
The basic idea
• Problem: you have a lot of data to count, track, or otherwise
   analyze.
• This data is Data of Unusual Size, i.e. you can’t just brute force the
   analysis.
• For example,
    – Count the approximate number of distinct elements in a very large
      (infinite?) data set
    – Optimize queries by using an efficient but approximate prefilter
    – Determine the frequency distribution of distinct elements in a very
      large data set.
Online and streaming vs. offline
          “Large is hard; infinite is much easier.”
• Offline algorithms analyze an entire data set all at
  once.
• Online algorithms analyze data serially, one piece at a
  time.
• Streaming algorithms are online algorithms that can
  run with very limited memory and compute.
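As a small illustration of the online idea (a sketch, not something from the talk): a running mean can be updated one item at a time without storing earlier items. The function name `running_mean` and the sample values are made up for this example.

```python
def running_mean(stream):
    """Yield the mean of all items seen so far, one item at a time."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        # Incremental update: earlier items never need to be kept around.
        mean += (x - mean) / count
        yield mean

# Works on any iterable, including a generator that never ends.
values = [2, 4, 6, 8]
means = list(running_mean(values))
print(means)
```

An offline version would need the whole list in hand before calling `sum(values) / len(values)`; the online version only ever holds two numbers.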
Exact vs random or probabilistic
• Often an approximate answer is sufficient,
  especially if you can place bounds on how wrong the
  approximation is likely to be.
• Often random algorithms or probabilistic data
  structures can be found with good typical
  behavior but bad worst-case behavior.
For one (stupid) example
You can trim 8 bits off of integers for the purpose of averaging them
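A quick sketch of what this could look like, assuming 32-bit random integers (the data, seed, and variable names are invented for the demo). Dropping the low 8 bits changes each value by at most 255, so the mean is off by at most 255 as well.

```python
import random

rng = random.Random(42)                      # fixed seed, demo data only
values = [rng.randrange(2**32) for _ in range(100_000)]

exact = sum(values) / len(values)

# Drop the low 8 bits of each value, average, then scale back up.
trimmed = [v >> 8 for v in values]
approx = (sum(trimmed) / len(trimmed)) * 256

error = exact - approx                       # bounded by the 8 bits thrown away
print(exact, approx, error)
```

Each trimmed value fits in 24 bits instead of 32, and the answer is still good to a tiny relative error for values this large.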
Skip lists
  A randomly indexed improvement on linked lists.

Each node can belong to one or more vertical “levels”,
which allow fast search/insertion/deletion: typically ~O(log n)!

(Figure credit: Wikipedia)
Skip lists
        A randomly indexed improvement on linked lists.

     Very easy to implement; asymptotically good behavior.




From reddit, “if someone held a gun to my head and asked
me to implement an efficient set/map storage, I would
implement a skip list.”

(Response: “does this happen to you a lot??”)

(Figure credit: Wikipedia)
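To back up the "very easy to implement" claim, here is a minimal skip list sketch with insert and lookup only (no deletion). It is not John Shipman's code or the talk's demo; the class names, `MAX_HEIGHT = 16`, and the coin-flip probability of 1/2 are choices made for this example.

```python
import random

class Node:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.forward = [None] * height   # forward[i] = next node at level i

class SkipList:
    MAX_HEIGHT = 16

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.head = Node(None, None, self.MAX_HEIGHT)   # sentinel, holds no data
        self.height = 1

    def _random_height(self):
        # Flip coins: each extra level is added with probability 1/2.
        h = 1
        while h < self.MAX_HEIGHT and self.rng.random() < 0.5:
            h += 1
        return h

    def _find_predecessors(self, key):
        # For each level, find the last node whose key is < key.
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in reversed(range(self.height)):
            while (node.forward[level] is not None
                   and node.forward[level].key < key):
                node = node.forward[level]
            update[level] = node
        return update

    def insert(self, key, value):
        update = self._find_predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value      # overwrite an existing key
            return
        h = self._random_height()
        self.height = max(self.height, h)
        new = Node(key, value, h)
        for level in range(h):           # splice the node in at each level
            new.forward[level] = update[level].forward[level]
            update[level].forward[level] = new

    def get(self, key, default=None):
        candidate = self._find_predecessors(key)[0].forward[0]
        if candidate is not None and candidate.key == key:
            return candidate.value
        return default

    def keys(self):
        node = self.head.forward[0]      # level 0 is a plain sorted linked list
        while node is not None:
            yield node.key
            node = node.forward[0]

sl = SkipList(seed=1)
for k in [5, 1, 9, 3, 7]:
    sl.insert(k, k * k)
print(list(sl.keys()), sl.get(3), sl.get(4))
```

The only randomness is in `_random_height`; the upper levels act as express lanes, which is where the typical O(log n) search comes from.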
Channel randomness!

• If you can construct or rely on randomness,
  then you can easily get good typical behavior.



• Note, a good hash function is essentially the
  same as a good random number generator…
HyperLogLog cardinality counting

• Suppose you have an incoming stream of
 many, many “objects”.

• And you want to track how many distinct
 items there are, and you want to accumulate
 the count of distinct objects over time.
Relevant digression:
• Flip some unknown number of coins. Q: what is
  something simple to track that will tell you
  roughly how many coins you’ve flipped?


• A: the longest run of heads. Long runs are very rare,
  and the longest run you see grows (roughly logarithmically)
  with how many coins you’ve flipped.
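The coin-flip digression is easy to simulate. This is a sketch in the spirit of the talk's simulations, not the actual notebook; the function name, seed, and trial counts are arbitrary choices.

```python
import random

def longest_heads_run(n_flips, rng):
    """Flip n_flips fair coins and return the longest run of heads."""
    longest = current = 0
    for _ in range(n_flips):
        if rng.random() < 0.5:      # heads
            current += 1
            longest = max(longest, current)
        else:                       # tails resets the run
            current = 0
    return longest

rng = random.Random(0)              # fixed seed so the demo is repeatable
averages = {}
for n in (10, 1_000, 100_000):
    runs = [longest_heads_run(n, rng) for _ in range(20)]
    averages[n] = sum(runs) / len(runs)
    print(n, averages[n])
```

The average longest run should climb by a few heads for every 100-fold increase in flips, so a single small number carries a rough estimate of a very large count.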
Cardinality counting with HyperLogLog

• Essentially, use the longest run of leading 0-bits observed
  in a hash value.
• Use multiple hash functions (or split one hash into many
  buckets) so that you can take the average.
• Take harmonic mean + low/high sampling
  adjustment => result.
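A compact sketch of the idea, under some assumptions: instead of many hash functions it uses one 64-bit hash (the first 8 bytes of SHA-1) and lets the first `b` bits pick one of `m = 2**b` registers, which is the usual HyperLogLog trick. It applies only the small-range correction; with a 64-bit hash the large-range one is left out. The class name and defaults are invented here, and this is not the svpcom/hyperloglog API.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, b=10):
        self.b = b                       # number of index bits
        self.m = 1 << b                  # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = x >> (64 - self.b)                   # first b bits pick a register
        rest = x & ((1 << (64 - self.b)) - 1)        # remaining 64 - b bits
        # rank = number of leading zero bits in `rest`, plus one
        rank = (64 - self.b) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Harmonic mean of 2**register across all registers.
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / harmonic
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"item-{i}")
    hll.add(f"item-{i}")     # duplicates never raise a register
estimate = hll.count()
print(round(estimate))
```

With 1024 registers the typical error is a few percent, and the whole structure is a list of 1024 small integers no matter how long the stream is.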
Bloom filters

• A set membership data structure that is
  probabilistic but only yields false positives.

• Trivial to implement; hash function is main
  cost; extremely memory efficient.
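A minimal Bloom filter sketch to show how little code is involved. The sizes (`n_bits=8192`, `n_hashes=4`) and the trick of salting SHA-1 with a counter to get several hash functions are choices made for this example, not taken from the talk.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=8192, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        data = str(item).encode("utf-8")
        for i in range(self.n_hashes):
            # Salt one hash function with i to simulate several.
            digest = hashlib.sha1(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
members = [f"word-{i}" for i in range(500)]
for w in members:
    bf.add(w)

false_positives = sum((f"other-{i}" in bf) for i in range(1000))
print(all(w in bf for w in members), false_positives)
```

Every added item is always found again; the only errors are a small number of "yes" answers for items that were never added, and the filter here takes 1 KB regardless of item size.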
My research applications
    Biology is fast becoming a data-driven science.




                      http://www.genome.gov/sequencingcosts/
Shotgun sequencing analogy:
  feeding books into a paper shredder,
digitizing the shreds, and reconstructing
                 the book.




   Although for books, we often know the language and not just the alphabet.
Shotgun sequencing is:

• Randomly ordered.
• Randomly sampled.
• Too big to efficiently do multiple passes.
Shotgun sequencing




   “Coverage” is simply the average number of reads that overlap
                    each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top
                       through all of the reads.
Random sampling => deep sampling needed




   Typically 10-100x needed for robust recovery (300 Gbp for human)

     But this data is massively redundant!! With systematic sampling, only 5x would be needed!
              All the stuff above the red line is unnecessary!
Streaming algorithm to do so:
    digital normalization
Digital normalization
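The slides here are figures, so the algorithm itself is not spelled out. As a hedged sketch, assuming digital normalization works roughly like this: for each read, estimate its coverage as the median count of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff. The function names, `k=4`, and `cutoff=5` are made up for a toy example, and an exact `Counter` stands in for a fixed-memory counting structure such as a CountMin sketch.

```python
from collections import Counter
from statistics import median

def kmers(read, k):
    """All overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def normalize(reads, k=4, cutoff=5):
    """Yield only the reads whose estimated coverage is still below cutoff."""
    counts = Counter()   # exact counts; a CountMin sketch would bound memory
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue     # read is shorter than k
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)   # only kept reads add to the coverage
            yield read

common = "ACGTTGCAAG"    # toy read, heavily oversampled
rare = "GGATCCTTAA"      # toy read, seen only a few times
reads = [common] * 100 + [rare] * 3
kept = list(normalize(reads))
print(len(reads), "->", len(kept))
```

It is a single pass over the reads: the redundant copies of the oversampled read are discarded as they arrive, while the rare read is kept in full.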
Storing data this way is better than best-possible
information-theoretic storage.




                                 Pell et al., PNAS 2012
Use Bloom filter to store graphs
    Graphs only gain nodes because of Bloom filter false positives.




                                                         Pell et al., PNAS 2012
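A toy sketch of the idea, assuming the graph is a de Bruijn-style graph whose nodes are k-mers: only the k-mers go into the Bloom filter, and edges are never stored but rediscovered by asking the filter about the four possible one-base extensions. `TinyBloom`, `successors`, `walk`, the sequence, and `k = 8` are all invented for this example; it is not the code behind Pell et al.

```python
import hashlib

class TinyBloom:
    def __init__(self, n_bits=65536, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha1(bytes([i]) + item.encode("ascii")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def load_sequence(bloom, seq, k):
    """Store every k-mer of seq as a graph node."""
    for i in range(len(seq) - k + 1):
        bloom.add(seq[i:i + k])

def successors(bloom, kmer):
    # Edges are implicit: try all four one-base extensions.
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]

def walk(bloom, start, max_steps=1000):
    """Follow the graph from start until a dead end or a branch."""
    path = kmer = start
    for _ in range(max_steps):
        nxt = successors(bloom, kmer)
        if len(nxt) != 1:
            break
        kmer = nxt[0]
        path += kmer[-1]
    return path

k = 8
seq = "ACGTTGCAAGGATCCTTAAC"     # toy sequence with no repeated k-mers
bloom = TinyBloom()
load_sequence(bloom, seq, k)
print(walk(bloom, seq[:k]))
```

A false positive in the filter shows up as an extra node that `successors` reports even though it was never loaded, which is why the graph can only gain nodes, never lose them.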
Some assembly details
• This was completely intractable.
• Implemented in C++ and Python; “good practice” (?)
• We’ve changed scaling behavior from data to information.
• Practical scaling for ~soil metagenomics is 10x:
    – need < 1 TB of RAM for ~2 TB of data, ~2 weeks.
    – Before, ~10 TB.
• Smaller problems are pretty much solved.
• Just beginning to explore threading, multicore, etc. (BIG DATA grant
   proposal)
• Goal is to scale to 50 Tbp of data (~5-50 TB RAM currently)
Concluding thoughts
• Channel randomness.
• Embrace streaming.
• Live with minor uncertainty.
• Don’t be afraid to discard data.


  (Also, I’m an open source hacker who can confer PhDs, in
    exchange for long years of low pay living in Michigan.
E-mail me! And don’t talk to Brett Cannon about PhDs first.)
References
SkipLists: Wikipedia, and John Shipman’s code:
http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/pyskip.pdf

HyperLogLog: Aggregate Knowledge’s blog,
http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

And: https://github.com/svpcom/hyperloglog

Bloom Filters: Wikipedia


Our work: http://ivory.idyll.org/blog/ and http://ged.msu.edu/interests.html

ctb@msu.edu


Editor’s notes

  1. Obviously not all algorithms can be made into online algorithms, much less streaming.