Analytics on
100 TB+ catalogs
Enabling astronomy in the
era of massive survey
telescopes
Mario Juric <mjuric@astro.washington.edu>
UW Astronomy | DIRAC | eScience
@mjuric
Zwicky Transient Facility
> 1000 images/night, 576 Mpix
> 300 M detected sources/night
> 1 billion objects, 75-250 measurements/object/year
> 1 M alerts/night
http://ztf.caltech.edu
Zwicky Transient Facility
> 1 TB/night (raw), 10 TB (processed)
> 150 GB sources/night
> 20-40 GB alerts/night
http://ztf.caltech.edu
Zwicky Transient Facility
> 2.5 PB images/yr
> 37.5 TB of sources/year
> 5-10 TB of alerts/year
Starting: ~NOW!
Uncertainty: ~3x!
http://ztf.caltech.edu
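
The yearly figures above are consistent with the nightly ones if one assumes roughly 250 observing nights per year. That number is not stated on the slides; it is inferred here from the ratios (e.g. 37.5 TB/yr divided by 150 GB/night). A minimal Python check under that assumption:

```python
# Assumption (not stated in the slides): ~250 usable observing nights per year,
# inferred from 37.5 TB/yr divided by 150 GB/night.
NIGHTS_PER_YEAR = 250

GB_PER_TB = 1000
TB_PER_PB = 1000


def yearly_volumes(nights=NIGHTS_PER_YEAR):
    """Scale the per-night ZTF volumes quoted above to one year."""
    processed_images_tb = 10 * nights            # 10 TB/night (processed)
    sources_tb = 150 * nights / GB_PER_TB        # 150 GB sources/night
    alerts_tb = (20 * nights / GB_PER_TB,        # 20-40 GB alerts/night
                 40 * nights / GB_PER_TB)
    return {
        "images_pb": processed_images_tb / TB_PER_PB,
        "sources_tb": sources_tb,
        "alerts_tb": alerts_tb,
    }


if __name__ == "__main__":
    print(yearly_volumes())
```

With 250 nights this gives 2.5 PB of images, 37.5 TB of sources and 5-10 TB of alerts per year, matching the slide. The quoted ~3x uncertainty applies on top of these numbers.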
Spatial Extent: the Entire Sky
Example: The sky footprint of early Pan-STARRS PS1 data
Spatial Extent: the ~Entire Sky
Example: The sky footprint of early Pan-STARRS PS1 data
(zoomed in on a “medium deep field”)
New Science: the Time Component
> Time series analysis
(classification)
> Rapid identification and
alerting on “interesting”
variability
> Identification of moving
sources
Example RR Lyrae light curves from Székely et al. (2007)
The Wishlist: What we’re looking for in
a DBMS
> Must be able to reliably store the data
> Must enable efficient batch processing
– I.e., “compute this statistic over all time series”, in ~hours
> Must enable fast extraction of individual time series
– I.e., “give me the light curve of X”, in <1s
> Must enable fast spatial queries, fast histograms
– I.e., “Give me all objects in this area on the sky”, in <1s to start
> Must enable easy “cross matching”
– Positionally cross-match N catalogs, find neighbors
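
To make the three query types on this slide concrete, here is a toy in-memory sketch in plain Python. The class `ToyCatalog` and its method names are hypothetical and exist only for illustration; the brute-force loops show what the queries mean, not how a system at the 100 TB scale would implement them.

```python
import math
from collections import defaultdict


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


class ToyCatalog:
    """Hypothetical in-memory catalog; names are illustrative only."""

    def __init__(self):
        self.objects = {}                      # obj_id -> (ra_deg, dec_deg)
        self.light_curves = defaultdict(list)  # obj_id -> [(mjd, mag), ...]

    def add_object(self, obj_id, ra, dec):
        self.objects[obj_id] = (ra, dec)

    def add_measurement(self, obj_id, mjd, mag):
        self.light_curves[obj_id].append((mjd, mag))

    def light_curve(self, obj_id):
        """'Give me the light curve of X': one keyed lookup, sorted by time."""
        return sorted(self.light_curves.get(obj_id, []))

    def cone_search(self, ra, dec, radius_deg):
        """'Give me all objects in this area': brute force over all objects."""
        return sorted(oid for oid, (o_ra, o_dec) in self.objects.items()
                      if angular_sep_deg(ra, dec, o_ra, o_dec) <= radius_deg)

    def cross_match(self, other, radius_deg):
        """Nearest neighbour in `other` within radius_deg, for each object."""
        matches = {}
        for oid, (ra, dec) in self.objects.items():
            best = None
            for oid2, (ra2, dec2) in other.objects.items():
                sep = angular_sep_deg(ra, dec, ra2, dec2)
                if sep <= radius_deg and (best is None or sep < best[1]):
                    best = (oid2, sep)
            if best is not None:
                matches[oid] = best[0]
        return matches
```

The cross-match here costs O(N x M) comparisons, which is why the wishlist asks the DBMS to make it easy: at a billion objects some form of spatial indexing or partitioning is needed to avoid comparing every pair.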
The Wishlist: What we’re looking for in
a DBMS
> Must support insertions of ~300M rows/night
> Must scale to ~100TB+ catalogs in ~3 years
> Efficient in multi-user mode
> Should (must) be easy to use
– Shallow learning curve, ease of install, strong Python APIs
– Ideally easily replicated and manageable by astronomers.
– SQL-like interface is a plus (declarative queries)
> Ideally would like to be able to get it up and running in ~4-6
months.
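
A back-of-the-envelope sketch of what these scaling requirements imply. The 300M rows/night comes from this slide and the 150 GB sources/night from the ZTF slides; the 250 observing nights per year and the 10-hour ingest window are assumptions made here, not figures from the talk.

```python
ROWS_PER_NIGHT = 300_000_000            # "~300M rows/night" from the wishlist
SOURCE_BYTES_PER_NIGHT = 150 * 10**9    # "150 GB sources/night" from the ZTF slides
NIGHTS_PER_YEAR = 250                   # assumption, not stated in the slides
INGEST_WINDOW_HOURS = 10                # assumption: rows arrive over about one night


def ingest_budget(nights_per_year=NIGHTS_PER_YEAR,
                  window_hours=INGEST_WINDOW_HOURS):
    """Derive per-row size, sustained insert rate and multi-year totals."""
    return {
        "bytes_per_row": SOURCE_BYTES_PER_NIGHT // ROWS_PER_NIGHT,
        "rows_per_second": ROWS_PER_NIGHT / (window_hours * 3600),
        "rows_per_year": ROWS_PER_NIGHT * nights_per_year,
        "catalog_tb_after_3yr":
            SOURCE_BYTES_PER_NIGHT * nights_per_year * 3 / 10**12,
    }


if __name__ == "__main__":
    for name, value in ingest_budget().items():
        print(f"{name}: {value}")
```

Under these assumptions a row averages 500 bytes, the sustained insert rate is about 8,300 rows/s, one year accumulates 75 billion rows, and three years of sources come to about 112 TB, in line with the "~100TB+ in ~3 years" target.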
Options We’re Looking At
> Relational Databases
– Postgres, Oracle, qserv (experimental)
– Challenging to have tables of ~100 billion rows (expectation after ~1yr)
– Slow time-series extraction
> Parquet+Spark
– Looks like it may scale.
– Not easy to set up, steep learning curve
– No native multi-user awareness
> Custom solution (“Large Survey Database”; http://lsddb.org)
– Partitioned tree of HDF5 files (“Parquet before Parquet”) + Python client
– Special snowflake, will need eternal support, no community.
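
For readers unfamiliar with the idea of a spatially partitioned file tree, here is a minimal sketch. It is not the actual layout of the Large Survey Database or of any Parquet dataset; the function names, the path scheme and the fixed 1-degree RA/Dec grid are all assumptions chosen for simplicity.

```python
from collections import defaultdict


def cell_id(ra_deg, dec_deg, cell_deg=1.0):
    """Map a sky position to an (ra_cell, dec_cell) index on a simple grid."""
    ra_idx = int((ra_deg % 360.0) // cell_deg)
    dec_idx = int((dec_deg + 90.0) // cell_deg)
    return ra_idx, dec_idx


def partition_path(ra_deg, dec_deg, cell_deg=1.0):
    """Hypothetical directory name for the cell containing this position."""
    ra_idx, dec_idx = cell_id(ra_deg, dec_deg, cell_deg)
    return f"dec={dec_idx:03d}/ra={ra_idx:03d}"


def partition_rows(rows, cell_deg=1.0):
    """Group detections by sky cell, then by object.

    rows: iterable of (obj_id, ra_deg, dec_deg, mjd, mag).
    Returns {partition_path: {obj_id: [(mjd, mag), ...]}}.
    """
    parts = defaultdict(lambda: defaultdict(list))
    for obj_id, ra, dec, mjd, mag in rows:
        parts[partition_path(ra, dec, cell_deg)][obj_id].append((mjd, mag))
    return parts
```

Keeping each object's measurements together inside one spatial cell is what makes both "all objects in this area" and "the light curve of X" touch only a few files. A fixed RA/Dec grid has cells that shrink toward the poles; an equal-area pixelization such as HEALPix avoids that and would be a natural replacement in a real design.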
Discuss
Are there other areas that have to deal with
~billion time series of 100+ measurements?
What are the technology choices you use to
manage your data sets? What should we
be looking at?
A Related Problem: Telemetry
Databases
> ~100+ sensors, <=10 Hz sampling
– ~500 MB/night
– ~150 GB/yr
> Slightly different slicing needs
– “Give me the data from all sensors in the following time
window”, as opposed to “give me all the data for the following
set of objects”
> Simple HDF5 may work
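
The time-window slicing described here can be sketched with nothing more than sorted timestamps and binary search. The `SensorLog` class and `all_sensors_in_window` helper below are hypothetical illustrations, not part of any telemetry system named in the talk; an HDF5-backed version would apply the same index arithmetic to on-disk arrays.

```python
from bisect import bisect_left, bisect_right


class SensorLog:
    """Append-only, time-ordered samples for one sensor (illustrative)."""

    def __init__(self):
        self.times = []    # sorted timestamps, e.g. seconds since start
        self.values = []   # one reading per timestamp

    def append(self, t, value):
        if self.times and t < self.times[-1]:
            raise ValueError("samples must arrive in time order")
        self.times.append(t)
        self.values.append(value)

    def window(self, t_start, t_end):
        """Samples with t_start <= t <= t_end, found by binary search."""
        lo = bisect_left(self.times, t_start)
        hi = bisect_right(self.times, t_end)
        return list(zip(self.times[lo:hi], self.values[lo:hi]))


def all_sensors_in_window(logs, t_start, t_end):
    """'Give me the data from all sensors in the following time window.'"""
    return {name: log.window(t_start, t_end) for name, log in logs.items()}
```

Because every sensor is sliced on the same axis (time), one sorted array per sensor is enough. That is the main difference from the catalog case, where the natural key is the object rather than the time.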
The Next Problem (in 2022)
The Large Synoptic Survey Telescope
An automated 8.4-meter telescope that for 10 years will
image half the sky every ~3 days, generate ~50 PB of
(raw) imaging data, issue real-time alerts on any changes
in the sky (~10 million/night), measure properties of
~40 billion objects in the sky (~1000 times
each), and make the results available
in a web-accessible database.
http://lsst.org

Round Table Introduction: Analytics on 100 TB+ catalogs
