Minerva: Drill Storage Plugin for IPFS

Minerva is a storage plugin for Drill that connects IPFS's decentralized storage with Drill's flexible query engine. Any data file stored on IPFS can be easily accessed from Drill's query interface, just like a file stored on a local disk.

Visit https://github.com/bdchain/Minerva to learn more and try it out!

  1. Minerva: Drill Storage Plugin for IPFS
     • Run SQL queries on data stored in IPFS
     • Build the big data storage blockchain (BDSC)
  2. Problems with Public Dataset Analytics
     A present-day workflow:
     1. Pinpoint the real address of a dataset, typically an HTTP link;
     2. Download the dataset in client-server mode;
     3. Configure a computation environment for big data analysis;
     4. Preprocess the dataset (e.g. converting file formats) and develop data analysis algorithms.
  3. Problems with Public Dataset Analytics
     Workflow:
     1. Pinpoint the real address of a dataset;
     2. Download the dataset;
     3. Set up a computation environment powerful enough for big data analysis;
     4. Prepare the data, e.g. converting file formats and implementing basic analysis algorithms.
     Caveats:
     • Links may expire over time due to temporary server failures or permanent website shutdowns.
     • The dataset might be polluted, with no way to tell whether it is the right dataset for your needs.
     • A single website cannot host all datasets.
  4. Problems with Public Dataset Analytics
     Workflow:
     1. Locate the dataset, typically via an HTTP link;
     2. Download the dataset in client-server mode;
     3. Set up a computation environment powerful enough for big data analysis;
     4. Prepare the data, e.g. converting file formats and implementing basic analysis algorithms.
     Caveats:
     • Datasets are usually huge, demanding long download times.
     • Client-server mode is not bandwidth efficient.
     • Data files are usually packaged and compressed into a single archive, so a user interested in only part of the dataset still has to download all of it.
  5. Problems with Public Dataset Analytics
     Workflow:
     1. Locate the dataset, typically via an HTTP link;
     2. Download the dataset;
     3. Configure a computation environment for big data analysis;
     4. Prepare the data, e.g. converting file formats and implementing basic analysis algorithms.
     Caveats:
     • Expensive storage and computation resources are necessary for large-scale data analytics.
     • Maintenance and management overhead consumes enormous human resources.
  6. Problems with Public Dataset Analytics
     Workflow:
     1. Locate the dataset, typically via an HTTP link;
     2. Download the dataset;
     3. Set up a computation environment powerful enough for big data analysis;
     4. Preprocess the dataset (e.g. converting file formats) and develop data analysis algorithms.
     Caveats:
     • Datasets from different origins and different areas of research come in different formats and structures.
     • Dataset users might not be proficient in programming.
     • Repetitive work in data analytics is inevitable when many users happen to process the same dataset.
  7. IPFS [1] to the Rescue
     • Decentralization: no single point of failure
     • Collaboration: sharing resources as well as reusing code within the community
     • Fine-grained content addressing [2]: get exactly what you need
     [1] https://ipfs.io/
     [2] Datasets can be split into blocks, and only the blocks of interest need processing.
  8. Drill [1], the Distributed Query Engine
     • Compatibility: supports standard SQL statements
     • Flexibility: no metastore, no schema, non-relational data
     • Scalability: supports user-defined functions
     • Locality awareness: pushes processing into the nearby datastores
     [1] https://drill.apache.org/
  9. Drill and IPFS Combined
     Drill and IPFS collocation: a distributed network of nodes, each of which runs Drill and IPFS simultaneously.
     [Architecture diagram: each localhost runs a query engine (planner, reader/writer) on top of IPFS storage, with version and format management provided by Qri [1], and connects to peers on the P2P network via libp2p [2].]
     [1] https://qri.io/
     [2] https://libp2p.io/
  10. Query Explained: Read
      The user submits a SQL statement that "reads" data to the Drill query interface; the receiving Drillbit acts as the foreman:
      SELECT * FROM ipfs.`/ipfs/QmAce…f2a/employee.json`
      The table path contains the IPFS CID [1] of the dataset being queried.
      [1] Content Identifier (CID): https://github.com/ipld/specs/blob/master/block-layer/CID.md
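
As a concrete illustration of this entry point, the sketch below submits the slide's SELECT statement through Drill's standard JDBC driver. It is a minimal example rather than part of Minerva itself: the `ipfs` plugin name and the elided CID are placeholders taken from the slide, and a Drillbit is assumed to be reachable on localhost with drill-jdbc-all on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MinervaReadExample {
      public static void main(String[] args) throws Exception {
        // Connect directly to a local Drillbit via Drill's JDBC driver;
        // a ZooKeeper-based URL (jdbc:drill:zk=...) works as well.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             // The CID below is the elided placeholder from the slide, not a real hash.
             ResultSet rs = stmt.executeQuery(
                 "SELECT * FROM ipfs.`/ipfs/QmAce…f2a/employee.json`")) {
          while (rs.next()) {
            System.out.println(rs.getString(1));   // print the first column of each row
          }
        }
      }
    }
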
  11. Query Explained: Read
      IPFS object resolution: the foreman resolves the top-level object into its links, as with
      ipfs object links QmAce…f2a
      where the links are the CIDs of the objects (chunks) contained in the "top" object.
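
A minimal sketch of this object-resolution step, assuming the java-ipfs-http-client library (io.ipfs.api) and a local IPFS daemon; Minerva's internal code may use a different client, so treat the exact calls as an assumption mirroring the ipfs object links command above.

    import io.ipfs.api.IPFS;
    import io.ipfs.api.MerkleNode;
    import io.ipfs.multihash.Multihash;
    import java.util.List;

    public class ObjectResolutionSketch {
      public static void main(String[] args) throws Exception {
        IPFS ipfs = new IPFS("/ip4/127.0.0.1/tcp/5001");   // local IPFS daemon API endpoint
        // Placeholder CID from the slide; substitute a real base58 hash.
        Multihash top = Multihash.fromBase58("QmAce…f2a");
        // Equivalent of `ipfs object links`: list the chunk CIDs under the top object.
        List<MerkleNode> links = ipfs.object.links(top);
        for (MerkleNode link : links) {
          System.out.println(link.hash);                   // CID of one chunk
        }
      }
    }
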
  12. Query Explained: Read
      IPFS provider resolution: for each chunk, the foreman asks the DHT which peers can provide it, as with
      ipfs dht findprovs QmFHq…32T
      The providers are Drillbits running IPFS that can supply the data pieces; the Drill execution plan is then sent to those peer nodes.
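
Likewise, a hedged sketch of the provider-resolution step with the same client library. The findprovs call mirrors ipfs dht findprovs; how Minerva maps the returned peer IDs onto Drillbit endpoints is only summarized in the comments and is an assumption.

    import io.ipfs.api.IPFS;
    import io.ipfs.multihash.Multihash;

    public class ProviderResolutionSketch {
      public static void main(String[] args) throws Exception {
        IPFS ipfs = new IPFS("/ip4/127.0.0.1/tcp/5001");     // local IPFS daemon API endpoint
        // Placeholder chunk CID from the slide; substitute a real base58 hash.
        Multihash chunk = Multihash.fromBase58("QmFHq…32T");
        // Equivalent of `ipfs dht findprovs`: ask the DHT which peers hold this chunk.
        Object providers = ipfs.dht.findprovs(chunk);
        System.out.println(providers);
        // The foreman would intersect these peer IDs with the cluster's Drillbits
        // and send each node the plan fragments for the chunks it stores locally.
      }
    }
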
  13. Query Explained: Read
      Each peer node executes its plan fragment and returns its part of the results to the foreman, which assembles them and returns the final results to the user.
  14. Query Explained: Write
      SQL statement that "writes" data:
      CREATE IPFSTABLE ipfs.`create` AS (
        SELECT * FROM ipfs.`/ipfs/QmAce…f2a/employee.json` ORDER BY `id` DESC
      )
      • The actual data operations happen on the nodes that store the data locally.
      • The partial CIDs of the new data pieces are sent back to the foreman.
      • The partial CIDs are reassembled into a single CID, which is returned to the user.
  15. User-Defined Functions
      • Format-conversion programs and common analysis algorithms can be implemented as user-defined functions (UDFs) and distributed along with the datasets.
      • Drill can invoke these UDFs by their CIDs, in the same way it locates a dataset on IPFS.
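
For illustration, the sketch below is an ordinary Drill simple UDF written against Drill's standard UDF API (org.apache.drill.exec.expr). The function name and conversion logic are invented for this example and are not part of Minerva; the point is that such a class can be packaged, shared alongside an IPFS-hosted dataset, and called from a query, e.g. SELECT fahrenheit_to_celsius(temp) FROM ….

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.Float8Holder;

    // Hypothetical UDF: convert Fahrenheit readings in a dataset to Celsius.
    @FunctionTemplate(name = "fahrenheit_to_celsius",
        scope = FunctionTemplate.FunctionScope.SIMPLE,
        nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
    public class FahrenheitToCelsius implements DrillSimpleFunc {
      @Param Float8Holder degreesF;    // input value from the queried column
      @Output Float8Holder degreesC;   // value handed back to the SQL engine

      public void setup() {            // no per-query state to initialize
      }

      public void eval() {
        degreesC.value = (degreesF.value - 32.0) * 5.0 / 9.0;
      }
    }
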
  16. Code Structure
      The plugin is built on the IPFS DAG/DHT API and the IPFS Object API.
  17. A Query Example
  18. Performance Evaluation
      • A 6-node cluster on a cloud service provider, each node with 8 GB RAM and a 4-core CPU
      • IPFS running in private-network mode
      • Queried file sizes: 100 MB to 1 GB
      • Queries: simple statements such as SELECT * and SELECT COUNT(*)
      • Response time: 2-10 s
      • Transactions per second: ~10
  19. Performance Evaluation
      [Charts: query completion time under different chunk sizes (left) and parallelization widths (right). Dataset 1: 67 MB; Dataset 2: 190 MB.]
  20. Possible Applications
      • An easy MPP cluster with Minerva
      • Decentralized data sharing systems
      • Big data analysis for other Dapps running on IPFS
  21. Problems To Be Solved
      • Performance
        • DHT operations take too much time, especially over the Internet.
        • IPFS limits blocks to at most 4 MB, resulting in an enormous number of blocks for huge datasets.
      • Write operations are incomplete
        • The last step, reassembling the partial CIDs, is not yet implemented.
      • Stability
  22. THANK YOU FOR YOUR TIME!
      GitHub: github.com/bdchain/Minerva
