This presentation gives an overview of the Apache Fluo project. It explains Apache Fluo in terms of its architecture, functionality and transactions.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
Apache Fluo
1. What Is Apache Fluo?
● Supports incremental updates to large-scale data sets
● Open source Apache 2.0 license
● Based upon Apache Accumulo
– Uses Hadoop HDFS to store data
– Uses ZooKeeper for configuration
– Partitions tables into tablets
● It is a distributed system
● Supports cross node transactions
2. What Is Apache Fluo?
● Allows monitoring of large datasets to
– Identify small changes
– Join changes into the larger data set
– Without processing all data
● Transactions allow many concurrent changes
– Without data corruption
● Fluo uses code based observers which
– Act on table column changes
● Offers a Fluo Java based API
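The idea of joining small changes into a larger data set without reprocessing everything can be illustrated with a minimal sketch. This is a toy, in-memory model and does not use the Fluo API; the class and method names (IncrementalWordCount, upsert, count) are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of incremental processing (NOT the Fluo API; names are hypothetical).
// A change to one document touches only the words in its old and new text,
// instead of recounting every document.
class IncrementalWordCount {
    private final Map<String, Integer> counts = new HashMap<>();
    private final Map<String, String> docs = new HashMap<>();

    // Apply a single document change to the global counts.
    void upsert(String docId, String text) {
        String old = docs.put(docId, text);
        if (old != null) {
            adjust(old, -1);   // retract the previous version's contribution
        }
        adjust(text, +1);      // add the new version's contribution
    }

    private void adjust(String text, int delta) {
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            counts.merge(word, delta, Integer::sum);
            if (counts.get(word) == 0) {
                counts.remove(word);   // drop words that no longer occur
            }
        }
    }

    int count(String word) {
        return counts.getOrDefault(word, 0);
    }
}
```

Each call to upsert costs work proportional to the size of the changed document, not to the size of the whole data set.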
3. What Is Apache Fluo?
● Use of Fluo is code based and low level
● Fluo uses Hadoop YARN to run its processes
● Fluo uses ZooKeeper to
– Store its meta data
– Store its state information
● Fluo data is stored in Fluo tables on Accumulo (HDFS)
– Same structure as Accumulo except
– Entries have no user-visible timestamp
5. Fluo Architecture
● Large scale computation through small scale transactions
● Clients access Fluo through Java API
● Clients ingest data through the API
● Each application's oracle process assigns transaction timestamps
● Each application's worker processes run user code
● User code/observers monitor column changes
● Multiple workers can run the same observers
● Transactions change data, snapshots read data
6. Fluo Architecture
● Fluo provides snapshot isolation
● A snapshot sees only transactions committed before it started
● Transaction overlap / collision is possible
● In this case write skew is possible if
– Overlapping transactions concurrently update different keys
● Fluo supports scanners to read data ranges or spans
● Fluo has a transaction based LoaderExecutor
– To aid the loading of data
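Snapshot isolation and write skew, as described on this slide, can be sketched with a small multi-version store. This is a single-threaded toy model, not the Fluo API: the names ToySnapshotStore, Tx and writeSkewDemo are hypothetical, a simple counter stands in for the timestamp oracle, and the conflict rule (a commit fails if a written key has a newer committed version) is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy multi-version store (NOT the Fluo API; all names are hypothetical).
// Single-threaded: interleavings are simulated by the order of the calls.
class ToySnapshotStore {
    private long clock = 0;   // stands in for the timestamp oracle
    private final Map<String, TreeMap<Long, Integer>> versions = new HashMap<>();

    private long nextTimestamp() {
        return ++clock;
    }

    class Tx {
        final long startTs = nextTimestamp();
        final Map<String, Integer> writes = new HashMap<>();

        // Read the newest version committed at or before startTs (0 if none).
        int get(String key) {
            TreeMap<Long, Integer> v = versions.get(key);
            if (v == null) {
                return 0;
            }
            Map.Entry<Long, Integer> e = v.floorEntry(startTs);
            return e == null ? 0 : e.getValue();
        }

        void set(String key, int value) {
            writes.put(key, value);
        }

        // Fails if any key written here has a version newer than startTs.
        boolean commit() {
            for (String key : writes.keySet()) {
                TreeMap<Long, Integer> v = versions.get(key);
                if (v != null && !v.isEmpty() && v.lastKey() > startTs) {
                    return false;   // write-write collision
                }
            }
            long commitTs = nextTimestamp();
            for (Map.Entry<String, Integer> w : writes.entrySet()) {
                versions.computeIfAbsent(w.getKey(), k -> new TreeMap<>())
                        .put(commitTs, w.getValue());
            }
            return true;
        }
    }

    Tx begin() {
        return new Tx();
    }

    // Two overlapping transactions each check "alice + bob >= 2" and then
    // clear a different key. Both commit, leaving the sum at 0: write skew.
    static int writeSkewDemo() {
        ToySnapshotStore store = new ToySnapshotStore();
        Tx init = store.begin();
        init.set("alice", 1);
        init.set("bob", 1);
        init.commit();

        Tx t1 = store.begin();
        Tx t2 = store.begin();
        if (t1.get("alice") + t1.get("bob") >= 2) t1.set("alice", 0);
        if (t2.get("alice") + t2.get("bob") >= 2) t2.set("bob", 0);
        t1.commit();
        t2.commit();

        Tx check = store.begin();
        return check.get("alice") + check.get("bob");
    }
}
```

In this model two transactions that write the same key collide and the later one fails to commit, while two transactions that read the same data but write different keys both succeed. The second case is the write skew mentioned above.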
7. Fluo Architecture
● Fluo supports incremental processing via
● Notifications
– Persistent markers, set by a transaction, which indicate that
– An Observer should run later for a certain row+column
● Observers
– User provided code that is registered to
– Process notifications for a certain column
– An Observer receives the row/column that triggered it, plus a transaction
● Fluo worker processes running across a cluster
– Will execute Observers
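The notification and observer flow above can be sketched as a small in-memory model. This is not the Fluo API: ToyNotifier, its nested Observer interface and the runWorkers method are hypothetical names, the table itself stands in for the transaction handed to the observer, and the choice to keep pending markers in a set (so repeated updates to the same row+column leave one marker) is an assumption of this sketch.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy notification/observer model (NOT the Fluo API; names are hypothetical).
class ToyNotifier {
    // User code registered for one column; receives the triggering row/column
    // plus a handle for further reads and writes (here: the table itself).
    interface Observer {
        void process(ToyNotifier tx, String row, String column);
    }

    private final Map<String, Map<String, String>> table = new HashMap<>();
    private final Map<String, Observer> observers = new HashMap<>();
    // Pending markers: one entry per (row, column) awaiting an observer run.
    private final Set<Map.Entry<String, String>> pending = new LinkedHashSet<>();

    void observe(String column, Observer observer) {
        observers.put(column, observer);
    }

    // A write to an observed column also sets a notification marker.
    void set(String row, String column, String value) {
        table.computeIfAbsent(row, r -> new HashMap<>()).put(column, value);
        if (observers.containsKey(column)) {
            pending.add(new AbstractMap.SimpleImmutableEntry<>(row, column));
        }
    }

    String get(String row, String column) {
        Map<String, String> cols = table.get(row);
        return cols == null ? null : cols.get(column);
    }

    // Stand-in for the worker processes: drain markers and run observers until
    // none remain. Returns the number of observer executions.
    int runWorkers() {
        int runs = 0;
        while (!pending.isEmpty()) {
            List<Map.Entry<String, String>> batch = new ArrayList<>(pending);
            pending.clear();
            for (Map.Entry<String, String> n : batch) {
                observers.get(n.getValue()).process(this, n.getKey(), n.getValue());
                runs++;
            }
        }
        return runs;
    }
}
```

A usage example: register an observer on a hypothetical "content" column that derives a "length" column, write a row, then call runWorkers. Because observers can write to other observed columns, a chain of small steps can follow from a single ingest.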
8. Fluo Architecture
● Fluo supports two types of notification
● Strong notification
– Guarantees an observer will run at most once
– When a column is modified
– Even for multiple row+column updates
● Weak notification
– Causes an observer to run at least once
– Observers may run multiple times and/or concurrently
– Based on a single weak notification
10. Fluo Row Locking
● For cross node transactions Fluo uses
– Accumulo conditional mutations
● Conditional mutations lock entire rows
– On the server side when checking conditions
● Row locks can impact transaction performance
● This may be a problem if
– Many transactions update separate columns in the same row
– Those transactions are very likely to run concurrently
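The contention described above can be sketched with a toy table in which a check-and-write requires a lock on the whole row. This does not use the Accumulo or Fluo APIs: ToyRowLockTable, tryLockRow, unlockRow and checkAndPut are hypothetical names, and the model is single-threaded, so concurrent transactions are simulated by the order of the calls.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Toy model of row-level locking during conditional writes
// (NOT the Accumulo or Fluo API; names are hypothetical, not thread-safe).
class ToyRowLockTable {
    private final Map<String, Map<String, String>> rows = new HashMap<>();
    private final Set<String> lockedRows = new HashSet<>();

    // Take the lock for an entire row; false means another holder has it.
    boolean tryLockRow(String row) {
        return lockedRows.add(row);
    }

    void unlockRow(String row) {
        lockedRows.remove(row);
    }

    // Write only if the column currently holds `expected` (null means absent).
    // The caller must hold the row lock, even though only one column is checked.
    boolean checkAndPut(String row, String column, String expected, String value) {
        if (!lockedRows.contains(row)) {
            throw new IllegalStateException("row lock not held: " + row);
        }
        Map<String, String> cols = rows.computeIfAbsent(row, r -> new HashMap<>());
        if (!Objects.equals(cols.get(column), expected)) {
            return false;   // condition failed, nothing is written
        }
        cols.put(column, value);
        return true;
    }
}
```

In this model a second transaction that wants a different column of the same row still fails tryLockRow until the first one unlocks, while a transaction on another row proceeds immediately. Spreading hot columns across rows is one way such contention might be reduced.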
11. Available Books
● See “Big Data Made Easy”
– Apress Jan 2015
● See “Mastering Apache Spark”
– Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
– Apress Jan 2018
● Find the author on Amazon
– www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
12. Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology based issues
– Big data integration