Apache Fluo

What Is Apache Fluo?
● For incremental updates to large scale data sets
● Open source Apache 2.0 license
● Based upon Apache Accumulo
– Uses Hadoop HDFS to store data
– Uses ZooKeeper for configuration
– Partitions tables into tablets
● It is a distributed system
● Supports cross node transactions

● Allows monitoring of large datasets to
– Identify small changes
– Join changes into the larger data set
– Without processing all data
● Transactions allow many concurrent changes
– Without data corruption
● Fluo uses code based observers which
– Act on table column changes
● Offers a Fluo Java based API

● Use of Fluo is code based and low level
● Fluo uses Hadoop YARN to run its processes
● Fluo uses ZooKeeper to
– Store its meta data
– Store its state information
● Fluo data is stored in Fluo tables on Accumulo (HDFS)
– Same structure as Accumulo tables except
– Rows have no timestamps

Fluo Architecture
● Large scale computation through small scale transactions
● Clients access Fluo through Java API
● Clients ingest data through the API
● The application's oracle process assigns transaction timestamps
● Application worker processes run user code
● User code/observers monitor column changes
● Multiple workers can run the same observers
● Transactions change data, snapshots read data

Fluo Architecture
● Fluo provides snapshot isolation
● A snapshot only sees transactions committed before it started
● Concurrent transactions can overlap / collide
● In this case a write skew is possible if
– Overlapping transactions read the same data but update different keys
● Fluo supports scanners to read data ranges or spans
● Fluo has a transaction based LoaderExecutor
– To aid the loading of data
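
The snapshot isolation and write skew points above can be illustrated with a small, single-threaded toy model. This is only a sketch under stated assumptions: the class `SnapshotIsolationSketch`, its `Tx` type and the integer counter standing in for the oracle are all hypothetical names made up for illustration, and none of this is the Fluo API or how Fluo commits internally.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-threaded model of snapshot isolation. Hypothetical names; not the Fluo API.
class SnapshotIsolationSketch {
    // key -> (commit timestamp -> value)
    private final Map<String, TreeMap<Long, Integer>> versions = new HashMap<>();
    private long clock = 0; // stands in for a timestamp oracle

    final class Tx {
        private final long startTs = ++clock;
        private final Map<String, Integer> writes = new HashMap<>();

        // Read own buffered write, else the newest version committed at or before startTs.
        Integer get(String key) {
            if (writes.containsKey(key)) return writes.get(key);
            TreeMap<Long, Integer> history = versions.get(key);
            if (history == null) return null;
            Map.Entry<Long, Integer> visible = history.floorEntry(startTs);
            return visible == null ? null : visible.getValue();
        }

        void set(String key, int value) { writes.put(key, value); }

        // Fail if any key in the write set was committed by another transaction after startTs.
        boolean commit() {
            for (String key : writes.keySet()) {
                TreeMap<Long, Integer> history = versions.get(key);
                if (history != null && history.lastKey() > startTs) return false;
            }
            long commitTs = ++clock;
            for (Map.Entry<String, Integer> w : writes.entrySet()) {
                versions.computeIfAbsent(w.getKey(), k -> new TreeMap<>()).put(commitTs, w.getValue());
            }
            return true;
        }
    }

    Tx begin() { return new Tx(); }

    // Two overlapping transactions read x and y but write different keys.
    // Returns {t1 committed (1/0), t2 committed (1/0), final x + y}.
    static int[] writeSkew() {
        SnapshotIsolationSketch db = new SnapshotIsolationSketch();
        Tx init = db.begin();
        init.set("x", 50);
        init.set("y", 50);
        init.commit();

        Tx t1 = db.begin();
        Tx t2 = db.begin();
        // Each transaction checks the rule "x + y must stay >= 0" against its own snapshot.
        if (t1.get("x") + t1.get("y") >= 100) t1.set("x", t1.get("x") - 100);
        if (t2.get("x") + t2.get("y") >= 100) t2.set("y", t2.get("y") - 100);
        boolean c1 = t1.commit();
        boolean c2 = t2.commit();

        Tx reader = db.begin();
        return new int[] { c1 ? 1 : 0, c2 ? 1 : 0, reader.get("x") + reader.get("y") };
    }

    // Two overlapping transactions write the same key; returns whether the second commit succeeds.
    static boolean secondWriterCommits() {
        SnapshotIsolationSketch db = new SnapshotIsolationSketch();
        Tx t1 = db.begin();
        Tx t2 = db.begin();
        t1.set("x", 1);
        t2.set("x", 2);
        t1.commit();
        return t2.commit();
    }

    public static void main(String[] args) {
        int[] r = writeSkew();
        System.out.println("t1 committed=" + r[0] + " t2 committed=" + r[1] + " x+y=" + r[2]);
        System.out.println("second writer of same key commits: " + secondWriterCommits());
    }
}
```

In this model both transactions in `writeSkew()` commit because their write sets do not overlap, yet the combined result breaks the rule each one checked. When both write the same key, the second commit is rejected instead.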

Fluo Architecture
● Fluo supports incremental processing via notifications and observers
● Notifications
– Persistent markers set by a transaction that indicate
– An observer should run later for a certain row+column
● Observers
– User provided code that is registered to
– Process notifications for a certain column
● An observer receives the row/column that triggered it plus a transaction
● Fluo worker processes running across a cluster
– Will execute observers

Fluo Architecture
● Fluo supports two types of notification
● Strong notification
– Guarantees an observer will run at most once
– When a column is modified
– Even for multiple row+column updates
● Weak notification
– Causes an observer to run at least once
– Observers may run multiple times and/or concurrently
– Based on a single weak notification
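
The notification and observer mechanism described above can be sketched with an in-memory toy. This is an illustration only: `NotificationSketch`, its `Observer` interface and the `"row|column"` string keys are hypothetical and are not the Fluo API. The sketch assumes that a notification is a marker per row+column, so several updates to the same cell before the worker runs leave a single marker, which roughly mirrors the strong notification behaviour listed above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy in-memory model of notifications and observers. Hypothetical names; not the Fluo API.
class NotificationSketch {
    interface Observer {
        void process(NotificationSketch table, String row, String column);
    }

    private final Map<String, String> cells = new HashMap<>();       // "row|column" -> value
    private final Map<String, Observer> observers = new HashMap<>(); // column -> observer
    private final Set<String> pending = new LinkedHashSet<>();       // notification markers
    final List<String> log = new ArrayList<>();

    // Register user code for one column.
    void observe(String column, Observer observer) { observers.put(column, observer); }

    // Write a cell; if the column is observed, leave a marker for later processing.
    void set(String row, String column, String value) {
        String key = row + "|" + column;
        cells.put(key, value);
        if (observers.containsKey(column)) pending.add(key); // a set, so repeats collapse
    }

    String get(String row, String column) { return cells.get(row + "|" + column); }

    // Stand-in for a worker: drain markers and run the matching observer for each.
    // Observers may write observed columns, which queues further markers.
    int runWorker() {
        int runs = 0;
        while (!pending.isEmpty()) {
            String marker = pending.iterator().next();
            pending.remove(marker);
            String[] rowCol = marker.split("\\|", 2);
            observers.get(rowCol[1]).process(this, rowCol[0], rowCol[1]);
            runs++;
        }
        return runs;
    }

    // Small pipeline: "raw" values are upper-cased into "clean", and "clean" changes are logged.
    static NotificationSketch pipeline() {
        NotificationSketch table = new NotificationSketch();
        table.observe("raw", (t, row, col) -> t.set(row, "clean", t.get(row, col).toUpperCase()));
        table.observe("clean", (t, row, col) -> t.log.add(row + "=" + t.get(row, col)));
        return table;
    }

    public static void main(String[] args) {
        NotificationSketch table = pipeline();
        table.set("r1", "raw", "a");
        table.set("r1", "raw", "b"); // same row+column again before the worker runs
        table.set("r2", "raw", "c");
        System.out.println("observer runs: " + table.runWorker());
        System.out.println("log: " + table.log);
    }
}
```

Only the cells that changed are processed; the rest of the table is never scanned, which is the point of incremental processing.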

Fluo Row Locking
● For cross node transactions Fluo uses
– Accumulo conditional mutations
● Conditional mutations lock entire rows
– On the server side when checking conditions
● Row locks can impact transaction performance
● This may be a problem if
– Many transactions update separate columns in the same row
– Those transactions are very likely to run concurrently
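
The row locking concern above can be made concrete with a very small model. It is a sketch only: `RowLockSketch` and its methods are hypothetical, and the only behaviour it assumes from the slides is that the lock is taken per row, so the column plays no part in deciding who has to wait.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of row-level locking. Hypothetical names; not Accumulo or Fluo code.
class RowLockSketch {
    private final Set<String> lockedRows = new HashSet<>();

    // Try to start an update of one column. The lock key is the row only,
    // so the column argument does not affect the outcome.
    boolean tryBegin(String row, String column) {
        return lockedRows.add(row);
    }

    void end(String row) { lockedRows.remove(row); }

    // Given updates that all start at the same moment as {row, column} pairs,
    // count how many find their row already locked and would have to wait.
    static int blocked(String[][] updates) {
        RowLockSketch locks = new RowLockSketch();
        int waiting = 0;
        for (String[] update : updates) {
            if (!locks.tryBegin(update[0], update[1])) waiting++;
        }
        return waiting;
    }

    public static void main(String[] args) {
        String[][] sameRow = { {"user1", "name"}, {"user1", "email"}, {"user1", "age"} };
        String[][] differentRows = { {"user1", "name"}, {"user2", "email"}, {"user3", "age"} };
        System.out.println("same row, separate columns, blocked: " + blocked(sameRow));
        System.out.println("different rows, blocked: " + blocked(differentRows));
    }
}
```

Updates that touch separate columns of one row still queue behind each other in this model, while updates to different rows do not.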

Available Books
● See “Big Data Made Easy”
– Apress Jan 2015
● See “Mastering Apache Spark”
– Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
– Apress Jan 2018
● Find the author on Amazon
– www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020

Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology based issues
– Big data integration
