This session covers best practices and details for architecting a near real-time application on Hadoop, using an end-to-end fraud detection case study as an example. It discusses the options available for ingest, schema design, processing frameworks, storage handlers, and more, and walks through each of the architectural decisions among those choices.
Topics are partitioned; each partition is ordered and immutable. Each message in a partition has an ID called an offset, which uniquely identifies the message within that partition.
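The partition/offset model above can be sketched with a minimal in-memory simulation. This is an illustration only, not the real Kafka API; the class and method names are my own:

```python
# Hypothetical model of Kafka's partitioned log: a topic is a set of
# partitions, each partition is an append-only list, and a message's offset
# is simply its index in that list.

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append a message; its offset is its position in the partition."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1  # offset: unique within this partition only

topic = Topic("payments", num_partitions=3)
first = topic.append(0, "txn-1001")
second = topic.append(0, "txn-1002")
print(first, second)  # offsets are assigned per partition: 0 1
```

Note that offsets are only unique within a partition; two messages in different partitions can share the same offset.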
Kafka retains all messages for a fixed amount of time, rather than waiting for acknowledgements from consumers before discarding them.
The only metadata retained per consumer is the position in the log – the offset
So adding many consumers is cheap
On the other hand, consumers have more responsibility and are more challenging to implement correctly
And batch-oriented consumers are not a problem either.
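The points above can be shown in a small sketch (illustrative only, not real Kafka code): the broker-side state per consumer is just an offset per partition, which is why adding consumers is cheap, and a batch consumer is simply one that advances its offset in large jumps:

```python
# One partition's log, plus the only per-consumer metadata the broker keeps:
# the consumer's position (offset) in that log.
log = ["m0", "m1", "m2", "m3", "m4"]
offsets = {"realtime": 0, "batch": 0}

def poll(consumer, max_messages):
    """Return the next messages for a consumer and advance its offset."""
    start = offsets[consumer]
    batch = log[start:start + max_messages]
    offsets[consumer] = start + len(batch)  # the consumer controls its position
    return batch

print(poll("realtime", 1))  # a low-latency consumer reads one message at a time
print(poll("batch", 4))     # a batch consumer uses the exact same mechanism
```

This also hints at the extra responsibility mentioned above: each consumer must track and commit its own position correctly.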
3 partitions, each replicated 3 times.
You choose how many replicas must ACK a message before it is considered committed.
This is the tradeoff between speed and reliability
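The speed/reliability tradeoff can be sketched as a simple commit rule (illustrative; the parameter names are my own, though they mirror Kafka's producer `acks` setting): a message is committed once the required number of replicas have ACKed it. Requiring one ACK favors latency; requiring all replicas favors durability:

```python
# A message is "committed" once enough replicas have acknowledged it.
def is_committed(acks_received, required_acks):
    return acks_received >= required_acks

replicas = ["broker1", "broker2", "broker3"]
acks_so_far = 2  # two of three replicas have ACKed this message

print(is_committed(acks_so_far, required_acks=1))              # fast, less safe
print(is_committed(acks_so_far, required_acks=len(replicas)))  # slower, safest
```

With `required_acks=1` the message is already committed; with all three required, the producer would still be waiting on the third replica.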
Each consumer can read from one or more partition leaders. Two consumers in the same consumer group cannot read from the same partition.
Leaders obviously do more work – but leadership is balanced across nodes.
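The group rule above can be illustrated with a hypothetical round-robin assignment (the real Kafka rebalancing protocol is more involved): each partition is assigned to exactly one consumer in the group, so no two group members ever read the same partition:

```python
# Assign each partition to exactly one consumer in the group, round-robin.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign(partitions=[0, 1, 2], consumers=["c1", "c2"])
print(a)  # each partition appears under exactly one consumer
```

A consequence worth noting: having more consumers in a group than partitions leaves some consumers idle, since a partition is never shared within a group.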
We reviewed the basic components on the system, and it may seem complex. In the next section we’ll see how simple it actually is to get started with Kafka.