The concept of "Data Lake" is in everyone's mind today. The idea of storing all the data that accumulates in a company in a central location and making it available sounds very interesting at first. But Data Lake can quickly turn from a clear, beautiful mountain lake into a huge pond, especially if it is inexpertly entrusted with all the source data formats that are common in today's enterprises, such as XML, JSON, CSV or unstructured text data. Who, after some time, still has an overview of which data, which format and how they have developed over different versions? Anyone who wants to help themselves from the Data Lake must ask themselves the same questions over and over again: what information is provided, what data types do they have and how has the content changed over time?
Data serialization frameworks such as Apache Avro and Google Protocol Buffers (Protobuf), which enable platform-independent data modeling and data storage, can help. This talk discusses the capabilities of Avro and Protobuf, shows how they can be used in the context of a data lake, and explains what advantages they bring. Support for Avro and Protobuf in Big Data and Fast Data platforms is also covered.
2. Guido Schmutz
Working at Trivadis for more than 23 years
Consultant, Trainer, Platform Architect for Java, Oracle, SOA and Big Data / Fast Data
Oracle Groundbreaker Ambassador & Oracle ACE Director
@gschmutz guidoschmutz.wordpress.com
210th edition
4. Agenda
• Introduction
• Avro vs. Protobuf
• Serialization in Big Data, Data Lake & Fast Data
• Protobuf and gRPC
• Summary
https://bit.ly/2zz0CV4
6. What is Serialization / Deserialization ?
Serialization is the process of turning structured in-memory objects into a byte stream, for transmission over a network or for writing to persistent storage.
Deserialization is the reverse process, turning a byte stream back into structured in-memory objects.
When selecting a data serialization format, the following characteristics should be evaluated:
• Schema support and Schema evolution
• Code generation
• Language support / Interoperability
• Transparent compression
• Splitability
• Support in Big Data / Fast Data Ecosystem
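As a minimal illustration (not from the talk), the following Java sketch round-trips an object through the JDK's built-in serialization; it works, but fails most of the criteria above (no schema evolution, Java-only, no splitability), which is exactly why formats such as Avro and Protobuf exist:

import java.io.*;

public class SerializationRoundTrip {

    // simplified stand-in for the Person sample used later in the talk
    static class Person implements Serializable {
        int id;
        String firstName;
        Person(int id, String firstName) { this.id = id; this.firstName = firstName; }
    }

    public static void main(String[] args) throws Exception {
        // serialize: structured in-memory object -> byte stream
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(new Person(1, "Peter"));
        }
        byte[] bytes = bos.toByteArray();

        // deserialize: byte stream -> structured in-memory object
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            Person copy = (Person) in.readObject();
            System.out.println(copy.firstName);
        }
    }
}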
7. Where do we need Serialization / Deserialization ?
[Diagram: serialize/deserialize happens at every hop — between service/client logic and REST APIs, when publishing/subscribing through an event broker, when writing to and reading from the data lake's raw and refined storage and its parallel processing, in integration/data flow pipelines, and in stream analytics reading from a streaming source and producing results.]
8. Sample Data Structure used in this presentation
Person (1.0)
• id : integer
• firstName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• dateOfBirth : date
• addresses : array<Address>
Address (1.0)
• streetAndNr : text
• zipAndCity : text
{
"id":"1",
"firstName":"Peter",
"lastName":"Sample",
"title":"mr",
"emailAddress":"peter.sample@somecorp.com",
"phoneNumber":"+41 79 345 34 44",
"faxNumber":"+41 31 322 33 22",
"dateOfBirth":"1995-11-10",
"addresses":[
{
"id":"1",
"streetAndNr":"Somestreet 10",
"zipAndCity":"9332 Somecity"
}
]
}
https://github.com/gschmutz/various-demos/tree/master/avro-vs-protobuf
10. Google Protocol Buffers
• https://developers.google.com/protocol-buffers/
• Protocol Buffers (Protobuf) is Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data
• like XML, but smaller, faster, and simpler
• A schema is needed to generate code and to read/write data
• Supports generated code in Java, Python, Objective-C, C++, Go, Ruby, and C#
• Two different versions: proto2 and proto3
• This presentation is based on proto3
• Latest version: 3.13.0
11. Apache Avro
• http://avro.apache.org/docs/current/
• Apache Avro™ is a compact, fast, binary data serialization system invented by the makers of Hadoop
• Avro relies on schemas: when data is read, the schema used when writing it is always present
• Provides an object container file format for storing persistent data
• Works both with code generation and in a dynamic manner (without code generation)
• Latest version: 1.10.0
15. Defining Schema - IDL
Person-v1.avdl:
@namespace("com.trivadis.avro.person.v1")
protocol PersonIdl {
  import idl "Address-v1.avdl";

  enum TitleEnum {
    Unknown, Mr, Ms, Mrs
  }

  record Person {
    int id;
    string firstName;
    string lastName;
    TitleEnum title;
    union { null, string } emailAddress;
    union { null, string } phoneNumber;
    union { null, string } faxNumber;
    date dateOfBirth;
    array<com.trivadis.avro.address.v1.Address> addresses;
  }
}

Address-v1.avdl:
@namespace("com.trivadis.avro.address.v1")
protocol AddressIdl {
  record Address {
    int id;
    string streetAndNr;
    string zipAndCity;
  }
}

Note: the JSON schema (.avsc) can be generated from the IDL using the Avro Tools.
https://avro.apache.org/docs/current/idl.html
16. Defining Schema - Specification
Protobuf:
• Multiple message types can be defined in a single .proto file
• Field numbers: each field in the message has a unique number
• used to identify the fields in the message binary format
• should not be changed once the message type is in use
• field numbers 1 – 15 use a single byte to encode, 16 – 2047 use two bytes
• Default values are type-specific
Avro:
• Schema can be represented either as JSON or using the IDL
• Avro specifies two serialization encodings: binary and JSON
• Encoding is done in the order of the fields defined in the record
• The schema used to write the data always needs to be available when the data is read
• The schema can either be serialized with the data or made available through a registry
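For comparison with the Avro IDL shown earlier, a hypothetical Person-v1.proto along these lines might look as follows; this is a sketch, not the exact file from the demo repository, and the Address import and package names are assumptions:

syntax = "proto3";
package com.trivadis.protobuf.person.v1;

import "Address-v1.proto";  // hypothetical: Address message defined analogously

enum TitleEnum {
  UNKNOWN = 0;  // proto3 enums must have a zero value, which is also the default
  MR = 1;
  MRS = 2;
  MS = 3;
}

message Person {
  int32 id = 1;              // field numbers 1 – 15 encode in a single byte
  string first_name = 2;
  string last_name = 3;
  TitleEnum title = 4;
  string email_address = 5;
  string phone_number = 6;
  string fax_number = 7;
  string date_of_birth = 8;  // proto3 has no built-in date type; ISO-8601 text assumed
  repeated Address addresses = 9;
}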
18. Defining Schema - Style Guides
Protobuf:
• Use CamelCase (with an initial capital) for message names
• Use underscore_separated_names for field names
• Use CamelCase (with an initial capital) for enum type names
• Use CAPITALS_WITH_UNDERSCORES for enum value names
• Use Java-style comments for documenting
Avro:
• Use CamelCase (with an initial capital) for record names
• Use camelCase for field names
• Use CamelCase (with an initial capital) for enum type names
• Use CAPITALS_WITH_UNDERSCORES for enum value names
• Use Java-style comments (IDL) or the doc property (JSON) for documenting
20. With Code Generation – Generate the code
Protobuf:
• Run the protocol buffer compiler: one compiler for all supported languages, produces classes for the given language
protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/person-v1.proto
Avro:
• Run the specific tool for the given language
• For Java:
java -jar /path/to/avro-tools-1.8.2.jar compile schema Person-v1.avsc .
• For C++:
avrogencpp -i cpx.json -o cpx.hh -n c
• For C#:
Microsoft.Hadoop.Avro.Tools codegen /i:C:\SDK\src\Microsoft.Hadoop.Avro.Tools\SampleJSON\SampleJSONSchema.avsc /o:
21. With Code Generation – Using Maven
Protobuf:
• Use the protobuf-maven-plugin to generate code during the Maven build
• Generates to target/generated-sources
• Scans all project dependencies for .proto files
• protoc has to be installed on the machine
Avro:
• Use the avro-maven-plugin to generate code during the Maven build
• Generates to target/generated-sources (see the configuration sketch below)
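As an illustration, a minimal avro-maven-plugin configuration might look like the following sketch; the source and output directories are assumptions and should be adjusted to the project layout:

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.10.0</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <!-- assumed locations: .avsc files in, generated Java classes out -->
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/target/generated-sources/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>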
22. Using Protobuf and Avro from Java
If you are using Maven, add the following dependencies to your POM:
Avro:
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.10.0</version>
</dependency>
Protobuf:
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>3.13.0</version>
</dependency>
24. With Code Generation – Serializing
Avro:
FileOutputStream fos = new FileOutputStream(BIN_FILE_NAME_V1);
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<Person> writer = new SpecificDatumWriter<Person>(Person.getClassSchema());
// keep a reference to the encoder so it can be flushed after writing
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(person, encoder);
encoder.flush();
out.close();
byte[] serializedBytes = out.toByteArray();
fos.write(serializedBytes);
Protobuf:
FileOutputStream output = new FileOutputStream(BIN_FILE_NAME_V2);
person.writeTo(output);
25. With Code Generation – Deserializing
Avro:
DatumReader<Person> datumReader = new SpecificDatumReader<Person>(Person.class);
byte[] bytes = Files.readAllBytes(new File(BIN_FILE_NAME_V1).toPath());
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
Person person = datumReader.read(null, decoder);
System.out.println(person.getFirstName());
Protobuf:
PersonWrapper.Person person = PersonWrapper.Person.parseFrom(new FileInputStream(BIN_FILE_NAME_V1));
System.out.println(person.getFirstName());
26. Encoding
Protobuf:
• Field numbers (tags) are used as keys
• Variable-length encoding for int32 and int64
• plus zig-zag encoding for sint32 and sint64
Avro:
• Data is serialized in the field order of the schema
• Variable-length, zig-zag encoding for int and long; fixed length for float and double
Variable-length encoding: a method of serializing integers using one or more bytes, so that small values need fewer bytes
Zig-zag encoding: maps signed to unsigned integers so that numbers with a small absolute value (including negative ones) get small encodings
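A small Java sketch (not from the talk) of how zig-zag encoding works, matching the formulas in the Protobuf and Avro encoding specifications:

public class ZigZag {
    // zig-zag encode a 32-bit signed int: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static int encode(int n) {
        return (n << 1) ^ (n >> 31);   // arithmetic shift propagates the sign bit
    }

    static int decode(int n) {
        return (n >>> 1) ^ -(n & 1);   // logical shift, then restore the sign
    }

    public static void main(String[] args) {
        for (int n : new int[]{0, -1, 1, -2, 2, -64}) {
            System.out.println(n + " -> " + encode(n));  // small magnitudes stay small
        }
    }
}

Because the variable-length (varint) encoding stores small unsigned values in fewer bytes, zig-zag keeps small negative numbers cheap to encode.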
27. Without Code Generation – Avro GenericRecord
final String schemaLoc = "src/main/avro/Person-v1.avsc";
final File schemaFile = new File(schemaLoc);
final Schema schema = new Schema.Parser().parse(schemaFile);
GenericRecord person1 = new GenericData.Record(schema);
person1.put("id", 1);
person1.put("firstName", "Peter");
person1.put("lastName", "Muster");
person1.put("title", "Mr");
person1.put("emailAddress", "peter.muster@somecorp.com");
person1.put("phoneNumber", "+41 79 345 34 44");
person1.put("faxNumber", "+41 31 322 33 22");
person1.put("dateOfBirth", new LocalDate("1995-11-10"));
28. Serializing to an Object Container File
• The file carries the schema; all objects stored in the file must conform to that schema
• Objects are stored in blocks that may be compressed
final DatumWriter<Person> datumWriter = new SpecificDatumWriter<>(Person.class);
final DataFileWriter<Person> dataFileWriter = new DataFileWriter<>(datumWriter);
// use snappy compression (must be set before create)
dataFileWriter.setCodec(CodecFactory.snappyCodec());
// specify the block size (must be set before create)
dataFileWriter.setSyncInterval(1000);
dataFileWriter.create(persons.get(0).getSchema(), new File(CONTAINER_FILE_NAME_V1));
for (Person person : persons) {
    dataFileWriter.append(person);
}
dataFileWriter.close();
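Reading the container file back does not require the schema up front, since it is embedded in the file; a minimal sketch (not on the original slide):

// the reader picks up the schema stored in the container file itself
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
try (DataFileReader<GenericRecord> dataFileReader =
        new DataFileReader<>(new File(CONTAINER_FILE_NAME_V1), datumReader)) {
    for (GenericRecord person : dataFileReader) {
        System.out.println(person.get("firstName"));
    }
}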
30. Schema Evolution
Person (1.0)
• id : integer
• firstName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• dateOfBirth : date
• addresses : array<Address>
Address (1.0)
• streetAndNr : text
• zipAndCity : text
Person (1.1)
• id : integer
• firstName : text
• middleName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• dateOfBirth : date
• addresses : array<Address>
Address (1.0)
• streetAndNr : text
• zipAndCity : text
Changes V1.0 to V1.1:
• Added middleName
• Removed faxNumber
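Avro resolves such changes at read time by comparing the writer's schema with the reader's schema. A minimal sketch (not on the original slide), assuming schemaV10 and schemaV11 are the parsed 1.0 and 1.1 schemas and that middleName was declared with a null default so that old data remains readable:

// read bytes written with schema 1.0 using the evolved schema 1.1;
// Avro matches fields by name, fills added fields from their defaults,
// and skips fields that were removed
DatumReader<GenericRecord> reader = new GenericDatumReader<>(schemaV10, schemaV11);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytesWrittenWithV10, null);
GenericRecord person = reader.read(null, decoder);
System.out.println(person.get("middleName")); // null, the declared default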
38. Avro and Kafka – Producing Avro to Kafka
@Configuration
public class KafkaConfig {
private String bootstrapServers;
private String schemaRegistryURL;
@Bean
public Map<String, Object> producerConfigs() {
Map<String, Object> props = new HashMap<>();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
props.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL);
return props;
}
@Bean
public ProducerFactory<String, Person> producerFactory() { .. }
@Bean
public KafkaTemplate<String, Person> kafkaTemplate() {
return new KafkaTemplate<>(producerFactory());
}
@Component
public class PersonEventProducer {
@Autowired
private KafkaTemplate<String, Person> kafkaTemplate;
@Value("${kafka.topic.person}")
String kafkaTopic;
public void produce(Person person) {
kafkaTemplate.send(kafkaTopic, person.getId().toString(), person);
}
}
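On the consuming side, the Confluent Avro deserializer plays the same role; a minimal consumer configuration sketch (not shown in the talk), reusing the bootstrapServers and schemaRegistryURL fields from the KafkaConfig above:

@Bean
public Map<String, Object> consumerConfigs() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
    props.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL);
    // deserialize into the generated Person class rather than GenericRecord
    props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
    return props;
}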
39. Avro and Big Data
• Avro is widely supported by Big Data frameworks: Hadoop MapReduce, Pig, Hive, Sqoop, Apache Spark, …
• The spark-avro module for Apache Spark supports using Avro as a source for DataFrames:
https://spark.apache.org/docs/latest/sql-data-sources-avro.html
val personDF = spark.read.format("avro").load("person-v1.avro")
personDF.createOrReplaceTempView("person")
val subPersonDF = spark.sql("select * from person where firstName like 'G%'")
libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.0.1"
40. There is more! Column-oriented: Apache Parquet and ORC
A logical table can be translated using either
• a row-based layout (Avro, Protobuf, JSON, …)
• a column-oriented layout (Parquet, ORC, …)
Apache Parquet
• a collaboration between Twitter and Cloudera
• Support in Hadoop, Hive, Spark, Apache NiFi, StreamSets, Apache Pig, …
Apache ORC
• was created by Facebook and Hortonworks
• Support in Hadoop, Hive, Spark, Apache NiFi, Apache Pig, Presto, …
Logical table:
A  B  C
A1 B1 C1
A2 B2 C2
A3 B3 C3
Row-based layout: A1 B1 C1 A2 B2 C2 A3 B3 C3
Column-oriented layout: A1 A2 A3 B1 B2 B3 C1 C2 C3
41. Parquet and Big Data
• Parquet is widely supported by Big Data frameworks: Hadoop MapReduce, Pig, Hive, Sqoop, Apache Spark, …
• Apache Spark supports Parquet natively as a source for DataFrames (it is Spark's default data source), so no extra dependency is needed:
val personDF = spark.read.parquet("person-v1.parquet")
personDF.createOrReplaceTempView("person")
val subPersonDF = spark.sql("select * from person where firstName like 'G%'")
42. Delta Lake - http://delta.io
• Delta Lake is an open-source storage layer that brings reliability to data lakes
• Initially part of the Databricks platform, now open-sourced
• Delta Lake provides
• Fully compatible with Apache Spark
• ACID transactions
• Update and Delete on Big Data Storage
• Schema enforcement
• Time Travel (Data versioning)
• Scalable metadata handling
• Open Format (Parquet)
• Unified streaming and batch data processing
• Schema Evolution
• Audit History
• Integration with Presto/Athena/Hive/Amazon Redshift/Snowflake for read
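Following the Spark examples above, a minimal sketch of writing and reading a Delta table (not on the original slide; requires the delta-core library on the classpath, and the path is hypothetical):

// write a DataFrame (e.g. personDF from above) as a Delta table:
// Parquet files plus a transaction log
personDF.write.format("delta").save("/data/person-delta")

// read it back; versionAsOf enables time travel to an earlier snapshot
val personDeltaDF = spark.read.format("delta").load("/data/person-delta")
val personV0DF = spark.read.format("delta").option("versionAsOf", 0).load("/data/person-delta")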
44. Other Data Lake Storage Layers
Apache Hudi
• https://hudi.apache.org/
• Ingests & manages storage of large analytical datasets over DFS
Apache Iceberg
• https://iceberg.apache.org
• Open table format for huge analytic datasets
• Adds tables to Presto and Spark that use a high-performance format
https://medium.com/@domisj/comparison-of-big-data-storage-layers-delta-vs-apache-hudi-vs-apache-iceberg-part-1-200599645a02
46. Protobuf and gRPC
• https://grpc.io/
• Google's high-performance, open-source, universal RPC framework
• layers on top of HTTP/2 and uses Protocol Buffers to define messages and services
• Support for Java, C#, C++, Python, Go, Ruby, Node.js, Objective-C, …
Source: https://thenewstack.io
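As an illustration of how gRPC builds on Protobuf, a hypothetical service definition for the Person sample might look like this (a sketch, not from the demo repository; the request messages are invented for the example):

syntax = "proto3";
package com.trivadis.protobuf.person.v1;

// request messages; the service returns the Person message defined earlier
message GetPersonRequest {
  int32 id = 1;
}

message ListPersonsRequest {}

service PersonService {
  rpc GetPerson (GetPersonRequest) returns (Person);            // unary RPC
  rpc ListPersons (ListPersonsRequest) returns (stream Person); // server streaming
}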
48. Serialization / Deserialization
[Recap diagram, same as slide 7: serialize/deserialize happens at every hop — between service/client logic and REST APIs, when publishing/subscribing through an event broker, when writing to and reading from raw and refined data lake storage and parallel processing, in integration/data flow pipelines, and in stream analytics reading from a streaming source and producing results.]
49. You are welcome to join us at the Expo area. We're looking forward to meeting you.
Link to the Expo area: https://www.vinivia-event-manager.io/e/DOAG/portal/expo/29731