Cobrix – A COBOL data
source for Spark
Ruslan Iushchenko (ABSA), Felipe Melo (ABSA)
• Who we are
• Motivation
• Mainframe files and copybooks
• Loading simple files
• Loading hierarchical databases
• Performance and results
• Cobrix in ABSA Big Data space
Outline
2/40
About us
• ABSA is a Pan-African financial services provider
  - With Apache Spark at the core of its data engineering
• We fill gaps in the Hadoop ecosystem when we find them
• Contributions to Apache Spark
• Spark-related open-source projects (https://github.com/AbsaOSS)
  - Spline - a data lineage tracking and visualization tool
  - ABRiS - Avro SerDe for structured APIs
  - Atum - a data quality library for Spark
  - Enceladus - a dynamic data conformance engine
  - Cobrix - a COBOL library for Spark (the focus of this presentation)
3/40
The market for Mainframes is strong, with no signs of cooling down.
Mainframes
Are used by 71% of Fortune 500 companies
Are responsible for 87% of all credit card transactions in the world
Are part of the IT infrastructure of 92 out of the 100 biggest banks in the world
Handle 68% of the world’s production IT workloads, while accounting for only 6%
of IT costs.
For companies relying on Mainframes, becoming data-centric can be
prohibitively expensive
High cost of hardware
Expensive business model for data science related activities
Source: http://blog.syncsort.com/2018/06/mainframe/9-mainframe-statistics/
Business Motivation
4/40
Technical Motivation
• The process above takes 11 days for a 600 GB file
• Legacy data models (hierarchical)
• Need for performance, scalability, flexibility, etc.
• SPOILER alert: we brought it down to 1.1 hours
5/40
[Diagram] Legacy ETL: 1. Extract (Mainframes → PC via proprietary tools, fixed-length text files), 2.-3. Transform (to CSV), 4. Load (to HDFS)
• Run analytics / Spark on mainframes
• Message Brokers (e.g. MQ)
• Sqoop
• Proprietary solutions
• But ...
• Pricey
• Slow
• Complex (especially for legacy systems)
• Require human involvement
What can you do?
6/40
How Cobrix can help
• Decreasing human involvement
• Simplifying the manipulation of hierarchical structures
• Providing scalability
• Open-source
7/40
[Diagram] Mainframe file (EBCDIC) + Schema (copybook) → Apache Spark Application: Cobrix → df → transformations → df → Writer → Output (Parquet, JSON, CSV…)
Cobrix – a custom Spark data source
8/40
A copybook is a schema definition
A data file is a collection of binary records
Name: █ █ █ █ Age: █ █
Company: █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █ █
Zip: █ █ █ █ █
Name: J O H N Age: 3 2
Company: F O O . C O M
Phone #: + 2 3 1 1 - 3 2 7
Zip: 1 2 0 0 0
A * N J O H N G A 3 2 S H
K D K S I A S S A S K A S
A L , S D F O O . C O M X
L Q O K ( G A } S N B W E
S < N J X I C W L D H J P
A S B C + 2 3 1 1 - 3 2 7
C = D 1 2 0 0 0 F H 0 D .
9/40
Similar to IDLs of Avro, Thrift, Protocol Buffers, etc.

Thrift:
struct Company {
  1: required i64 id,
  2: required string name,
  3: optional list<string> contactPeople
}

Protocol Buffers:
message Company {
  required int64 id = 1;
  required string name = 2;
  repeated string contact_people = 3;
}

COBOL:
10 COMPANY.
   15 ID             PIC 9(12) COMP.
   15 NAME           PIC X(40).
   15 CONTACT-PEOPLE PIC X(20)
                     OCCURS 10.

record Company {
  int64 id;
  string name;
  array<string> contactPeople;
}
10/40
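As a rough illustration of the mapping (an assumption; the exact Spark types depend on Cobrix options and the copybook), the COMPANY copybook above would surface in Spark roughly as:

import org.apache.spark.sql.types._

// Hypothetical mapping: PIC 9(12) COMP -> an integral type,
// PIC X(n) -> string, OCCURS 10 -> an array field
val companySchema = StructType(Seq(
  StructField("ID", LongType),
  StructField("NAME", StringType),
  StructField("CONTACT_PEOPLE", ArrayType(StringType))))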
val df = spark
.read
.format("cobol")
.option("copybook", "data/example.cob")
.load("data/example")
01 RECORD.
05 COMPANY-ID PIC 9(10).
05 COMPANY-NAME PIC X(40).
05 ADDRESS PIC X(60).
05 REG-NUM PIC X(8).
05 ZIP PIC X(6).
A * N J O H N G A 3 2 S H K D K S I
A S S A S K A S A L , S D F O O . C
O M X L Q O K ( G A } S N B W E S <
N J X I C W L D H J P A S B C + 2 3
COMPANY_ID COMPANY_NAME ADDRESS REG_NUM ZIP
100 ABCD Ltd. 10 Garden st. 8791237 03120
101 ZjkLPj 11 Park ave. 1233971 23111
102 Robotrd Inc. 12 Forest st. 0382979 12000
103 Xingzhoug 8 Mountst. 2389012 31222
104 Example.co 123 Tech str. 3129001 19000
Loading Mainframe Data
11/40
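Once loaded, the DataFrame behaves like any other; a small usage sketch (the printed types are indicative only):

df.printSchema()
// root
//  |-- COMPANY_ID: numeric (from PIC 9(10))
//  |-- COMPANY_NAME: string
//  |-- ADDRESS: string
//  |-- REG_NUM: string
//  |-- ZIP: string
df.show(5, truncate = false)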
val spark = SparkSession.builder().appName("Example").getOrCreate()
val df = spark
.read
.format("cobol")
.option("copybook", "data/example.cob")
.load("data/example")
// ...Business logic goes here...
df.write.parquet("data/output")
This App is
● Distributed
● Scalable
● Resilient
EBCDIC to Parquet examples
12/40
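To try this end to end, the Cobrix data source only needs to be on the Spark classpath, for example via --packages (artifact coordinates and version are assumptions; check the project README for the current ones):

spark-shell --packages za.co.absa.cobrix:spark-cobol_2.11:<version>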
A * N J O H N G A 3 2 S H K D K S I A
S S A S K A S A L , S D F O O . C O M
X L Q O K ( G A } S N B W E S < N J X
I C W L D H J P A S B C + 2 3 1 1 - 3
2 7 C = D 1 2 0 0 0 F H 0 D . A * N J
O H N G A 3 2 S H K D K S I A S S A S
K A S A L , S D F O O . C O M X L Q O
K ( G A } S N B W E S < N J X I C W L
D H J P A S B C + 2 B W E S < N J X P
FIRST-NAME: █ █ █ █ █ █   LAST-NAME: █ █ █ █ █   COMPANY-NAME: █ █ █ █ █ █ █ █ █ █ █ █ █ █
• Redefined fields AKA
• Unchecked unions
• Untagged unions
• Variant type fields
• Several fields occupy the same space
01 RECORD.
05 IS-COMPANY PIC 9(1).
05 COMPANY.
10 COMPANY-NAME PIC X(40).
05 PERSON REDEFINES COMPANY.
10 FIRST-NAME PIC X(20).
10 LAST-NAME PIC X(20).
05 ADDRESS PIC X(50).
05 ZIP PIC X(6).
Redefined Fields
13/40
01 RECORD.
05 IS-COMPANY PIC 9(1).
05 COMPANY.
10 COMPANY-NAME PIC X(40).
05 PERSON REDEFINES COMPANY.
10 FIRST-NAME PIC X(20).
10 LAST-NAME PIC X(20).
05 ADDRESS PIC X(50).
05 ZIP PIC X(6).
• Cobrix applies all redefines for each record
• Some fields can clash
• It's up to the user to apply business logic to separate correct from wrong data
IS_COMPANY  COMPANY                                PERSON                                               ADDRESS               ZIP
1           {"COMPANY_NAME": "September Ltd."}     {"FIRST_NAME": "Septem", "LAST_NAME": "ber Ltd."}    74 Lawn ave., Denver  39023
0           {"COMPANY_NAME": "Beatrice Gagliano"}  {"FIRST_NAME": "Beatrice", "LAST_NAME": "Gagliano"}  10 Garden str.        33113
1           {"COMPANY_NAME": "January Inc."}       {"FIRST_NAME": "Januar", "LAST_NAME": "y Inc."}      122/1 Park ave.       31234
Redefined Fields
14/40
df.select($"IS_COMPANY",
when($"IS_COMPANY" === true, "COMPANY_NAME")
.otherwise(null).as("COMPANY_NAME"),
when($"IS_COMPANY" === false, "CONTACTS")
.otherwise(null).as("FIRST_NAME")),
...
IS_COMPANY COMPANY_NAME FIRST_NAME LAST_NAME ADDRESS ZIP
1 September Ltd. 74 Lawn ave., Denver 39023
0 Beatrice Gagliano 10 Garden str. 33113
1 January Inc. 122/1 Park ave. 31234
Clean Up Redefined Fields + flatten structs
15/40
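A minimal sketch of the struct flattening mentioned in the slide title, assuming the column names above (imports of spark.implicits._ and org.apache.spark.sql.functions._ are implied, as in the other snippets):

val dfFlat = df.select(
  $"IS_COMPANY",
  when($"IS_COMPANY" === 1, $"COMPANY.COMPANY_NAME").as("COMPANY_NAME"),
  when($"IS_COMPANY" === 0, $"PERSON.FIRST_NAME").as("FIRST_NAME"),
  when($"IS_COMPANY" === 0, $"PERSON.LAST_NAME").as("LAST_NAME"),
  $"ADDRESS",
  $"ZIP")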
Hierarchical DBs
• Several record types
• AKA segments
• Each segment type has its
own schema
• Parent-child relationships
between segments
COMPANY
ID: █ █ █ █ █ █ █ █ █
Name: █ █ █ █ █ █ █ █ █ █ █ █
Address: █ █ █ █ █ █ █ █ █
CONTACT-PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
Root segment
Child segment
CONTACT-PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
…
Child segment
16/40
• The combined copybook has to contain all the segments as redefined
fields:
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY-ID PIC X(10).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
common data
segment 1
segment 2
COMPANY
Name: █ █ █ █ █ █ █ █ █ █ █
Address: █ █ █ █ █ █ █ █ █
Reg-Num: █ █ █ █ █ █
CONTACT
Phone #: █ █ █ █ █ █ █
Contact Person: █ █ █ █
█ █ █ █ █ █ █ █ █ █ █
Defining a Copybook
17/40
• The code snippet for reading the data:
val df = spark
.read
.format("cobol")
.option("copybook", "/path/to/copybook.cpy")
.option("is_record_sequence", "true")
.load("examples/multisegment_data")
Reading all the segments
18/40
• The dataset for the whole copybook:
• Invalid redefines are highlighted
SEGMENT_ID COMPANY_ID COMPANY CONTACT
C 1005918818 [ ABCD Ltd. ] [ invalid ]
P 1005918818 [ invalid ] [ Cliff Wallingford ]
C 1036146222 [ DEFG Ltd. ] [ invalid ]
P 1036146222 [ invalid ] [ Beatrice Gagliano ]
C 1045855294 [ Robotrd Inc. ] [ invalid ]
P 1045855294 [ invalid ] [ Doretha Wallingford ]
P 1045855294 [ invalid ] [ Deshawn Benally ]
P 1045855294 [ invalid ] [ Willis Tumlin ]
C 1057751949 [ Xingzhoug ] [ invalid ]
P 1057751949 [ invalid ] [ Mindy Boettcher ]
Reading all the segments
19/40
[Diagram] Raw EBCDIC bytes of the data file (as on the previous slides)
COMPANY
ID: 2 8 0 0 0 3 9 4 1
Name: E x a m p l e . c o m █
Address: 1 0 █ G a r d e n
PERSON
Name: J A N E █ █ █ █
R O B E R T S █ █
Phone #: + 9 3 5 2 8 0
PERSON
Name: J A N E █ █ █ █
R O B E R T S █ █
Phone #: + 9 3 5 2 8 0
COMPANY
ID: 2 8 0 0 0 3 9 4 1
Name: E x a m p l e . c o m █
Address: 1 0 █ G a r d e n
COMPANY
ID: 2 8 0 0 0 3 9 4 1
Name: E x a m p l e . c o m █
Address: 1 0 █ G a r d e n
Id Name Address Reg_Num
100 Example.com 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 ABCD Ltd. 123 Tech str. 3129001
Company_Id Contact_Person Phone_Number
100 Jane +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Separate segments by dataframes
20/40
• Filter segment #1 (companies)
val dfCompanies = df
  .filter($"SEGMENT_ID" === "C")
  .select($"COMPANY_ID",
    $"COMPANY.NAME".as("COMPANY_NAME"),
    $"COMPANY.ADDRESS",
    $"COMPANY.REG_NUM")
Company_Id Company_Name Address Reg_Num
100 ABCD Ltd. 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 Example.co 123 Tech str. 3129001
Reading root segments
21/40
• Filter segment #2 (people)
val dfContacts = df
.filter($"SEGMENT_ID"==="P")
.select($"COMPANY_ID",
$"CONTACT.CONTACT_PERSON",
$"CONTACT.PHONE_NUMBER")
Company_Id Contact_Person Phone_Number
100 Marry +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Reading child segments
22/40
Company_Id Company_Name Address Reg_Num
100 ABCD Ltd. 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 Example.co 123 Tech str. 3129001
Company_Id Contact_Person Phone_Number
100 Marry +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Company_Id Company_Name Address Reg_Num Contact_Person Phone_Number
100 ABCD Ltd. 10 Garden st. 8791237 Marry +32186331
100 ABCD Ltd. 10 Garden st. 8791237 Colyn +23769123
102 Robotrd Inc. 12 Forest st. 0382979 Robert +12389679
102 Robotrd Inc. 12 Forest st. 0382979 Teresa +32187912
102 Robotrd Inc. 12 Forest st. 0382979 Laura +42198723
The two segments can now be joined by Company_Id
23/40
• Joining segments 1 and 2
val dfJoined =
  dfCompanies.join(dfContacts, Seq("COMPANY_ID"))
Results:
Company_Id Company_Name Address Reg_Num Contact_Person Phone_Number
100 ABCD Ltd. 10 Garden st. 8791237 Marry +32186331
100 ABCD Ltd. 10 Garden st. 8791237 Colyn +23769123
102 Robotrd Inc. 12 Forest st. 0382979 Robert +12389679
102 Robotrd Inc. 12 Forest st. 0382979 Teresa +32187912
102 Robotrd Inc. 12 Forest st. 0382979 Laura +42198723
Joining in Spark is easy
24/40
• The joined table can also be denormalized for document storage
val dfCombined =
dfJoined
.groupBy($"COMPANY_ID",
$"COMPANY_NAME",
$"ADDRESS",
$"REG_NUM")
.agg(
collect_list(
struct($"CONTACT_PERSON",
$"PHONE_NUMBER"))
.as("CONTACTS"))
{
"COMPANY_ID": "8216281722",
"COMPANY_NAME": "ABCD Ltd.",
"ADDRESS": "74 Lawn ave., New York",
"REG_NUM": "33718594",
"CONTACTS": [
{
"CONTACT_PERSON": "Cassey Norgard",
"PHONE_NUMBER": "+(595) 641 62 32"
},
{
"CONTACT_PERSON": "Verdie Deveau",
"PHONE_NUMBER": "+(721) 636 72 35"
},
{
"CONTACT_PERSON": "Otelia Batman",
"PHONE_NUMBER": "+(813) 342 66 28"
}
]
}
Denormalize data
25/40
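The denormalized DataFrame can then be written out directly; a usage sketch (the output path is just an example):

dfCombined
  .write
  .mode("overwrite")
  .json("data/companies_json")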
Restore parent-child relationships
• In our example we had a COMPANY_ID field that is present in all segments
• In real copybooks this is often not the case
• What can we do?
COMPANY
ID: █ █ █ █ █ █ █ █ █
Name: █ █ █ █ █ █ █ █ █ █ █ █
Address: █ █ █ █ █ █ █ █ █
CONTACT-PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
Root segment
Child segment
CONTACT-PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
…
Child segment
26/40
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
• If COMPANY_ID is not part of all segments, Cobrix can generate it for you
val df = spark
.read
.format("cobol")
.option("copybook", "/path/to/copybook.cpy")
.option("is_record_sequence", "true")
.option("segment_field", "SEGMENT-ID")
.option("segment_id_level0", "C")
.option("segment_id_prefix", "ID")
.load("examples/multisegment_data")
No COMPANY-ID
Id Generation
27/40
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
• Seg0_Id can be used to restore the parent-child relationships between segments
No COMPANY-ID
SEGMENT_ID Seg0_Id COMPANY CONTACT
C ID_0_0 [ ABCD Ltd. ] [ invalid ]
P ID_0_0 [ invalid ] [ Cliff Wallingford ]
C ID_0_2 [ DEFG Ltd. ] [ invalid ]
P ID_0_2 [ invalid ] [ Beatrice Gagliano ]
C ID_0_4 [ Robotrd Inc. ] [ invalid ]
P ID_0_4 [ invalid ] [ Doretha Wallingford ]
P ID_0_4 [ invalid ] [ Deshawn Benally ]
Id Generation
28/40
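With the generated Seg0_Id the earlier root/child join can be repeated even without a natural key; a sketch assuming the column names shown above:

val dfRoots = df.filter($"SEGMENT_ID" === "C")
  .select($"Seg0_Id", $"COMPANY.NAME".as("COMPANY_NAME"), $"COMPANY.ADDRESS")

val dfChildren = df.filter($"SEGMENT_ID" === "P")
  .select($"Seg0_Id", $"CONTACT.CONTACT_PERSON", $"CONTACT.PHONE_NUMBER")

val dfJoined = dfRoots.join(dfChildren, Seq("Seg0_Id"))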
• When transferred from a mainframe, a hierarchical database becomes a sequence of records
• To read the next record, the previous record must be read first
• A sequential format by its nature
• How to make it scalable?
A * N J O H N G A 3 2 S H K D K S I A S S A S K A S A L , S D F O O . C O M X L Q O K ( G A } S N B W E S < N J
X I C W L D H J P A S B C + 2 3 1 1 - 3 2 7 C = D 1 2 0 0 0 F H 0 D . K A I O D A P D F C J S C D C D C W E P 1
9 1 2 3 – 3 1 2 2 1 . 3 1 F A D F L 1 7
COMPANY
ID: █ █ █ █ █ █ █ █ █
Name: █ █ █ █ █ █ █ █ █ █ █ █
Address: █ █ █ █ █ █ █ █ █
COMPANY
ID: █ █ █ █ █ █ █ █ █
Name: █ █ █ █ █ █ █ █ █ █ █ █
Address: █ █ █ █ █ █ █ █ █
PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
A data file
PERSON
Name: J A N E █ █ █ █
R O B E R T S █ █
Phone #: + 9 3 5 2 8 0
PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
COMPANY
ID: 2 8 0 0 0 3 9 4 1
Name: E x a m p l e . c o m █
Address: 1 0 █ G a r d e n
Variable Length Records (VLR)
29/40
Performance challenge of VLRs
• Naturally sequential files
• To read the next record, the prior record needs to be read first
• Each record has a length field
• It acts as a pointer to the next record
• No record delimiter when reading a file from the middle
VLR structure
30/40
[Chart] Throughput, variable record length (MB/s vs number of Spark cores): sequential processing at ~10 MB/s
Spark on HDFS
[Diagram] Each Data Node holds HDFS blocks that map to partitions processed by a co-located Spark Executor; the HDFS Namenode and the Spark Driver coordinate the job.
31/40
[Diagram] HDFS (Data nodes 1-3) → Cobrix → Spark cluster
1 - Read headers: List(offsets, lengths)
2 - Parallelize the offsets and lengths
3 - Parse records in parallel from the parallelized offsets and lengths
Parallelizing Sequential Reads
32/40
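A simplified sketch of the pattern described above (not the actual Cobrix internals; readLengthAt, readBytes and parseRecord are hypothetical helpers): scan the headers once to build a list of record pointers, then let Spark parse the pointed-to records in parallel.

case class RecordPointer(offset: Long, length: Int)

// Sequential pass: follow the length fields to collect (offset, length) pairs.
// Assumes every record reports a positive length.
def buildIndex(readLengthAt: Long => Int, fileSize: Long): Seq[RecordPointer] = {
  val index = scala.collection.mutable.ArrayBuffer.empty[RecordPointer]
  var offset = 0L
  while (offset < fileSize) {
    val len = readLengthAt(offset)
    index += RecordPointer(offset, len)
    offset += len
  }
  index.toSeq
}

// Parallel pass: each task reads and parses only its own byte range.
// val pointers = buildIndex(readLengthAt, fileSize)
// val records  = spark.sparkContext
//   .parallelize(pointers)
//   .map(p => parseRecord(readBytes(p.offset, p.length)))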
[Diagram] HDFS (Data nodes 1-3 + Namenode), Cobrix, and the Spark cluster
1 - Read VLR 1 header: offset = 3000, length = 100
2 - Ask the Namenode: where is offset 3000 until 3100?
3 - Check: nodes 3, 18 and 41
4 - Load VLR 1: preferred location is node 3
5 - Launch task on an executor hosted on node 3
Enabling Data Locality for VLRs
33/40
Throughput when sparse indexes are used
• Experiments were run on our lab cluster
• 4 nodes
• 380 cores
• 10 Gbit network
• Scalable – for bigger files, throughput grows when more executors are used
34/40
[Chart] Throughput, variable record length (MB/s vs number of Spark cores) for 10 GB, 20 GB and 40 GB files, compared to sequential processing
Comparison versus fixed length record performance
● Distribution and locality are handled completely by Spark
● Parallelism is achieved using sparse indexes
35/40
[Chart] Throughput, fixed length records (MB/s vs number of Spark cores): 40 GB file at ~150 MB/s
[Chart] Throughput, variable record length (MB/s vs number of Spark cores): 40 GB file at ~145 MB/s vs ~10 MB/s sequential
Cobrix in ABSA Data Infrastructure - Batch
Mainframes
180+
sources
HDFS
Enceladus
Cobrix
Spark
Spline
2. Parse
3. Conform
4. Track Lineage
5. Re-ingest
6. Consume
36/40
Cobrix in ABSA Data Infrastructure - Stream
Mainframes
180+
sources
Cobrix
Parser ABRiS
Enceladus
Kafka
Spline
1.Ingest 2.Parse 3.Avro
4.Conform
5. Track Lineage
6. Consume
37/40
Cobrix in ABSA Data Infrastructure
Mainframes
180+
sources
HDFS
Cobrix
Parser ABRiS
Enceladus
Kafka
Cobrix
Spark
Spline
1.Ingest
2. Parse
3. Conform
4. Track Lineage
5. Re-ingest
6. Consume
2.Parse 3.Avro
4.Conform
5. Track Lineage
6. Consume
Batch
Stream
38/40
● Thanks to the following people, who made this project possible and helped all along the way:
○ Andrew Baker, Francois Cillers, Adam Smyczek,
Jan Scherbaum, Peter Moon, Clifford Lategan,
Rekha Gorantla, Mohit Suryavanshi, Niel Steyn
• Thanks to the authors of the original COBOL parser:
○ Ian De Beer, Rikus de Milander
(https://github.com/zenaptix-lab/copybookStreams)
Acknowledgment
39/40
• Combine expertise to make accessing mainframe data in Hadoop seamless
• Our goal is to support the widest range of use cases possible
• Report a bug!
• Request a new feature!
• Create a pull request!
Our home: https://github.com/AbsaOSS/cobrix
Your Contribution is Welcome
40/40

Mais conteúdo relacionado

Mais procurados

Introduction to Apache Camel
Introduction to Apache CamelIntroduction to Apache Camel
Introduction to Apache CamelClaus Ibsen
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonTimothy Spann
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to CassandraGokhan Atil
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connectconfluent
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case StudiesEvan Liu
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureKai Wähner
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
The Art and Science of DDS Data Modelling
The Art and Science of DDS Data ModellingThe Art and Science of DDS Data Modelling
The Art and Science of DDS Data ModellingAngelo Corsaro
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for DatabricksDatabricks
 
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache KafkaKafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafkaconfluent
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Redis + Kafka = Performance at Scale | Julien Ruaux, Redis Labs
Redis + Kafka = Performance at Scale | Julien Ruaux, Redis LabsRedis + Kafka = Performance at Scale | Julien Ruaux, Redis Labs
Redis + Kafka = Performance at Scale | Julien Ruaux, Redis LabsHostedbyConfluent
 

Mais procurados (20)

Introduction to Apache Camel
Introduction to Apache CamelIntroduction to Apache Camel
Introduction to Apache Camel
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with Python
 
Building Data Lakes with AWS
Building Data Lakes with AWSBuilding Data Lakes with AWS
Building Data Lakes with AWS
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
The Art and Science of DDS Data Modelling
The Art and Science of DDS Data ModellingThe Art and Science of DDS Data Modelling
The Art and Science of DDS Data Modelling
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for Databricks
 
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache KafkaKafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Redis + Kafka = Performance at Scale | Julien Ruaux, Redis Labs
Redis + Kafka = Performance at Scale | Julien Ruaux, Redis LabsRedis + Kafka = Performance at Scale | Julien Ruaux, Redis Labs
Redis + Kafka = Performance at Scale | Julien Ruaux, Redis Labs
 

Semelhante a Cobrix – A COBOL data source for Spark

Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)Ed Thewlis
 
BSSML16 L6. Basic Data Transformations
BSSML16 L6. Basic Data TransformationsBSSML16 L6. Basic Data Transformations
BSSML16 L6. Basic Data TransformationsBigML, Inc
 
Working with Complex Types in DataFrames: Optics to the Rescue
Working with Complex Types in DataFrames: Optics to the RescueWorking with Complex Types in DataFrames: Optics to the Rescue
Working with Complex Types in DataFrames: Optics to the RescueDatabricks
 
Online | MongoDB Atlas on GCP Workshop
Online | MongoDB Atlas on GCP Workshop Online | MongoDB Atlas on GCP Workshop
Online | MongoDB Atlas on GCP Workshop Natasha Wilson
 
Lighting talk neo4j fosdem 2011
Lighting talk neo4j fosdem 2011Lighting talk neo4j fosdem 2011
Lighting talk neo4j fosdem 2011Jordi Valverde
 
SQL DDL: tricks and tips (JProf#27, Minsk, 24th September)
SQL DDL: tricks and tips (JProf#27, Minsk, 24th September)SQL DDL: tricks and tips (JProf#27, Minsk, 24th September)
SQL DDL: tricks and tips (JProf#27, Minsk, 24th September)Mikalai Sitsko
 
When to no sql and when to know sql javaone
When to no sql and when to know sql   javaoneWhen to no sql and when to know sql   javaone
When to no sql and when to know sql javaoneSimon Elliston Ball
 
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docx
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docxL1 Intro to Relational DBMS LP.pdfIntro to Relational .docx
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docxDIPESH30
 
SQL Database Design & Querying
SQL Database Design & QueryingSQL Database Design & Querying
SQL Database Design & QueryingCobain Schofield
 
Advanced app building with PowerApps expressions and rules
Advanced app building with PowerApps expressions and rulesAdvanced app building with PowerApps expressions and rules
Advanced app building with PowerApps expressions and rulesMicrosoft Tech Community
 
Practical Recipes for Daily DBA Activities using DB2 9 and 10 for z/OS
Practical Recipes for Daily DBA Activities using DB2 9 and 10 for z/OSPractical Recipes for Daily DBA Activities using DB2 9 and 10 for z/OS
Practical Recipes for Daily DBA Activities using DB2 9 and 10 for z/OSCuneyt Goksu
 
ComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical SciencesComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical Sciencesalexstorer
 
Application Development & Database Choices: Postgres Support for non Relation...
Application Development & Database Choices: Postgres Support for non Relation...Application Development & Database Choices: Postgres Support for non Relation...
Application Development & Database Choices: Postgres Support for non Relation...EDB
 
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...ScyllaDB
 
OrientDB - cloud barcamp Libero Cloud
OrientDB - cloud barcamp Libero CloudOrientDB - cloud barcamp Libero Cloud
OrientDB - cloud barcamp Libero CloudLuigi Dell'Aquila
 
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...ScyllaDB
 

Semelhante a Cobrix – A COBOL data source for Spark (20)

Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)
 
PDBC
PDBCPDBC
PDBC
 
BSSML16 L6. Basic Data Transformations
BSSML16 L6. Basic Data TransformationsBSSML16 L6. Basic Data Transformations
BSSML16 L6. Basic Data Transformations
 
Sql server building a database ppt 12
Sql server building a database ppt 12Sql server building a database ppt 12
Sql server building a database ppt 12
 
Working with Complex Types in DataFrames: Optics to the Rescue
Working with Complex Types in DataFrames: Optics to the RescueWorking with Complex Types in DataFrames: Optics to the Rescue
Working with Complex Types in DataFrames: Optics to the Rescue
 
Online | MongoDB Atlas on GCP Workshop
Online | MongoDB Atlas on GCP Workshop Online | MongoDB Atlas on GCP Workshop
Online | MongoDB Atlas on GCP Workshop
 
Lighting talk neo4j fosdem 2011
Lighting talk neo4j fosdem 2011Lighting talk neo4j fosdem 2011
Lighting talk neo4j fosdem 2011
 
SQL DDL: tricks and tips (JProf#27, Minsk, 24th September)
SQL DDL: tricks and tips (JProf#27, Minsk, 24th September)SQL DDL: tricks and tips (JProf#27, Minsk, 24th September)
SQL DDL: tricks and tips (JProf#27, Minsk, 24th September)
 
DDL,DML,1stNF
DDL,DML,1stNFDDL,DML,1stNF
DDL,DML,1stNF
 
sfdfds
sfdfdssfdfds
sfdfds
 
When to no sql and when to know sql javaone
When to no sql and when to know sql   javaoneWhen to no sql and when to know sql   javaone
When to no sql and when to know sql javaone
 
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docx
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docxL1 Intro to Relational DBMS LP.pdfIntro to Relational .docx
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docx
 
SQL Database Design & Querying
SQL Database Design & QueryingSQL Database Design & Querying
SQL Database Design & Querying
 
Advanced app building with PowerApps expressions and rules
Advanced app building with PowerApps expressions and rulesAdvanced app building with PowerApps expressions and rules
Advanced app building with PowerApps expressions and rules
 
Practical Recipes for Daily DBA Activities using DB2 9 and 10 for z/OS
Practical Recipes for Daily DBA Activities using DB2 9 and 10 for z/OSPractical Recipes for Daily DBA Activities using DB2 9 and 10 for z/OS
Practical Recipes for Daily DBA Activities using DB2 9 and 10 for z/OS
 
ComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical SciencesComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical Sciences
 
Application Development & Database Choices: Postgres Support for non Relation...
Application Development & Database Choices: Postgres Support for non Relation...Application Development & Database Choices: Postgres Support for non Relation...
Application Development & Database Choices: Postgres Support for non Relation...
 
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
 
OrientDB - cloud barcamp Libero Cloud
OrientDB - cloud barcamp Libero CloudOrientDB - cloud barcamp Libero Cloud
OrientDB - cloud barcamp Libero Cloud
 
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Último (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

Cobrix – A COBOL data source for Spark

  • 1. Cobrix – A COBOL data source for Spark Ruslan Iushchenko (ABSA), Felipe Melo (ABSA)
  • 2. • Who we are? • Motivation • Mainframe files and copybooks • Loading simple files • Loading hierarchical databases • Performance and results • Cobrix in ABSA Big Data space Outline 2/40
  • 3. About us . ABSA is a Pan-African financial services provider - With Apache Spark at the core of its data engineering . We fill gaps in the Hadoop ecosystem, when we find them . Contributions to Apache Spark . Spark-related open-source projects (https://github.com/AbsaOSS) - Spline - a data lineage tracking and visualization tool - ABRiS - Avro SerDe for structured APIs - Atum - Data quality library for Spark - Enceladus - A dynamic data conformance engine - Cobrix - A Cobol library for Spark (focus of this presentation) 3/40
  • 4. The market for Mainframes is strong, with no signs of cooling down. Mainframes Are used by 71% of Fortune 500 companies Are responsible for 87% of all credit card transactions in the world Are part of the IT infrastructure of 92 out of the 100 biggest banks in the world Handle 68% of the world’s production IT workloads, while accounting for only 6% of IT costs. For companies relying on Mainframes, becoming data-centric can be prohibitively expensive High cost of hardware Expensive business model for data science related activities Source: http://blog.syncsort.com/2018/06/mainframe/9-mainframe-statistics/ Business Motivation 4/40
  • 5. Technical Motivation • The process above takes 11 days for a 600GB file • Legacy data models (hierarchical) • Need for performance, scalability, flexibility, etc • SPOILER alert: we brought it to 1.1 hours 5/40 Mainframes PC Fixed-length Text Files CSVHDFS 1. Extract 2. Transform 4. Load 3. Transform Proprietary Tools
  • 6. • Run analytics / Spark on mainframes • Message Brokers (e.g. MQ) • Sqoop • Proprietary solutions • But ... • Pricey • Slow • Complex (specially for legacy systems) • Require human involvement What can you do? 6/40
  • 7. How Cobrix can help • Decreasing human involvement • Simplifying the manipulation of hierarchical structures • Providing scalability • Open-source 7/40
  • 8. Mainframe file (EBCDIC) Schema (copybook) Apache Spark Application Cobrix Output (Parquet, JSON, CSV…)df df df transformations Writer ... Cobrix – a custom Spark data source 8/40
  • 9. A copybook is a schema definition A data file is a collection of binary records Name: █ █ █ █ Age: █ █ Company: █ █ █ █ █ █ █ █ Phone #: █ █ █ █ █ █ █ █ Zip: █ █ █ █ █ Name: J O H N Age: 3 2 Company: F O O . C O M Phone #: + 2 3 1 1 - 3 2 7 Zip: 1 2 0 0 0 A * N J O H N G A 3 2 S H K D K S I A S S A S K A S A L , S D F O O . C O M X L Q O K ( G A } S N B W E S < N J X I C W L D H J P A S B C + 2 3 1 1 - 3 2 7 C = D 1 2 0 0 0 F H 0 D . 9/40
  • 10. Similar to IDLs of Avro, Thrift, Protocol Buffers, etc. struct Company { 1: required i64 id, 2: required string name, 3: optional list<string> contactPeople } Thrift message Company { required int64 id = 1; required string name = 2; repeated string contact_people = 3; } 10 COMPANY. 15 ID PIC 9(12) COMP. 15 NAME PIC X(40). 15 CONTACT-PEOPLE PIC X(20) OCCURS 10. COBOL record Company { int64 id; string name; array<string> contactPeople; } 10/40
  • 11. val df = spark .read .format("cobol") .option("copybook", "data/example.cob") .load("data/example") 01 RECORD. 05 COMPANY-ID PIC 9(10). 05 COMPANY-NAME PIC X(40). 05 ADDRESS PIC X(60). 05 REG-NUM PIC X(8). 05 ZIP PIC X(6). A * N J O H N G A 3 2 S H K D K S I A S S A S K A S A L , S D F O O . C O M X L Q O K ( G A } S N B W E S < N J X I C W L D H J P A S B C + 2 3 COMPANY_ID COMPANY_NAME ADDRESS REG_NUM ZIP 100 ABCD Ltd. 10 Garden st. 8791237 03120 101 ZjkLPj 11 Park ave. 1233971 23111 102 Robotrd Inc. 12 Forest st. 0382979 12000 103 Xingzhoug 8 Mountst. 2389012 31222 104 Example.co 123 Tech str. 3129001 19000 Loading Mainframe Data 11/40
  • 12. EBCDIC to Parquet examples

val spark = SparkSession.builder()
  .appName("Example").getOrCreate()

val df = spark
  .read
  .format("cobol")
  .option("copybook", "data/example.cob")
  .load("data/example")

// ...Business logic goes here...

df.write.parquet("data/output")

This App is
● Distributed
● Scalable
● Resilient
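To build such an application, the spark-cobol artifact has to be on the classpath. A minimal sketch, assuming the coordinates published by the project (group id, artifact id and version should be checked against the Cobrix README):

// build.sbt sketch - coordinates and version are assumptions, not authoritative
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "<version>"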
  • 13. Redefined Fields

[illustration: the same bytes interpreted either as COMPANY-NAME or as FIRST-NAME / LAST-NAME]

• Redefined fields AKA
  • Unchecked unions
  • Untagged unions
  • Variant type fields
• Several fields occupy the same space

01 RECORD.
   05 IS-COMPANY PIC 9(1).
   05 COMPANY.
      10 COMPANY-NAME PIC X(40).
   05 PERSON REDEFINES COMPANY.
      10 FIRST-NAME PIC X(20).
      10 LAST-NAME PIC X(20).
   05 ADDRESS PIC X(50).
   05 ZIP PIC X(6).
  • 14. Redefined Fields

01 RECORD.
   05 IS-COMPANY PIC 9(1).
   05 COMPANY.
      10 COMPANY-NAME PIC X(40).
   05 PERSON REDEFINES COMPANY.
      10 FIRST-NAME PIC X(20).
      10 LAST-NAME PIC X(20).
   05 ADDRESS PIC X(50).
   05 ZIP PIC X(6).

• Cobrix applies all redefines for each record
• Some fields can clash
• It’s up to the user to apply business logic to separate correct and wrong data

IS_COMPANY  COMPANY                                PERSON                                               ADDRESS               ZIP
1           {"COMPANY_NAME": "September Ltd."}     {"FIRST_NAME": "Septem", "LAST_NAME": "ber Ltd."}    74 Lawn ave., Denver  39023
0           {"COMPANY_NAME": "Beatrice Gagliano"}  {"FIRST_NAME": "Beatrice", "LAST_NAME": "Gagliano"}  10 Garden str.        33113
1           {"COMPANY_NAME": "January Inc."}       {"FIRST_NAME": "Januar", "LAST_NAME": "y Inc."}      122/1 Park ave.       31234
  • 15. Clean Up Redefined Fields + flatten structs

df.select($"IS_COMPANY",
  when($"IS_COMPANY" === 1, $"COMPANY.COMPANY_NAME")
    .otherwise(null).as("COMPANY_NAME"),
  when($"IS_COMPANY" === 0, $"PERSON.FIRST_NAME")
    .otherwise(null).as("FIRST_NAME"),
  ...)

IS_COMPANY  COMPANY_NAME    FIRST_NAME  LAST_NAME  ADDRESS               ZIP
1           September Ltd.                         74 Lawn ave., Denver  39023
0                           Beatrice    Gagliano   10 Garden str.        33113
1           January Inc.                           122/1 Park ave.       31234
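For reference, a self-contained version of the same clean-up (a sketch; the column names come from the copybook above, and the comparison values assume IS-COMPANY is decoded as a number - use "1"/"0" if it comes back as a string):

import spark.implicits._
import org.apache.spark.sql.functions.when

// Sketch: keep the redefine that matches the record type, drop the clashing one
val dfClean = df.select(
  $"IS_COMPANY",
  when($"IS_COMPANY" === 1, $"COMPANY.COMPANY_NAME").as("COMPANY_NAME"),
  when($"IS_COMPANY" === 0, $"PERSON.FIRST_NAME").as("FIRST_NAME"),
  when($"IS_COMPANY" === 0, $"PERSON.LAST_NAME").as("LAST_NAME"),
  $"ADDRESS",
  $"ZIP")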
  • 16. Hierarchical DBs

• Several record types
  • AKA segments
• Each segment type has its own schema
• Parent-child relationships between segments

[illustration: a COMPANY root segment (ID, Name, Address) with several CONTACT-PERSON child segments (Name, Phone #)]
  • 17. Defining a Copybook

• The combined copybook has to contain all the segments as redefined fields:

01 COMPANY-DETAILS.
   05 SEGMENT-ID PIC X(5).
   05 COMPANY-ID PIC X(10).
   05 COMPANY.
      10 NAME PIC X(15).
      10 ADDRESS PIC X(25).
      10 REG-NUM PIC 9(8) COMP.
   05 CONTACT REDEFINES COMPANY.
      10 PHONE-NUMBER PIC X(17).
      10 CONTACT-PERSON PIC X(28).

SEGMENT-ID and COMPANY-ID are the common data; COMPANY is segment 1 and CONTACT is segment 2.

[illustration: the COMPANY layout (Name, Address, Reg-Num) and the CONTACT layout (Phone #, Contact Person) occupying the same bytes]
  • 18. Reading all the segments

• The code snippet for reading the data:

val df = spark
  .read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("is_record_sequence", "true")
  .load("examples/multisegment_data")
  • 19. Reading all the segments

• The dataset for the whole copybook:
• Invalid redefines are highlighted

SEGMENT_ID  COMPANY_ID  COMPANY           CONTACT
C           1005918818  [ ABCD Ltd. ]     [ invalid ]
P           1005918818  [ invalid ]       [ Cliff Wallingford ]
C           1036146222  [ DEFG Ltd. ]     [ invalid ]
P           1036146222  [ invalid ]       [ Beatrice Gagliano ]
C           1045855294  [ Robotrd Inc. ]  [ invalid ]
P           1045855294  [ invalid ]       [ Doretha Wallingford ]
P           1045855294  [ invalid ]       [ Deshawn Benally ]
P           1045855294  [ invalid ]       [ Willis Tumlin ]
C           1057751949  [ Xingzhoug ]     [ invalid ]
P           1057751949  [ invalid ]       [ Mindy Boettcher ]
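A quick sanity check on such a dataset is counting records per segment type (a small sketch on the DataFrame above):

// Sketch: how many root (C) and child (P) records were read
df.groupBy("SEGMENT_ID").count().show()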
  • 20. Separate segments by dataframes

[illustration: the raw byte stream split into one COMPANY record and its PERSON records]

Companies:
Id   Name          Address        Reg_Num
100  Example.com   10 Garden st.  8791237
101  ZjkLPj        11 Park ave.   1233971
102  Robotrd Inc.  12 Forest st.  0382979
103  Xingzhoug     8 Mountst.     2389012
104  ABCD Ltd.     123 Tech str.  3129001

Contacts:
Company_Id  Contact_Person  Phone_Number
100         Jane            +32186331
100         Colyn           +23769123
102         Robert          +12389679
102         Teresa          +32187912
102         Laura           +42198723
  • 21. Reading root segments

• Filter segment #1 (companies)

val dfCompanies = df.filter($"SEGMENT_ID" === "C")
  .select($"COMPANY_ID",
    $"COMPANY.NAME".as("COMPANY_NAME"),
    $"COMPANY.ADDRESS",
    $"COMPANY.REG_NUM")

Company_Id  Company_Name  Address        Reg_Num
100         ABCD Ltd.     10 Garden st.  8791237
101         ZjkLPj        11 Park ave.   1233971
102         Robotrd Inc.  12 Forest st.  0382979
103         Xingzhoug     8 Mountst.     2389012
104         Example.co    123 Tech str.  3129001
  • 22. Reading child segments

• Filter segment #2 (people)

val dfContacts = df
  .filter($"SEGMENT_ID" === "P")
  .select($"COMPANY_ID",
    $"CONTACT.CONTACT_PERSON",
    $"CONTACT.PHONE_NUMBER")

Company_Id  Contact_Person  Phone_Number
100         Marry           +32186331
100         Colyn           +23769123
102         Robert          +12389679
102         Teresa          +32187912
102         Laura           +42198723
  • 23. The two segments can now be joined by Company_Id

Companies:
Company_Id  Company_Name  Address        Reg_Num
100         ABCD Ltd.     10 Garden st.  8791237
101         ZjkLPj        11 Park ave.   1233971
102         Robotrd Inc.  12 Forest st.  0382979
103         Xingzhoug     8 Mountst.     2389012
104         Example.co    123 Tech str.  3129001

Contacts:
Company_Id  Contact_Person  Phone_Number
100         Marry           +32186331
100         Colyn           +23769123
102         Robert          +12389679
102         Teresa          +32187912
102         Laura           +42198723

Joined:
Company_Id  Company_Name  Address        Reg_Num  Contact_Person  Phone_Number
100         ABCD Ltd.     10 Garden st.  8791237  Marry           +32186331
100         ABCD Ltd.     10 Garden st.  8791237  Colyn           +23769123
102         Robotrd Inc.  12 Forest st.  0382979  Robert          +12389679
102         Robotrd Inc.  12 Forest st.  0382979  Teresa          +32187912
102         Robotrd Inc.  12 Forest st.  0382979  Laura           +42198723
  • 24. Joining in Spark is easy

• Joining segments 1 and 2

val dfJoined = dfCompanies.join(dfContacts, Seq("COMPANY_ID"), "outer")

Results:
Company_Id  Company_Name  Address        Reg_Num  Contact_Person  Phone_Number
100         ABCD Ltd.     10 Garden st.  8791237  Marry           +32186331
100         ABCD Ltd.     10 Garden st.  8791237  Colyn           +23769123
102         Robotrd Inc.  12 Forest st.  0382979  Robert          +12389679
102         Robotrd Inc.  12 Forest st.  0382979  Teresa          +32187912
102         Robotrd Inc.  12 Forest st.  0382979  Laura           +42198723
  • 25. Denormalize data

• The joined table can also be denormalized for document storage

val dfCombined = dfJoined
  .groupBy($"COMPANY_ID", $"COMPANY_NAME", $"ADDRESS", $"REG_NUM")
  .agg(
    collect_list(
      struct($"CONTACT_PERSON", $"PHONE_NUMBER"))
      .as("CONTACTS"))

{
  "COMPANY_ID": "8216281722",
  "COMPANY_NAME": "ABCD Ltd.",
  "ADDRESS": "74 Lawn ave., New York",
  "REG_NUM": "33718594",
  "CONTACTS": [
    { "CONTACT_PERSON": "Cassey Norgard", "PHONE_NUMBER": "+(595) 641 62 32" },
    { "CONTACT_PERSON": "Verdie Deveau",  "PHONE_NUMBER": "+(721) 636 72 35" },
    { "CONTACT_PERSON": "Otelia Batman",  "PHONE_NUMBER": "+(813) 342 66 28" }
  ]
}
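Documents like the one above can be produced by writing the denormalized DataFrame out as JSON (a sketch; the output path is illustrative, and collect_list/struct come from org.apache.spark.sql.functions):

// Sketch: one JSON document per company, contacts nested as an array
dfCombined.write.json("data/companies_json")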
  • 26. Restore parent-child relationships

• In our example we had a COMPANY_ID field that is present in all segments
• In real copybooks this is not the case
• What can we do?

[illustration: a COMPANY root segment with CONTACT-PERSON child segments and no shared key field]
  • 27. Id Generation

• If COMPANY-ID is not part of all segments, Cobrix can generate one for you

01 COMPANY-DETAILS.
   05 SEGMENT-ID PIC X(5).
   05 COMPANY.
      10 NAME PIC X(15).
      10 ADDRESS PIC X(25).
      10 REG-NUM PIC 9(8) COMP.
   05 CONTACT REDEFINES COMPANY.
      10 PHONE-NUMBER PIC X(17).
      10 CONTACT-PERSON PIC X(28).

(no COMPANY-ID in this copybook)

val df = spark
  .read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("is_record_sequence", "true")
  .option("segment_field", "SEGMENT-ID")
  .option("segment_id_level0", "C")
  .option("segment_id_prefix", "ID")
  .load("examples/multisegment_data")
  • 28. Id Generation

01 COMPANY-DETAILS.
   05 SEGMENT-ID PIC X(5).
   05 COMPANY.
      10 NAME PIC X(15).
      10 ADDRESS PIC X(25).
      10 REG-NUM PIC 9(8) COMP.
   05 CONTACT REDEFINES COMPANY.
      10 PHONE-NUMBER PIC X(17).
      10 CONTACT-PERSON PIC X(28).

(no COMPANY-ID in this copybook)

• Seg0_Id can be used to restore parent-child relationship between segments

SEGMENT_ID  Seg0_Id  COMPANY           CONTACT
C           ID_0_0   [ ABCD Ltd. ]     [ invalid ]
P           ID_0_0   [ invalid ]       [ Cliff Wallingford ]
C           ID_0_2   [ DEFG Ltd. ]     [ invalid ]
P           ID_0_2   [ invalid ]       [ Beatrice Gagliano ]
C           ID_0_4   [ Robotrd Inc. ]  [ invalid ]
P           ID_0_4   [ invalid ]       [ Doretha Wallingford ]
P           ID_0_4   [ invalid ]       [ Deshawn Benally ]
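With the generated key the segments can be split and joined exactly as before (a sketch, assuming the column names shown above; the name of the generated column depends on the Cobrix version and options used):

// Sketch: join root and child segments on the generated Seg0_Id
val dfCompanies = df.filter($"SEGMENT_ID" === "C")
  .select($"Seg0_Id", $"COMPANY.NAME", $"COMPANY.ADDRESS", $"COMPANY.REG_NUM")
val dfContacts = df.filter($"SEGMENT_ID" === "P")
  .select($"Seg0_Id", $"CONTACT.CONTACT_PERSON", $"CONTACT.PHONE_NUMBER")
val dfJoined = dfCompanies.join(dfContacts, Seq("Seg0_Id"))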
  • 29. Variable Length Records (VLR)

• When transferred from a mainframe, a hierarchical database becomes
  • A sequence of records
• To read the next record, the previous record has to be read first
  • A sequential format by its nature
• How to make it scalable?

[illustration: a data file as a stream of raw bytes mapping to interleaved COMPANY and PERSON records]
  • 30. Performance challenge of VLRs

• Naturally sequential files
  • To read the next record, the prior record needs to be read first
• Each record has a length field
  • It acts as a pointer to the next record
• No record delimiter when reading a file from the middle

[illustration: VLR structure]
[chart: throughput (MB/s) vs number of Spark cores for variable record length - sequential processing stays at about 10 MB/s]
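To see why this is inherently sequential, here is a minimal sketch of walking length-prefixed records one by one (the 2-byte big-endian length prefix is illustrative only; the actual mainframe record headers differ):

import java.io.{BufferedInputStream, DataInputStream, FileInputStream}

// Sketch: each record's offset is only known after the previous record has been read,
// so a single reader cannot jump into the middle of the file.
def walkRecords(path: String): Unit = {
  val in = new DataInputStream(new BufferedInputStream(new FileInputStream(path)))
  try {
    while (in.available() > 0) {
      val len = in.readUnsignedShort()   // illustrative 2-byte length header
      val record = new Array[Byte](len)
      in.readFully(record)               // decode `record` using the copybook layout
    }
  } finally in.close()
}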
  • 31. Spark on HDFS

[diagram: the HDFS Namenode and the Spark Driver coordinating several data nodes, each hosting HDFS blocks, Spark partitions and a Spark Executor]
  • 32. Parallelizing Sequential Reads

1. Cobrix reads the record headers and builds a list of (offset, length) pairs
2. Spark parallelizes the offsets and lengths across the cluster
3. The records are parsed in parallel from the parallelized offsets and lengths

[diagram: HDFS data nodes serving the file to Cobrix and to the Spark cluster during the three steps above]
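A simplified sketch of the idea (all names are illustrative; Cobrix builds such a sparse index internally when is_record_sequence is enabled):

// Sketch: group the cheaply-scanned record headers into chunks of (offset, length),
// then let each Spark task decode its own byte range.
case class Chunk(offset: Long, length: Long)

def buildSparseIndex(headers: Seq[(Long, Int)],   // (record offset, record length)
                     recordsPerChunk: Int): Seq[Chunk] =
  headers.grouped(recordsPerChunk).map { group =>
    val start = group.head._1
    val end   = group.last._1 + group.last._2
    Chunk(start, end - start)
  }.toSeq

// val chunks  = buildSparseIndex(scannedHeaders, 1000)        // scannedHeaders is hypothetical
// val records = spark.sparkContext.parallelize(chunks, chunks.size)
//   .flatMap(chunk => readAndDecode(chunk))                   // readAndDecode is hypothetical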
  • 33. Enabling Data Locality for VLRs

1. Cobrix reads the header of VLR 1: offset = 3000, length = 100
2. It asks the HDFS Namenode where bytes 3000 to 3100 are stored
3. The Namenode answers: check nodes 3, 18 and 41
4. Spark loads VLR 1 with node 3 as the preferred location
5. The task is launched on an executor hosted on node 3

[diagram: HDFS data nodes, the Namenode, Cobrix and the Spark cluster exchanging the steps above]
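The block-location lookup in step 2 can be done with the standard HDFS client API; a minimal sketch (path and byte range are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: ask HDFS which hosts store a given byte range of a file, so the
// partition reading that range can be scheduled close to the data.
val fs = FileSystem.get(new Configuration())
val status = fs.getFileStatus(new Path("/data/mainframe/file.dat"))
val locations = fs.getFileBlockLocations(status, 3000L, 100L)   // offset, length
val preferredHosts = locations.flatMap(_.getHosts).distinct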
  • 34. Throughput when sparse indexes are used

• Experiments were run on our lab cluster
  • 4 nodes
  • 380 cores
  • 10 Gbit network
• Scalable - throughput grows with the number of Spark cores, and the larger the file, the better it scales

[chart: throughput (MB/s) vs number of Spark cores for 10 GB, 20 GB and 40 GB files with variable record length, compared to sequential processing]
  • 35. Comparison versus fixed-length record performance

● Distribution and locality are handled completely by Spark
● Parallelism is achieved using sparse indexes

[charts: throughput (MB/s) vs number of Spark cores for a 40 GB file - about 150 MB/s for fixed-length records and about 145 MB/s for variable-length records, versus about 10 MB/s for sequential processing]
  • 36. Cobrix in ABSA Data Infrastructure - Batch

[diagram: batch pipeline - data from 180+ mainframe sources is 1. ingested into HDFS, 2. parsed with Cobrix on Spark, 3. conformed with Enceladus, 4. lineage-tracked with Spline, 5. re-ingested and 6. consumed]
  • 37. Cobrix in ABSA Data Infrastructure - Stream

[diagram: streaming pipeline through Kafka - data from 180+ mainframe sources is 1. ingested, 2. parsed with the Cobrix parser, 3. converted to Avro with ABRiS, 4. conformed with Enceladus, 5. lineage-tracked with Spline and 6. consumed]
  • 38. Cobrix in ABSA Data Infrastructure

[diagram: the batch pipeline (HDFS, Cobrix on Spark, Enceladus, Spline) and the streaming pipeline (Kafka, Cobrix parser, ABRiS, Enceladus, Spline) side by side, both fed by 180+ mainframe sources - ingest, parse, conform, track lineage, re-ingest/consume]
  • 39. Acknowledgment

● Thanks to the following people, who made this project possible and helped all along the way:
  ○ Andrew Baker, Francois Cillers, Adam Smyczek, Jan Scherbaum, Peter Moon, Clifford Lategan, Rekha Gorantla, Mohit Suryavanshi, Niel Steyn
● Thanks to the authors of the original COBOL parser:
  ○ Ian De Beer, Rikus de Milander (https://github.com/zenaptix-lab/copybookStreams)
  • 40. Your Contribution is Welcome

• Combine expertise to make accessing mainframe data in Hadoop seamless
• Our goal is to support the widest range of use cases possible
• Report a bug
• Request a new feature
• Create a pull request

Our home: https://github.com/AbsaOSS/cobrix