The financial industry operates on a variety of data and computing platforms. Integrating these different sources into a centralized data lake is crucial to support reporting and analytics tools.
Apache Spark is becoming the tool of choice for big data integration and analytics due to its scalable nature and its support for processing data from a variety of sources and formats such as JSON, Parquet, Kafka, etc. However, one of the most common platforms in the financial industry is the mainframe, which does not provide easy interoperability with other platforms.
COBOL is the most widely used language in the mainframe environment. It was designed in 1959 and evolved in parallel with other programming languages, and thus has its own constructs and primitives. Furthermore, data produced by COBOL applications is EBCDIC-encoded and uses different binary representations for numeric data types.
We have developed Cobrix, a library that extends the Spark SQL API to allow reading binary files generated by mainframes directly.
While projects like Sqoop focus on transferring relational data by providing direct connectors to a mainframe, Cobrix can be used to parse and load hierarchical data (from IMS, for instance) after it is transferred from a mainframe by dumping records to a binary file. The schema is provided as a COBOL copybook, which can contain nested structures and arrays. We present how the schema mapping between COBOL and Spark was done and how it was used in the implementation of the Spark COBOL data source. We also present use cases of simple and multi-segment files to illustrate how we use the library to load data from mainframes into our Hadoop data lake.
Cobrix – A COBOL data source for Spark
1. Cobrix – A COBOL data source for Spark
Ruslan Iushchenko (ABSA), Felipe Melo (ABSA)
2. • Who we are
• Motivation
• Mainframe files and copybooks
• Loading simple files
• Loading hierarchical databases
• Performance and results
• Cobrix in ABSA Big Data space
Outline
2/40
3. About us
• ABSA is a Pan-African financial services provider
  - With Apache Spark at the core of its data engineering
• We fill gaps in the Hadoop ecosystem when we find them
• Contributions to Apache Spark
• Spark-related open-source projects (https://github.com/AbsaOSS)
  - Spline - a data lineage tracking and visualization tool
  - ABRiS - Avro SerDe for structured APIs
  - Atum - a data quality library for Spark
  - Enceladus - a dynamic data conformance engine
  - Cobrix - a COBOL library for Spark (the focus of this presentation)
3/40
4. The market for Mainframes is strong, with no signs of cooling down.
Mainframes:
• Are used by 71% of Fortune 500 companies
• Are responsible for 87% of all credit card transactions in the world
• Are part of the IT infrastructure of 92 out of the 100 biggest banks in the world
• Handle 68% of the world’s production IT workloads, while accounting for only 6% of IT costs
For companies relying on Mainframes, becoming data-centric can be prohibitively expensive:
• High cost of hardware
• Expensive business model for data science related activities
Source: http://blog.syncsort.com/2018/06/mainframe/9-mainframe-statistics/
Business Motivation
4/40
5. Technical Motivation
• The legacy process shown below takes 11 days for a 600 GB file
• Legacy data models (hierarchical)
• Need for performance, scalability, flexibility, etc.
• SPOILER alert: we brought it down to 1.1 hours
[Diagram: the legacy pipeline - 1. Extract fixed-length text files from Mainframes using proprietary tools, 2.-3. Transform on a PC into CSV, 4. Load into HDFS]
5/40
6. • Run analytics / Spark on mainframes
• Message Brokers (e.g. MQ)
• Sqoop
• Proprietary solutions
• But ...
• Pricey
• Slow
• Complex (especially for legacy systems)
• Require human involvement
What can you do?
6/40
7. How Cobrix can help
• Decreasing human involvement
• Simplifying the manipulation of hierarchical structures
• Providing scalability
• Open-source
7/40
9. A copybook is a schema definition
A data file is a collection of binary records
[Illustration: a record layout (Name, Age, Company, Phone #, Zip), an example record (JOHN, 32, FOO.COM, +2311-327, 12000), and the same record as raw EBCDIC bytes in the binary file]
9/40
10. Similar to IDLs of Avro, Thrift, Protocol Buffers, etc.
Thrift:
struct Company {
  1: required i64 id,
  2: required string name,
  3: optional list<string> contactPeople
}

Protocol Buffers:
message Company {
  required int64 id = 1;
  required string name = 2;
  repeated string contact_people = 3;
}

COBOL:
10 COMPANY.
   15 ID PIC 9(12) COMP.
   15 NAME PIC X(40).
   15 CONTACT-PEOPLE PIC X(20)
      OCCURS 10.

A generic record:
record Company {
  int64 id;
  string name;
  array<string> contactPeople;
}
10/40
11. val df = spark
.read
.format("cobol")
.option("copybook", "data/example.cob")
.load("data/example")
01 RECORD.
05 COMPANY-ID PIC 9(10).
05 COMPANY-NAME PIC X(40).
05 ADDRESS PIC X(60).
05 REG-NUM PIC X(8).
05 ZIP PIC X(6).
[Raw EBCDIC binary records]
COMPANY_ID COMPANY_NAME ADDRESS REG_NUM ZIP
100 ABCD Ltd. 10 Garden st. 8791237 03120
101 ZjkLPj 11 Park ave. 1233971 23111
102 Robotrd Inc. 12 Forest st. 0382979 12000
103 Xingzhoug 8 Mountst. 2389012 31222
104 Example.co 123 Tech str. 3129001 19000
Loading Mainframe Data
11/40
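The copybook also defines the Spark schema of the resulting DataFrame: field names are sanitized (dashes become underscores, as in the table above), PIC X(n) fields map to string columns, and display numerics such as PIC 9(10) map to integral or decimal types depending on their precision. A quick, hedged way to check the mapping on your own data:

// Print the schema Cobrix derived from the copybook; the exact numeric types
// depend on the PIC precision and the Cobrix version in use.
df.printSchema()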
12. val spark = SparkSession.builder().appName("Example").getOrCreate()
val df = spark
.read
.format("cobol")
.option("copybook", "data/example.cob")
.load("data/example")
// ...Business logic goes here...
df.write.parquet("data/output")
This App is
● Distributed
● Scalable
● Resilient
EBCDIC to Parquet examples
12/40
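To build and submit such an application, the spark-cobol data source has to be on the classpath. A minimal sketch of the sbt dependency, assuming the Maven coordinates used by the Cobrix project (check the project README for the exact artifact name and the current version):

// build.sbt (illustrative; replace <version> with a released Cobrix version)
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "<version>"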
13. [Illustration: the same span of raw bytes can be interpreted either as FIRST-NAME + LAST-NAME or as a single COMPANY-NAME]
• Redefined fields AKA
• Unchecked unions
• Untagged unions
• Variant type fields
• Several fields occupy the same space
01 RECORD.
05 IS-COMPANY PIC 9(1).
05 COMPANY.
10 COMPANY-NAME PIC X(40).
05 PERSON REDEFINES COMPANY.
10 FIRST-NAME PIC X(20).
10 LAST-NAME PIC X(20).
05 ADDRESS PIC X(50).
05 ZIP PIC X(6).
Redefined Fields
13/40
14. 01 RECORD.
05 IS-COMPANY PIC 9(1).
05 COMPANY.
10 COMPANY-NAME PIC X(40).
05 PERSON REDEFINES COMPANY.
10 FIRST-NAME PIC X(20).
10 LAST-NAME PIC X(20).
05 ADDRESS PIC X(50).
05 ZIP PIC X(6).
• Cobrix applies all redefines for each record
• Some fields can clash
• It's up to the user to apply business logic to separate correct and wrong data
IS_COMPANY  COMPANY                                 PERSON                                                ADDRESS               ZIP
1           {"COMPANY_NAME": "September Ltd."}      {"FIRST_NAME": "Septem", "LAST_NAME": "ber Ltd."}     74 Lawn ave., Denver  39023
0           {"COMPANY_NAME": "Beatrice Gagliano"}   {"FIRST_NAME": "Beatrice", "LAST_NAME": "Gagliano"}   10 Garden str.        33113
1           {"COMPANY_NAME": "January Inc."}        {"FIRST_NAME": "Januar", "LAST_NAME": "y Inc."}       122/1 Park ave.       31234
Redefined Fields
14/40
15. df.select($"IS_COMPANY",
      when($"IS_COMPANY" === 1, $"COMPANY.COMPANY_NAME")
        .otherwise(null).as("COMPANY_NAME"),
      when($"IS_COMPANY" === 0, $"PERSON.FIRST_NAME")
        .otherwise(null).as("FIRST_NAME"),
      ...
IS_COMPANY COMPANY_NAME FIRST_NAME LAST_NAME ADDRESS ZIP
1 September Ltd. 74 Lawn ave., Denver 39023
0 Beatrice Gagliano 10 Garden str. 33113
1 January Inc. 122/1 Park ave. 31234
Clean Up Redefined Fields + flatten structs
15/40
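A fuller sketch of the clean-up, flattening the remaining nested fields into the top-level columns shown in the table above (column names follow the copybook; a when() without otherwise() yields null for non-matching rows):

import org.apache.spark.sql.functions.when
// Assumes the usual import spark.implicits._ for the $ column syntax
val dfClean = df.select(
  $"IS_COMPANY",
  when($"IS_COMPANY" === 1, $"COMPANY.COMPANY_NAME").as("COMPANY_NAME"),
  when($"IS_COMPANY" === 0, $"PERSON.FIRST_NAME").as("FIRST_NAME"),
  when($"IS_COMPANY" === 0, $"PERSON.LAST_NAME").as("LAST_NAME"),
  $"ADDRESS",
  $"ZIP")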
18. • The code snippet for reading the data:
val df = spark
.read
.format("cobol")
.option("copybook", "/path/to/copybook.cpy")
.option("is_record_sequence", "true")
.load("examples/multisegment_data")
Reading all the segments
18/40
19. • The dataset for the whole copybook:
• Invalid redefines are shown as [ invalid ]
SEGMENT_ID COMPANY_ID COMPANY CONTACT
C 1005918818 [ ABCD Ltd. ] [ invalid ]
P 1005918818 [ invalid ] [ Cliff Wallingford ]
C 1036146222 [ DEFG Ltd. ] [ invalid ]
P 1036146222 [ invalid ] [ Beatrice Gagliano ]
C 1045855294 [ Robotrd Inc. ] [ invalid ]
P 1045855294 [ invalid ] [ Doretha Wallingford ]
P 1045855294 [ invalid ] [ Deshawn Benally ]
P 1045855294 [ invalid ] [ Willis Tumlin ]
C 1057751949 [ Xingzhoug ] [ invalid ]
P 1057751949 [ invalid ] [ Mindy Boettcher ]
Reading all the segments
19/40
20. [Illustration: the raw byte stream contains interleaved COMPANY records (ID, Name, Address) and PERSON records (Name, Phone #), which Cobrix parses into separate record types]
Id Name Address Reg_Num
100 Example.com 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 ABCD Ltd. 123 Tech str. 3129001
Company_Id Contact_Person Phone_Number
100 Jane +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Separate segments by dataframes
20/40
21. • Filter segment #1 (companies)
val dfCompanies = df
  .filter($"SEGMENT_ID" === "C")
  .select($"COMPANY_ID",
    $"COMPANY.NAME".as("COMPANY_NAME"),
    $"COMPANY.ADDRESS",
    $"COMPANY.REG_NUM")
Company_Id Company_Name Address Reg_Num
100 ABCD Ltd. 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 Example.co 123 Tech str. 3129001
Reading root segments
21/40
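The child segment (the contact persons) can be filtered out the same way. The corresponding slide is not part of this extraction, so the following is a sketch using the segment id and field names from the surrounding slides:

// Filter segment #2 (contacts), assuming the CONTACT group fields
val dfContacts = df
  .filter($"SEGMENT_ID" === "P")
  .select($"COMPANY_ID",
    $"CONTACT.CONTACT_PERSON",
    $"CONTACT.PHONE_NUMBER")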
23. Company_Id Company_Name Address Reg_Num
100 ABCD Ltd. 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 Example.co 123 Tech str. 3129001
Company_Id Contact_Person Phone_Number
100 Marry +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Company_Id Company_Name Address Reg_Num Contact_Person Phone_Number
100 ABCD Ltd. 10 Garden st. 8791237 Marry +32186331
100 ABCD Ltd. 10 Garden st. 8791237 Colyn +23769123
102 Robotrd Inc. 12 Forest st. 0382979 Robert +12389679
102 Robotrd Inc. 12 Forest st. 0382979 Teresa +32187912
102 Robotrd Inc. 12 Forest st. 0382979 Laura +42198723
The two segments can now be joined by Company_Id
23/40
24. • Joining segments 1 and 2
val dfJoined =
  dfCompanies.join(dfContacts, Seq("COMPANY_ID"))
Results:
Company_Id Company_Name Address Reg_Num Contact_Person Phone_Number
100 ABCD Ltd. 10 Garden st. 8791237 Marry +32186331
100 ABCD Ltd. 10 Garden st. 8791237 Colyn +23769123
102 Robotrd Inc. 12 Forest st. 0382979 Robert +12389679
102 Robotrd Inc. 12 Forest st. 0382979 Teresa +32187912
102 Robotrd Inc. 12 Forest st. 0382979 Laura +42198723
Joining in Spark is easy
24/40
25. • The joined table can also be denormalized for document storage
val dfCombined =
dfJoined
.groupBy($"COMPANY_ID",
$"COMPANY_NAME",
$"ADDRESS",
$"REG_NUM")
.agg(
collect_list(
struct($"CONTACT_PERSON",
$"PHONE_NUMBER"))
.as("CONTACTS"))
{
"COMPANY_ID": "8216281722",
"COMPANY_NAME": "ABCD Ltd.",
"ADDRESS": "74 Lawn ave., New York",
"REG_NUM": "33718594",
"CONTACTS": [
{
"CONTACT_PERSON": "Cassey Norgard",
"PHONE_NUMBER": "+(595) 641 62 32"
},
{
"CONTACT_PERSON": "Verdie Deveau",
"PHONE_NUMBER": "+(721) 636 72 35"
},
{
"CONTACT_PERSON": "Otelia Batman",
"PHONE_NUMBER": "+(813) 342 66 28"
}
]
}
Denormalize data
25/40
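The denormalized DataFrame can then be written out directly as documents, for example as the JSON shown above (the output path is illustrative):

dfCombined.write.json("data/companies_json")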
26. Restore parent-child relationships
• In our example we had a COMPANY_ID field that is present in all segments
• In real copybooks this is not the case
• What can we do?
[Diagram: a root COMPANY segment (ID, Name, Address) followed by its CONTACT-PERSON child segments (Name, Phone #)]
26/40
27. 01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
• If COMPANY_ID is not part of all segments, Cobrix can generate it for you
val df = spark
.read
.format("cobol")
.option("copybook", "/path/to/copybook.cpy")
.option("is_record_sequence", "true")
.option("segment_field", "SEGMENT-ID")
.option("segment_id_level0", "C")
.option("segment_id_prefix", "ID")
.load("examples/multisegment_data")
No COMPANY-ID
Id Generation
27/40
28. 01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
• Seg0_Id can be used to restore the parent-child relationship between segments
No COMPANY-ID
SEGMENT_ID Seg0_Id COMPANY CONTACT
C ID_0_0 [ ABCD Ltd. ] [ invalid ]
P ID_0_0 [ invalid ] [ Cliff Wallingford ]
C ID_0_2 [ DEFG Ltd. ] [ invalid ]
P ID_0_2 [ invalid ] [ Beatrice Gagliano ]
C ID_0_4 [ Robotrd Inc. ] [ invalid ]
P ID_0_4 [ invalid ] [ Doretha Wallingford ]
P ID_0_4 [ invalid ] [ Deshawn Benally ]
Id Generation
28/40
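With the generated Seg0_Id, the segments can be re-linked even though the copybook has no shared business key. A sketch using the columns from the table above:

// Root segments (companies) and child segments (contacts), keyed by Seg0_Id
val dfCompanies = df.filter($"SEGMENT_ID" === "C")
  .select($"Seg0_Id", $"COMPANY.NAME", $"COMPANY.ADDRESS", $"COMPANY.REG_NUM")
val dfContacts = df.filter($"SEGMENT_ID" === "P")
  .select($"Seg0_Id", $"CONTACT.CONTACT_PERSON", $"CONTACT.PHONE_NUMBER")

// Restore the parent-child relationship via the generated id
val dfJoined = dfCompanies.join(dfContacts, Seq("Seg0_Id"))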
29. • When transferred from a mainframe, a hierarchical database becomes a sequence of records
• To read the next record, the previous record must be read first
• A sequential format by its nature
• How to make it scalable?
[Illustration: a data file is a contiguous stream of raw bytes containing variable-length COMPANY and PERSON records; a record can only be located by reading the ones before it]
Variable Length Records (VLR)
29/40
30. Performance challenge of VLRs
• Naturally sequential files
  • To read the next record, the prior record needs to be read first
• Each record has a length field
  • Acts as a pointer to the next record
• No record delimiter when reading a file from the middle
VLR structure
[Chart: throughput (MB/s) vs number of Spark cores for variable-length records; sequential processing stays flat at about 10 MB/s regardless of the number of cores]
30/40
31. Spark on HDFS
[Diagram: the Spark Driver and the HDFS Namenode coordinate Spark Executors co-located with HDFS Data Nodes; HDFS blocks map to Spark partitions]
31/40
32. [Diagram: 1. Cobrix reads the record headers from HDFS and builds a list of (offset, length) pairs; 2. The list is parallelized across the Spark cluster; 3. Records are parsed in parallel from the parallelized offsets and lengths]
Parallelizing Sequential Reads
32/40
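A conceptual sketch of the idea (not the actual Cobrix internals): only the record headers are scanned to build a sparse index of (offset, length) chunks, the index is parallelized, and each Spark task seeks into the file to parse its own chunk. buildSparseIndex and parseRecords are hypothetical helpers standing in for the header scan and the copybook-based parser:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

case class Chunk(offset: Long, length: Int)

val filePath = "hdfs:///data/example"                  // illustrative path
val index: Seq[Chunk] = buildSparseIndex(filePath)     // hypothetical header scan

val records = spark.sparkContext
  .parallelize(index, index.size)                      // one task per chunk
  .flatMap { chunk =>
    val path  = new Path(filePath)
    val fs    = path.getFileSystem(new Configuration())
    val in    = fs.open(path)
    val bytes = new Array[Byte](chunk.length)
    in.readFully(chunk.offset, bytes)                  // read only this chunk
    in.close()
    parseRecords(bytes)                                // hypothetical record parser
  }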
33. [Diagram: 1. Cobrix reads the header of a VLR (e.g. offset = 3000, length = 100); 2. It asks the HDFS Namenode which nodes hold offset 3000 until 3100; 3. The Namenode answers with nodes 3, 18 and 41; 4. The record's preferred location is set to node 3; 5. Spark launches the task on an executor hosted on node 3]
Enabling Data Locality for VLRs
33/40
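This is not necessarily how Cobrix implements it, but Spark exposes the needed capability directly: a local collection can be parallelized with per-element location preferences, so the task for each record range is scheduled on a node that holds those HDFS blocks. A sketch reusing the hypothetical Chunk from the previous slide:

// Pair each chunk with the hostnames reported by the Namenode for its byte range
val chunksWithLocations: Seq[(Chunk, Seq[String])] = Seq(
  (Chunk(offset = 3000, length = 100), Seq("node3", "node18", "node41"))
)
// makeRDD honours the location hints when scheduling tasks
val rdd = spark.sparkContext.makeRDD(chunksWithLocations)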
34. Throughput when sparse indexes are used
• Experiments were run on our lab cluster
  • 4 nodes
  • 380 cores
  • 10 Gbit network
• Scalable: for bigger files, using more executors yields higher throughput
[Chart: throughput (MB/s) vs number of Spark cores for 10 GB, 20 GB and 40 GB files read with sparse indexes, compared to sequential reading]
34/40
35. Comparison versus fixed length record performance
● Distribution and locality are handled completely by Spark
● Parallelism is achieved using sparse indexes
[Charts: throughput (MB/s) vs number of Spark cores for a 40 GB file - fixed-length records reach about 150 MB/s, variable-length records with sparse indexes reach about 145 MB/s, versus about 10 MB/s for sequential reading]
35/40
36. Cobrix in ABSA Data Infrastructure - Batch
[Diagram: data from 180+ mainframe sources lands in HDFS; 2. Parse with Cobrix on Spark; 3. Conform with Enceladus; 4. Track lineage with Spline; 5. Re-ingest; 6. Consume]
36/40
39. ● Thanks to the following people, who made the project possible and helped all along the way:
○ Andrew Baker, Francois Cillers, Adam Smyczek,
Jan Scherbaum, Peter Moon, Clifford Lategan,
Rekha Gorantla, Mohit Suryavanshi, Niel Steyn
• Thanks to the authors of the original COBOL parser:
○ Ian De Beer, Rikus de Milander
(https://github.com/zenaptix-lab/copybookStreams)
Acknowledgment
39/40
40. • Combine expertise to make accessing mainframe data in Hadoop seamless
• Our goal is to support the widest range of use cases possible
• Report a bug!
• Request a new feature!
• Create a pull request!
Our home: https://github.com/AbsaOSS/cobrix
Your Contribution is Welcome
40/40