Introduction
To DataStage
What is DataStage?
• Design jobs for Extraction, Transformation, and Loading (ETL)
• Ideal tool for data integration projects – such as, data warehouses,
data marts, and system migrations
• Import, export, create, and manage metadata for use within jobs
• Schedule, run, and monitor jobs all within DataStage
• Administer your DataStage development and execution
environments
DataStage Server and Clients
DataStage Administrator
Client Logon
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
• Define global and project properties in Administrator
• Import meta data into Manager
• Build job in Designer
• Compile the job in Designer
• Validate, run, and monitor in Director
DataStage Projects
Project Properties
• Projects can be created and deleted in Administrator
• Project properties and defaults are set in Administrator
Setting Project Properties
• To set project properties, log onto Administrator, select your project, and then
click “Properties”
Licensing Tab
Projects General Tab
Environment Variables
Permissions Tab
Tracing Tab
Tunables Tab
Parallel Tab
What Is Metadata?
Data
TargetSource Transform
Meta Data
Repository
Meta
Data
Meta
Data
DataStage Manager
Import and Export
• Any object in Manager can be exported to a file
• Can export whole projects
• Use for backup
• Sometimes used for version control
• Can be used to move DataStage objects from one project to another
• Use to share DataStage jobs and projects with other developers
Export Procedure
• In Manager, click “Export>DataStage Components”
• Select DataStage objects for export
• Specify type of export: DSX, XML
• Specify file path on client machine
Exporting DataStage Objects
Exporting DataStage Objects
Import Procedure
• In Manager, click “Import>DataStage Components”
• Select DataStage objects for import
Importing DataStage Objects
Import Options
Metadata Import
• Import format and column definitions from sequential files
• Import relational table column definitions
• Imported as “Table Definitions”
• Table definitions can be loaded into job stages
Sequential File Import Procedure
• In Manager, click Import>Table Definitions>Sequential File
Definitions
• Select directory containing sequential file and then the file
• Select Manager category
• Examine format and column definitions and edit as necessary
Manager Table Definition
Importing Sequential Metadata
What Is a Job?
• Executable DataStage program
• Created in DataStage Designer, but can use components from
Manager
• Built using a graphical user interface
• Compiles into Orchestrate shell language (OSH)
Job Development Overview
• In Manager, import metadata defining sources and targets
• In Designer, add stages defining data extractions and loads
• Add Transformers and other stages to define data transformations
• Add links defining the flow of data from sources to targets
• Compile the job
• In Director, validate, run, and monitor your job
Designer Work Area
Designer Toolbar
Provides quick access to the main functions of Designer
Job properties
Compile
Show/hide metadata markers
Tools Palette
Adding Stages and Links
• Stages can be dragged from the tools palette or from the stage type
branch of the repository view
• Links can be drawn from the tools palette or by right clicking and
dragging from one stage to another
Designer - Create New Job
Drag Stages and Links Using Palette
Assign Meta Data
Editing a Sequential Source Stage
Editing a Sequential Target
Transformer Stage
• Used to define constraints, derivations, and column mappings
• A column mapping maps an input column to an output column
• In this module we will just define column mappings (no derivations)
Transformer Stage Elements
Create Column Mappings
Creating Stage Variables
Result
Adding Job Parameters
• Makes the job more flexible
• Parameters can be:
– Used in constraints and derivations
– Used in directory and file names
• Parameter values are determined at run time
Adding Job Documentation
• Job Properties
– Short and long descriptions
– Shows in Manager
• Annotation stage
– Is a stage on the tool palette
– Shows on the job GUI (work area)
Job Properties Documentation
Annotation Stage on the Palette
Annotation Stage Properties
Final Job Work Area with Documentation
Compiling a Job
Errors or Successful Message
Prerequisite to Job Execution
Result from Designer compile
DataStage Director
• Can schedule, validate, and run jobs
• Can be invoked from DataStage Manager or Designer
– Tools > Run Director
Running Your Job
Run Options – Parameters and Limits
Director Log View
Message Details are Available
Other Director Functions
• Schedule job to run on a particular date/time
• Clear job log
• Set Director options
– Row limits
– Abort after x warnings
Process Flow
• Administrator – add/delete projects, set defaults
• Manager – import meta data, backup projects
• Designer – assemble jobs, compile, and execute
• Director – execute jobs, examine job run logs
Administrator – Licensing and Timeout
Administrator – Project Creation/Removal
Functions
specific to a
project.
Administrator – Project Properties
RCP for parallel
jobs should be
enabled
Variables for
parallel
processing
Administrator – Environment Variables
Variables are
category
specific
OSH is what is
run by the EE
Framework
DataStage Manager
Export Objects to MetaStage
Push meta data
to MetaStage
Designer Workspace
Can execute the
job from Designer
DataStage Generated OSH
The EE
Framework
runs OSH
Director – Executing Jobs
Messages from
previous run in
different color
Stages
Can now customize the Designer’s palette
Select desired stages and drag
to favorites
Popular Developer Stages
Row
generator
Peek
Row Generator
• Can build test data
Repeatable
property
Edit row in
column tab
Peek
• Displays field values
– Will be displayed in job log or sent to a file
– Skip records option
– Can control number of records to be displayed
• Can be used as stub stage for iterative development (more later)
Why EE is so Effective
• Parallel processing paradigm
– More hardware, faster processing
– Level of parallelization is determined by a configuration file read
at runtime
• Emphasis on memory
– Data read into memory and lookups performed like hash table
Parallel Processing Systems
• DataStage EE enables parallel processing = executing your application on multiple
CPUs simultaneously
– If you add more resources (CPUs, RAM, and disks), you increase system performance
• Example system containing 6 CPUs (or processing nodes) and disks
Scaleable Systems: Examples
Three main types of scalable systems
• Symmetric Multiprocessors (SMP): shared memory and disk
• Clusters: UNIX systems connected via networks
• MPP: Massively Parallel Processing
SMP: Shared Everything
• Multiple CPUs with a single operating system
• Programs communicate using shared memory
• All CPUs share system resources
(OS, memory with single linear address space, disks, I/O)
When used with Enterprise Edition:
• Data transport uses shared memory
• Simplified startup
Source
Transform
Target
Data
Warehouse
Operational Data
Archived Data
Clean Load
Disk Disk Disk
Traditional approach to batch processing:
• Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
• a 10 GB stream leads to 70 GB of I/O
• processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
Traditional Batch Processing
Data Pipelining
• Transform, clean and load processes are executing simultaneously on the same processor
• rows are moving forward through the flow
Source
Target
Data
Warehouse
Operational Data
Transform
Archived Data
Clean Load
• Start a downstream process while an upstream process is still running.
• This eliminates intermediate storing to disk, which is critical for big data.
• This also keeps the processors busy.
• Still has limits on scalability
Pipeline Multiprocessing
Data Partitioning
Transform
Source
Data
Transform
Transform
Transform
Node 1
Node 2
Node 3
Node 4
A-F
G- M
N-T
U-Z
• Break up big data into partitions
• Run one partition on each processor
• 4X faster on 4 processors;
with data big enough:
100X faster on 100 processors
• This is exactly how the parallel
databases work!
• Data Partitioning requires the
same transform to all partitions:
Aaron Abbott and Zygmund Zorn
undergo the same transform
Partition Parallelism
Putting It All Together: Parallel Dataflow
Source
Target
Transform Clean Load
Pipelining
Partitioning
Source
Data
Data
Warehouse
Combining Parallelism Types
Putting It All Together: Parallel Dataflow
with Repartitioning on-the-fly
Without Landing To Disk!
Source
Target
Transform Clean Load
Pipelining
Source
Data Data
Warehouse
Partitioning
Repartitioning
A-F
G- M
N-T
U-Z
Customer last name Customer zip code Credit card number
Repartitioning
Repartitioning
EE Program Elements
• Dataset: uniform set of rows in the Framework's internal representation
- Three flavors:
1. file sets *.fs : stored on multiple Unix files as flat files
2. persistent: *.ds : stored on multiple Unix files in Framework format
read and written using the DataSet Stage
3. virtual: *.v : links, in Framework format, NOT stored on disk
- The Framework processes only datasets—hence possible need for Import
- Different datasets typically have different schemas
- Convention: "dataset" = Framework data set.
• Partition: subset of rows in a dataset earmarked for processing by the same node
(virtual CPU, declared in a configuration file).
- All the partitions of a dataset follow the same schema: that of the dataset
Orchestrate Program
(sequential dataflow)
Orchestrate Application Framework and Runtime System
Import
Clean 1
Clean 2
Merge Analyze
Configuration File
Centralized Error Handling and Event Logging
Parallel access to data in files
Parallel access to data in RDBMS
Inter-node communications
Parallel pipelining
Parallelization of operations
Import
Clean1
Merge Analyze
Clean2
Relational Data
Performance Visualization
Flat Files
Orchestrate Framework:
Provides application scalability
DataStage Enterprise Edition:
Best-of-breed scalable data integration platform
No limitations on data volumes or throughput
DataStage EE Architecture
DataStage:
Provides data integration platform
Introduction to DataStage EE
• DSEE:
– Automatically scales to fit the machine
– Handles data flow among multiple CPU’s and disks
• With DSEE you can:
– Create applications for SMP’s, clusters and MPP’s…
Enterprise Edition is architecture-neutral
– Access relational databases in parallel
– Execute external applications in parallel
– Store data across multiple disks and nodes
Developer assembles data flow using the Designer
Job Design VS. Execution
…and gets: parallel access, propagation, transformation, and load.
The design is good for 1 node, 4 nodes,
or N nodes. To change # nodes, just swap configuration file.
No need to modify or recompile the design
Partitioners and Collectors
• Partitioners distribute rows into partitions
– implement data-partition parallelism
• Collectors = inverse partitioners
• Live on input links of stages running
– in parallel (partitioners)
– sequentially (collectors)
• Use a choice of methods
Example Partitioning Icons
partitioner
Types of Sequential Data Stages
• Sequential
– Fixed or variable length
• File Set
• Lookup File Set
• Data Set
Sequential Stage Introduction
• The EE Framework processes only datasets
• For files other than datasets, such as flat files, Enterprise Edition
must perform import and export operations – this is performed by
import and export OSH operators generated by Sequential or FileSet
stages
• During import or export DataStage performs format translations –
into, or out of, the EE internal format
• Data is described to the Framework in a schema
How the Sequential Stage Works
• Generates Import/Export operators, depending on whether stage is
source or target
• Performs direct C++ file I/O streams
Using the Sequential File Stage
Importing/Exporting Data
Both import and export of general files (text, binary) are performed by the
SequentialFile Stage.
– Data import: into EE internal format
– Data export: out of EE internal format
Working With Flat Files
• Sequential File Stage
– Normally will execute in sequential mode
– Can be parallel if reading multiple files (file pattern option)
– Can use multiple readers within a node
– DSEE needs to know
• How file is divided into rows
• How row is divided into columns
Processes Needed to Import Data
• Recordization
– Divides input stream into records
– Set on the format tab
• Columnization
– Divides the record into columns
– Default set on the format tab but can be overridden on the
columns tab
– Can be “incomplete” if using a schema or not even specified in
the stage if using RCP
File Format Example
Field1 , Field1 , … , Last field nl        (Final Delimiter = end)
Field1 , Field1 , … , Last field , nl      (Final Delimiter = comma)
Field Delimiter = comma
Record delimiter = nl
Sequential File Stage
• To set the properties, use stage editor
– Page (general, input/output)
– Tabs (format, columns)
• Sequential stage link rules
– One input link
– One output link (except for reject link definition)
– One reject link
• Will reject any records not matching meta data in the
column definitions
Job Design Using Sequential Stages
Stage categories
General Tab – Sequential Source
Multiple output
links
Show records
Properties – Multiple Files
Click to add more files
having the same meta
data.
Properties - Multiple Readers
Multiple readers option
allows you to set number
of readers
Format Tab
File into records
Record into columns
Read Methods
Reject Link
• Reject mode = output
• Source
– All records not matching the meta data (the column definitions)
• Target
– All records that are rejected for any reason
• Meta data – one column, datatype = raw
File Set Stage
• Can read or write file sets
• Files suffixed by .fs
• File set consists of:
1. Descriptor file – contains location of raw data files + meta data
2. Individual raw data files
• Can be processed in parallel
File Set Stage Example
Descriptor file
File Set Usage
• Why use a file set?
– 2G limit on some file systems
– Need to distribute data among nodes to prevent overruns
– If used in parallel, runs faster than a sequential file
Lookup File Set Stage
• Can create file sets
• Usually used in conjunction with Lookup stages
Lookup File Set > Properties
Key column
specified
Key column
dropped in
descriptor file
Data Set
• Operating system (Framework) file
• Suffixed by .ds
• Referred to by a control file
• Managed by Data Set Management utility from GUI (Manager,
Designer, Director)
• Represents persistent data
• Key to good performance in set of linked jobs
Persistent Datasets
• Accessed from/to disk with DataSet Stage.
• Two parts:
– Descriptor file:
• contains metadata, data location, but NOT the data itself
– Data file(s)
• contain the data
• multiple Unix files (one per node), accessible in parallel
input.ds
node1:/local/disk1/…
node2:/local/disk2/…
record (
partno: int32;
description: string;
)
Data Set Stage
Is the data partitioned?
Engine Data Translation
• Occurs on import
– From sequential files or file sets
– From RDBMS
• Occurs on export
– From datasets to file sets or sequential files
– From datasets to RDBMS
• Engine is most efficient when processing internally formatted records
(i.e., data contained in datasets)
Managing DataSets
• GUI (Manager, Designer, Director) – tools > data set management
• Alternative methods
– Orchadmin
• Unix command line utility
• List records
• Remove data sets (will remove all components)
– Dsrecords
• Lists number of records in a dataset
Data Set Management
Display data
Schema
Data Set Management From Unix
• Alternative method of managing file sets and data sets
– Dsrecords
• Gives record count
– Unix command-line utility
– $ dsrecords ds_name
e.g., $ dsrecords myDS.ds
156999 records
– Orchadmin
• Manages EE persistent data sets
– Unix command-line utility
e.g., $ orchadmin rm myDataSet.ds
Job Presentation
Document using the
annotation stage
Job Properties Documentation
Description shows in DS Manager
and MetaStage
Organize jobs into
categories
Naming conventions
• Stages named after the
– Data they access
– Function they perform
– DO NOT leave defaulted stage names like Sequential_File_0
• Links named for the data they carry
– DO NOT leave defaulted link names like DSLink3
Stage and Link Names
Stages and links
renamed to data
they handle
Create Reusable Job Components
• Use Enterprise Edition shared containers
when feasible
Container
Use Iterative Job Design
• Use copy or peek stage as stub
• Test job in phases – small first, then increasing in complexity
• Use Peek stage to examine records
Copy or Peek Stage Stub
Copy stage
Transformer Stage Techniques
• Suggestions -
– Always include reject link.
– Always test for null value before using a column in a function.
– Try to use RCP and only map columns that have a derivation other than a
copy. More on RCP later.
– Be aware of Column and Stage variable Data Types.
• Often user does not pay attention to Stage Variable type.
– Avoid type conversions.
• Try to maintain the data type as imported.
The Copy Stage
With 1 link in, 1 link out:
the Copy Stage is the ultimate "no-op" (place-holder):
– Partitioners
– Sort / Remove Duplicates
– Rename, Drop column
… can be inserted on:
– input link (Partitioning): Partitioners, Sort, Remove Duplicates)
– output link (Mapping page): Rename, Drop.
Sometimes replace the transformer:
– Rename,
– Drop,
– Implicit type Conversions
Developing Jobs
1. Keep it simple
• Jobs with many stages are hard to debug and maintain.
2. Start small and build to the final solution
• Use view data, copy, and peek.
• Start from the source and work out.
• Develop with a 1 node configuration file.
3. Solve the business problem before the performance problem.
• Don’t worry too much about partitioning until the sequential flow works as
expected.
4. If you have to write to disk, use a persistent data set.
Final Result
Good Things to Have in each Job
• Use job parameters
• Some helpful environmental variables to add to job parameters
– $APT_DUMP_SCORE
• Report OSH to message log
– $APT_CONFIG_FILE
• Establishes runtime parameters to the EE engine; e.g., degree of
parallelization
Setting Job Parameters
Click to add
environment
variables
DUMP SCORE Output
Double-click
Mapping: Node --> partition
Setting APT_DUMP_SCORE yields:
Partitioner and Collector
Use Multiple Configuration Files
• Make a set for 1X, 2X,….
• Use different ones for test versus production
• Include as a parameter in each job
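As a hedged sketch (the directory and the file names 1node.apt and 4node.apt are illustrative assumptions, not shipped defaults), switching configurations from the command line is just a matter of re-pointing $APT_CONFIG_FILE; inside jobs the same value is usually supplied through the $APT_CONFIG_FILE job parameter:
$ export APT_CONFIG_FILE=/usr/dsadm/Ascential/DataStage/Configurations/1node.apt   # testing
$ export APT_CONFIG_FILE=/usr/dsadm/Ascential/DataStage/Configurations/4node.apt   # production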
Parallel Database Connectivity
Traditional Client-Server vs. Enterprise Edition
Traditional Client-Server:
• Clients (Sort, Load, etc.) each connect to the parallel RDBMS
• Only the RDBMS is running in parallel
• Each application has only one connection
Enterprise Edition:
• Parallel server runs APPLICATIONS
• Application has parallel connections to RDBMS
RDBMS Access
Supported Databases
Enterprise Edition provides high performance /
scalable interfaces for:
• DB2
• Informix
• Oracle
• Teradata
RDBMS Access
• Automatically convert RDBMS table layouts to/from Enterprise Edition
Table Definitions
• RDBMS nulls converted to/from nullable field values
• Support for standard SQL syntax for specifying:
– field list for SELECT statement
– filter for WHERE clause
• Can write an explicit SQL query to access RDBMS
• EE supplies additional information in the SQL query
RDBMS Stages
• DB2/UDB Enterprise
• Informix Enterprise
• Oracle Enterprise
• Teradata Enterprise
RDBMS Usage
• As a source
– Extract data from table (stream link)
– Extract as table, generated SQL, or user-defined SQL
– User-defined can perform joins, access views
– Lookup (reference link)
– Normal lookup is memory-based (all table data read into memory)
– Can perform one lookup at a time in DBMS (sparse option)
– Continue/drop/fail options
• As a target
– Inserts
– Upserts (Inserts and updates)
– Loader
RDBMS Source – Stream Link
Stream link
DBMS Source - User-defined SQL
Columns in SQL
statement must
match the meta data
in columns tab
DBMS Source – Reference Link
Reject link
Lookup Reject Link
“Output” option
automatically creates
the reject link
Null Handling
• Must handle null condition if lookup record is not found and
“continue” option is chosen
• Can be done in a transformer stage
Lookup Stage Mapping
Link name
Lookup Stage Properties
Reference link
Must have same column name in input and reference links. You will get the results
of the lookup in the output column.
DBMS as a Target
DBMS As Target
• Write Methods
– Delete
– Load
– Upsert
– Write (DB2)
• Write mode for load method
– Truncate
– Create
– Replace
– Append
Target Properties
Upsert mode determines options
Generated code can be copied
Checking for Nulls
• Use Transformer stage to test for fields with null values (Use IsNull
functions)
• In Transformer, can reject or load default value
Concepts
• The Enterprise Edition Platform
– Script language - OSH (generated by DataStage Parallel Canvas, and run by
DataStage Director)
– Communication - conductor, section leaders, players
– Configuration files (only one active at a time, describes H/W)
– Meta data - schemas/tables
– Schema propagation - RCP
– EE extensibility - Buildop, Wrapper
– Datasets (data in Framework's internal representation)
Output Data Set schema:
prov_num:int16;
member_num:int8;
custid:int32;
Input Data Set schema:
prov_num:int16;
member_num:int8;
custid:int32;
EE Stages Involve A Series Of Processing Steps
Input
Interface
Partitioner
Business
Logic
Output
Interface
EE Stage
• Piece of Application
Logic Running Against
Individual Records
• Parallel or Sequential
DS-EE Stage Elements
• EE Delivers Parallelism in
Two Ways
– Pipeline
– Partition
• Block Buffering Between
Components
– Eliminates Need for Program
Load Balancing
– Maintains Orderly Data Flow
Dual Parallelism Eliminates Bottlenecks!
Pipeline
Partition
Producer
Consumer
DSEE Stage Execution
Stages Control Partition Parallelism
• Execution Mode (sequential/parallel) is controlled by Stage
– default = parallel for most Ascential-supplied Stages
– Developer can override default mode
– Parallel Stage inserts the default partitioner (Auto) on its input links
– Sequential Stage inserts the default collector (Auto) on its input links
– Developer can override default
• execution mode (parallel/sequential) of Stage > Advanced tab
• choice of partitioner/collector on Input > Partitioning tab
How Parallel Is It?
• Degree of parallelism is determined by the configuration
file
– Total number of logical nodes in default pool, or a
subset if using "constraints".
• Constraints are assigned to specific pools as defined in
configuration file and can be referenced in the stage
OSH
• DataStage EE GUI generates OSH scripts
– Ability to view OSH turned on in Administrator
– OSH can be viewed in Designer using job properties
• The Framework executes OSH
• What is OSH?
– Orchestrate shell
– Has a UNIX command-line interface
OSH Script
• An osh script is a quoted string which specifies:
– The operators and connections of a single Orchestrate step
– In its simplest form, it is:
osh “op < in.ds > out.ds”
• Where:
– op is an Orchestrate operator
– in.ds is the input data set
– out.ds is the output data set
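As a hedged illustration (op1 and op2 are generic placeholder operator names, not taken from any specific job), two operators in the same step can be connected with a Unix-style pipe; the pipe corresponds to a virtual dataset (a link) carrying rows from one operator to the next without landing to disk:
osh “op1 < in.ds | op2 > out.ds”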
OSH Operators
• OSH Operator is an instance of a C++ class inheriting from
APT_Operator
• Developers can create new operators
• Examples of existing operators:
– Import
– Export
– RemoveDups
Enable Visible OSH in Administrator
Will be enabled
for all projects
View OSH in Designer
Schema
Operator
• Operators
• Datasets: set of rows processed by Framework
– Orchestrate data sets:
– persistent (terminal) *.ds, and
– virtual (internal) *.v.
– Also: flat “file sets” *.fs
• Schema: data description (metadata) for datasets and links.
Elements of a Framework Program
• Consist of Partitioned Data and Schema
• Can be Persistent (*.ds) or Virtual (*.v, Link)
• Overcome 2 GB File Limit
=
What you program: What gets processed:
.
.
.
Multiple files per partition
Each file up to 2GBytes (or larger)
Operator
A
Operator
A
Operator
A
Operator
A
Node 1 Node 2 Node 3 Node 4
data files
of x.ds
$ osh “operator_A > x.ds“
GUI
OSH
Datasets
What gets generated:
Operator
A
Computing Architectures: Definition
Clusters and MPP Systems
Shared Disk Shared Nothing
Uniprocessor
Dedicated Disk
Shared Memory
SMP System
(Symmetric Multiprocessor)
DiskDisk
CPU
Memory
CPU CPU CPU CPU
CPU
Disk
Memory
CPU
Disk
Memory
CPU
Disk
Memory
CPU
Disk
Memory
Job Execution: Orchestrate
Processing Node
Processing Node
• Conductor - initial DS/EE process
– Step Composer
– Creates Section Leader processes (one per node)
– Consolidates messages, outputs them
– Manages orderly shutdown.
• Section Leader
– Forks Player processes (one per Stage)
– Manages up/down communication.
• Players
– The actual processes associated with Stages
– Combined players: one process only
– Send stderr to SL
– Establish connections to other players for data flow
Conductor Node
C
SL
PP P
SL
PP P
Working with Configuration Files
• You can easily switch between config files:
• '1-node' file - for sequential execution, lighter reports—handy for testing
• 'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism
• 'BigN-nodes' file - aims at full data-partitioned parallelism
• Only one file is active while a step is running
• The Framework queries (first) the environment variable:
$APT_CONFIG_FILE
# nodes declared in the config file need not match # CPUs
• Same configuration file can be used in development and target
machines
Scheduling Nodes, Processes, and CPUs
• DS/EE does not:
– know how many CPUs are available
– schedule
• Who knows what?
• Who does what?
– DS/EE creates (Nodes*Ops) Unix processes
– The O/S schedules these processes on the CPUs
Nodes = # logical nodes declared in config. file
Ops = # ops. (approx. # blue boxes in V.O.)
Processes = # Unix processes
CPUs = # available CPUs
              Nodes   Ops   Processes     CPUs
User            Y      N
Orchestrate     Y      Y    Nodes * Ops    N
O/S                             "          Y
For example, with 4 nodes declared and 3 operators in the flow, Orchestrate creates 4 * 3 = 12 Unix processes, which the O/S then schedules across the available CPUs.
Parallel to parallel flow may incur reshuffling:
Records may jump between nodes
node 1
node 2
partitioner
Re-Partitioning
Partitioning Methods
• Auto
• Hash
• Entire
• Range
• Range Map
• Collectors combine partitions of a dataset into a single
input stream to a sequential Stage
data partitions
collector
sequential Stage
...
–Collectors do NOT synchronize data
Collectors
Partitioning and Repartitioning Are Visible On Job
Design
Partitioning and Collecting Icons
Partitioner Collector
Reading Messages in Director
• Set APT_DUMP_SCORE to true
• Can be specified as job parameter
• Messages sent to Director log
• If set, parallel job will produce a report showing the operators,
processes, and datasets in the running job
Messages With APT_DUMP_SCORE = True
Transformed Data
• Transformed data is:
– Outgoing column is a derivation that may, or may not, include incoming
fields or parts of incoming fields
– May be comprised of system variables
• Frequently uses functions performed on something (i.e., incoming columns)
– Divided into categories, e.g.:
• Date and time
• Mathematical
• Logical
• Null handling
• More
Stages Review
• Stages that can transform data
– Transformer
• Parallel
• Basic (from Parallel palette)
– Aggregator (discussed in later module)
• Sample stages that do not transform data
– Sequential
– FileSet
– DataSet
– DBMS
Transformer Stage Functions
• Control data flow
• Create derivations
Flow Control
• Separate records flow down links based on data condition –
specified in Transformer stage constraints
• Transformer stage can filter records
• Other stages can filter records but do not exhibit advanced flow
control
– Sequential can send bad records down reject link
– Lookup can reject records based on lookup failure
– Filter can select records based on data value
Rejecting Data
• Reject option on sequential stage
– Data does not agree with meta data
– Output consists of one column with binary data type
• Reject links (from Lookup stage) result from the drop option of the property “If
Not Found”
– Lookup “failed”
– All columns on reject link (no column mapping option)
• Reject constraints are controlled from the constraint editor of the transformer
– Can control column mapping
– Use the “Other/Log” checkbox
Rejecting Data Example
“If Not Found” property
Constraint – Other/log option
Property Reject Mode = Output
Transformer Stage Properties
Transformer Stage Variables
• First of transformer stage entities to execute
• Execute in order from top to bottom
– Can write a program by using one stage variable to point to the
results of a previous stage variable
• Multi-purpose
– Counters
– Hold values for previous rows to make comparison
– Hold derivations to be used in multiple field derivations
– Can be used to control execution of constraints
Stage Variables
Show/Hide
button
Transforming Data
• Derivations
– Using expressions
– Using functions
• Date/time
• Transformer Stage Issues
– Sometimes require sorting before the transformer stage – e.g.,
using a stage variable as an accumulator and needing to break on
change of a column value
• Checking for nulls
Checking for Nulls
• Nulls can get introduced into the dataflow because of failed lookups
and the way in which you chose to handle this condition
• Can be handled in constraints, derivations, stage variables, or a
combination of these
Transformer - Handling Rejects
Constraint Rejects
– All expressions are false and
reject row is checked
Transformer: Execution Order
• Derivations in stage variables are executed first
• Constraints are executed before derivations
• Column derivations in earlier links are executed before later links
• Derivations in higher columns are executed before lower columns
Parallel Palette - Two Transformers
• All > Processing >
• Transformer
• Is the non-Universe transformer
• Has a specific set of functions
• No DS routines available
• Parallel > Processing
• Basic Transformer
• Makes server style transforms
available on the parallel palette
• Can use DS routines
• Program in Basic for both transformers
Transformer Functions From Derivation Editor
• Date & Time
• Logical
• Null Handling
• Number
• String
• Type Conversion
Sorting Data
• Important because
– Some stages require sorted input
– Some stages may run faster, e.g., Aggregator
• Can be performed
– Option within stages (use input > partitioning tab and set
partitioning to anything other than auto)
– As a separate stage (more complex sorts)
Sorting Alternatives
• Alternative representation of same flow:
Sort Option on Stage Link
Sort Stage
Sort Stage - Outputs
• Specifies how the output is derived
Sort Specification Options
• Input Link Property
– Limited functionality
– Max memory/partition is 20 MB, then spills to scratch
• Sort Stage
– Tunable to use more memory before spilling to scratch.
• Note: Spread I/O by adding more scratch file systems to each node
of the APT_CONFIG_FILE
Removing Duplicates
• Can be done by Sort stage
– Use unique option
OR
• Remove Duplicates stage
– Has more sophisticated ways to remove duplicates
Combining Data
• There are two ways to combine data:
– Horizontally:
Several input links; one output link (+ optional rejects) made of columns
from different input links. E.g.,
• Joins
• Lookup
• Merge
– Vertically:
One input link, one output link with column combining values from all input
rows. E.g.,
• Aggregator
Join, Lookup & Merge Stages
• These "three Stages" combine two or more input links according to
values of user-designated "key" column(s).
• They differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)
Not all Links are Created Equal
                               Joins     Lookup         Merge
Primary Input: port 0          Left      Source         Master
Secondary Input(s): ports 1,…  Right     LU Table(s)    Update(s)
• Enterprise Edition distinguishes between:
- The Primary Input (Framework port 0)
- Secondary - in some cases "Reference" (other ports)
• Naming convention:
Tip:
Check "Input Ordering" tab to make sure intended Primary is listed first
Join Stage Editor
One of four variants:
– Inner
– Left Outer
– Right Outer
– Full Outer
Several key columns
allowed
Link Order
immaterial for Inner
and Full Outer Joins
(but VERY important
for Left/Right Outer
and Lookup and
Merge)
1. The Join Stage
Four types:
• 2 sorted input links, 1 output link
– "left outer" on primary input, "right outer" on secondary input
– Pre-sort make joins "lightweight": few rows need to be in RAM
• Inner
• Left Outer
• Right Outer
• Full Outer
2. The Lookup Stage
Combines:
– one source link with
– one or more duplicate-free table links
no pre-sort necessary
allows multiple-key LUTs
flexible exception handling for
source input rows with no match
Source
input
One or more
tables (LUTs)
Output Reject
Lookup
0
1
2
0
1
The Lookup Stage
• Lookup Tables should be small enough to fit into physical
memory (otherwise, performance hit due to paging)
• On an MPP you should partition the lookup tables using entire
partitioning method, or partition them the same way you
partition the source link
• On an SMP, no physical duplication of a Lookup Table occurs
The Lookup Stage
• Lookup File Set
– Like a persistent data set only it contains
metadata about the key.
– Useful for staging lookup tables
• RDBMS LOOKUP
– NORMAL
• Loads to an in memory hash table first
– SPARSE
• Select for each row.
• Might become a performance
bottleneck.
3. The Merge Stage
• Combines
– one sorted, duplicate-free master (primary) link with
– one or more sorted update (secondary) links.
– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but
opposite to lookup).
• Follows the Master-Update model:
– Master row and one or more update rows are merged if they have the same value in
user-specified key column(s).
– A non-key column occurs in several inputs? The lowest input port number prevails (e.g.,
master over update; update values are ignored)
– Unmatched ("Bad") master rows can be either
• kept
• dropped
– Unmatched ("Bad") update rows in input link can be captured in a "reject" link
– Matched update rows are consumed.
The Merge Stage
Allows composite keys
Multiple update links
Matched update rows are consumed
Unmatched updates can be captured
Lightweight
Space/time tradeoff: presorts vs. in-RAM table
Master (port 0), one or more updates (ports 1, 2, …)
Output (port 0), rejects (ports 1, 2, …)
Merge
Synopsis: Joins, Lookup, & Merge
In this table, the comma separates primary and secondary input links (and output and reject links).
                                   Joins                         Lookup                              Merge
Model                              RDBMS-style relational        Source - in RAM LU Table            Master - Update(s)
Memory usage                       light                         heavy                               light
# and names of inputs              exactly 2: 1 left, 1 right    1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort               both inputs                   no                                  all inputs
Duplicates in primary input        OK (x-product)                OK                                  Warning!
Duplicates in secondary input(s)   OK (x-product)                Warning!                            OK only when N = 1
Options on unmatched primary       NONE                          [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary     NONE                          NONE                                capture in reject set(s)
On match, secondary entries are    reusable                      reusable                            consumed
# Outputs                          1                             1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)          Nothing (N/A)                 unmatched primary entries           unmatched secondary entries
The Aggregator Stage
Purpose: Perform data aggregations
Specify:
• Zero or more key columns that define the aggregation units (or
groups)
• Columns to be aggregated
• Aggregation functions:
count (nulls/non-nulls) sum
max/min/range
• The grouping method (hash table or pre-sort) is a performance
issue
Grouping Methods
• Hash: results for each aggregation group are stored in a hash table, and the table is
written out after all input has been processed
– doesn’t require sorted data
– good when number of unique groups is small. Running tally for each group’s
aggregate calculations need to fit easily into memory. Require about 1KB/group
of RAM.
– Example: average family income by state requires about 50 groups × 1 KB ≈ 0.05 MB of RAM
• Sort: results for only a single aggregation group are kept in memory; when new group
is seen (key value changes), current group written out.
– requires input sorted by grouping keys
– can handle unlimited numbers of groups
– Example: average daily balance by credit card
Aggregator Functions
• Sum
• Min, max
• Mean
• Missing value count
• Non-missing value count
• Percent coefficient of variation
Aggregator Properties
Aggregation Types
Aggregation types
Containers
• Two varieties
– Local
– Shared
• Local
– Simplifies a large, complex diagram
• Shared
– Creates reusable object that many jobs can include
Creating a Container
• Create a job
• Select (loop) portions to containerize
• Edit > Construct container > local or shared
Configuration File Concepts
• Determine the processing nodes and disk space connected to each
node
• When system changes, need only change the configuration file – no
need to recompile jobs
• When DataStage job runs, platform reads configuration file
– Platform automatically scales the application to fit the system
Processing Nodes Are
• Locations on which the framework runs applications
• Logical rather than physical construct
• Do not necessarily correspond to the number of CPUs in your
system
– Typically one node for two CPUs
• Can define one processing node for multiple physical nodes or
multiple processing nodes for one physical node
Optimizing Parallelism
• Degree of parallelism determined by number of nodes defined
• Parallelism should be optimized, not maximized
– Increasing parallelism distributes work load but also increases
Framework overhead
• Hardware influences degree of parallelism possible
• System hardware partially determines configuration
More Factors to Consider
• Communication amongst operators
– Should be optimized by your configuration
– Operators exchanging large amounts of data should be assigned
to nodes communicating by shared memory or high-speed link
• SMP – leave some processors for operating system
• Desirable to equalize partitioning of data
• Use an experimental approach
– Start with small data sets
– Try different parallelism while scaling up data set sizes
Configuration File
• Text file containing string data that is passed to the Framework
– Sits on server side
– Can be displayed and edited
• Name and location found in environmental variable
APT_CONFIG_FILE
• Components
– Node
– Fast name
– Pools
– Resource
Node Options
• Node name – name of a processing node used by EE
– Typically the network name
– Use the command uname -n to obtain the network name (see the example after this list)
• Fastname –
– Name of node as referred to by fastest network in the system
– Operators use physical node name to open connections
– NOTE: for SMP, all CPUs share single connection to network
• Pools
– Names of pools to which this node is assigned
– Used to logically group nodes
– Can also be used to group resources
• Resource
– Disk
– Scratchdisk
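For example, running uname -n on the host returns the value to use for the node and fastname entries (the host name shown is the illustrative "BlackHole" used in the sample configuration file on the next slide):
$ uname -n
BlackHole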
Sample Configuration File
{
node "Node1"
{
fastname "BlackHole"
pools "" "node1"
resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools "" }
resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools "" }
}
}
Disk Pools
• Disk pools allocate storage
• By default, EE uses the default
pool, specified by “”
pool "bigdata"
Sorting Requirements
Resource pools can also be specified for sorting:
• The Sort stage looks first for scratch disk resources in a
“sort” pool, and then in the default disk pool
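A hedged sketch of a node entry that adds a scratch disk to the "sort" pool, mirroring the sample configuration file syntax shown earlier (the /scratch/sort0 path is an illustrative assumption):
node "Node1"
{
fastname "BlackHole"
pools "" "node1"
resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools "" }
resource scratchdisk "/scratch/sort0" {pools "sort" }
resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools "" }
}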
Resource Types
• Disk
• Scratchdisk
• DB2
• Oracle
• Saswork
• Sortwork
• Can exist in a pool
– Groups resources together
Using Different Configurations
Lookup stage where DBMS is using a sparse lookup type
Building a Configuration File
• Scoping the hardware:
– Is the hardware configuration SMP, Cluster, or MPP?
– Define each node structure (an SMP would be single node):
• Number of CPUs
• CPU speed
• Available memory
• Available page/swap space
• Connectivity (network/back-panel speed)
– Is the machine dedicated to EE? If not, what other applications are running on
it?
– Get a breakdown of the resource usage (vmstat, mpstat, iostat)
– Are there other configuration restrictions? E.g. DB only runs on certain nodes
and ETL cannot run on them?
Wrappers vs. Buildop vs. Custom
• Wrappers are good if you cannot or do not want to modify the
application and performance is not critical.
• Buildops: good if you need custom coding but do not need
dynamic (runtime-based) input and output interfaces.
• Custom (C++ coding using framework API): good if you need custom
coding and need dynamic input and output interfaces.
Building “Wrapped” Stages
You can “wrapper” a legacy executable:
• Binary
• Unix command
• Shell script
… and turn it into a Enterprise Edition stage capable, among other things, of
parallel execution…
As long as the legacy executable is:
• amenable to data-partition parallelism
» no dependencies between rows
• pipe-safe
» can read rows sequentially
» no random access to data
Wrappers (Cont’d)
Wrappers are treated as a black box
• EE has no knowledge of contents
• EE has no means of managing anything that occurs inside the wrapper
• EE only knows how to export data to and import data from the wrapper
• User must know at design time the intended behavior of the wrapper and its
schema interface
• If the wrappered application needs to see all records prior to processing, it cannot
run in parallel.
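For instance (an illustrative assumption, separate from the course's ls example that follows), a simple pipe-safe Unix filter qualifies for wrapping because it reads rows from stdin sequentially, writes rows to stdout, and has no dependencies between rows:
# reads each input row in turn and writes one output row per input row
$ tr '[:lower:]' '[:upper:]' < names.txt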
LS Example
• Can this command be wrappered?
Creating a Wrapper
Used in this job ---
To create the “ls” stage
Creating Wrapped Stages
From Manager:
Right-Click on Stage Type
> New Parallel Stage > Wrapped
We will "Wrapper” an existing
Unix executables – the ls
command
Wrapper Starting Point
Wrapper - General Page
Unix command to be wrapped
Name of stage
Conscientiously
maintaining the Creator
page for all your wrapped
stages will eventually earn
you the thanks of others.
The "Creator" Page
Wrapper – Properties Page
• If your stage will have properties, complete the Properties
page
This will be the name of
the property as it
appears in your stage
Wrapper - Wrapped Page
Interfaces – input and output columns -
these should first be entered into the
table definitions meta data (DS
Manager); let’s do that now.
• Layout interfaces describe what columns the stage:
– Needs for its inputs (if any)
– Creates for its outputs (if any)
– Should be created as tables with columns in Manager
Interface schemas
Column Definition for Wrapper Interface
How Does the Wrapping Work?
– Define the schema for export and
import
• Schemas become interface
schemas of the operator and allow
for by-name column access
import
export
stdout or
named pipe
stdin or
named pipe
UNIX executable
output schema
input schema
Update the Wrapper Interfaces
• This wrapper will have no input interface – i.e. no input link. The
location will come as a job parameter that will be passed to the
appropriate stage property. Therefore, only the Output tab entry is
needed.
Resulting Job
Wrapped stage
Job Run
• Show file from Designer palette
Wrapper Story: Cobol Application
• Hardware Environment:
– IBM SP2, 2 nodes with 4 CPU’s per node.
• Software:
– DB2/EEE, COBOL, EE
• Original COBOL Application:
– Extracted source table, performed lookup against table in DB2, and Loaded results to
target table.
– 4 hours 20 minutes sequential execution
• Enterprise Edition Solution:
– Used EE to perform Parallel DB2 Extracts and Loads
– Used EE to execute COBOL application in Parallel
– EE Framework handled data transfer between
DB2/EEE and COBOL application
– 30 minutes 8-way parallel execution
Buildops
Buildop provides a simple means of extending beyond the functionality provided by EE,
but does not use an existing executable (like the wrapper)
Reasons to use Buildop include:
• Speed / Performance
• Complex business logic that cannot be easily represented
using existing stages
– Lookups across a range of values
– Surrogate key generation
– Rolling aggregates
• Build once and reusable everywhere within project, no
shared container necessary
• Can combine functionality from different stages into one
BuildOps
– The DataStage programmer encapsulates the business logic
– The Enterprise Edition interface called “buildop” automatically
performs the tedious, error-prone tasks: invoke needed header files,
build the necessary “plumbing” for a correct and efficient parallel
execution.
– Exploits extensibility of EE Framework
From Manager (or Designer):
Repository pane:
Right-Click on Stage Type
> New Parallel Stage > {Custom | Build | Wrapped}
• "Build" stages
from within Enterprise Edition
• "Wrapping” existing “Unix”
executables
BuildOp Process Overview
General Page
Identical
to Wrappers,
except:
Under the Build
Tab, your program!
Logic Tab for Business Logic
Enter Business C/C++
logic and arithmetic in four
pages under the Logic tab
Main code section goes in
Per-Record page- it will be
applied to all rows
NOTE: Code will need to
be Ansi C/C++ compliant.
If code does not compile
outside of EE, it won’t
compile within EE either!
Code Sections under Logic Tab
Temporary
variables
declared [and
initialized] here
Logic here is executed
once BEFORE
processing the FIRST
row
Logic here is executed
once AFTER
processing the LAST
row
I/O and Transfer
Under Interface tab: Input, Output & Transfer pages
First line:
output 0
Optional renaming
of
output port from
default "out0"
Write row
Input page: 'Auto Read'
Read next row
In-Repository
Table Definition
'False' setting,
not to interfere
with Transfer page
I/O and Transfer
• Transfer all columns from input to output.
• If page left blank or Auto Transfer = "False" (and RCP = "False")
Only columns in output Table Definition are written
First line:
Transfer of index 0
BuildOp Simple Example
• Example - sumNoTransfer
– Add input columns "a" and "b"; ignore other columns
that might be present in input
– Produce a new "sum" column
– Do not transfer input columns
sumNoTransfer
a:int32; b:int32
sum:int32
NO TRANSFER
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"
• Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred
From Peek:
No Transfer
Transfer
TRANSFER
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"
• Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)
Columns vs. Temporary C++ Variables
Columns
• DS-EE type
• Defined in Table Definitions
• Value refreshed from row
to row
Temp C++ variables
• C/C++ type
• Need declaration (in
Definitions or Pre-Loop
page)
• Value persistent
throughout "loop" over
rows, unless modified in
code
Custom Stage
• Reasons for a custom stage:
– Add EE operator not already in DataStage EE
– Build your own Operator and add to DataStage EE
• Use EE API
• Use Custom Stage to add new operator to EE canvas
Custom Stage
DataStage Manager > select Stage Types branch > right click
Custom Stage
Name of Orchestrate
operator to be used
Number of input and
output links allowed
Custom Stage – Properties Tab
The Result
Establishing Meta Data
• Data definitions
– Recordization and columnization
– Fields have properties that can be set at individual field level
• Data types in GUI are translated to types used by EE
– Described as properties on the format/columns tab (outputs or inputs pages)
OR
– Using a schema file (can be full or partial)
• Schemas
– Can be imported into Manager
– Can be pointed to by some job stages (i.e. Sequential)
Data Formatting – Record Level
• Format tab
• Meta data described on a record basis
• Record level properties
Data Formatting – Column Level
• Defaults for all columns
Column Overrides
• Edit row from within the columns tab
• Set individual column properties
Extended Column Properties
Field
and
string
settings
Extended Properties – String Type
• Note the ability to convert ASCII to EBCDIC
Editing Columns
Properties depend on
the data type
Schema
• Alternative way to specify column definitions for data used in EE jobs
• Written in a plain text file
• Can be written as a partial record definition
• Can be imported into the DataStage repository
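For illustration, a small schema file written in the record() syntax shown earlier for persistent datasets (the column names and the decimal[precision,scale] form are assumptions made for this example):
record (
partno: int32;
price: decimal[6,2];
description: string;
)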
Creating a Schema
• Using a text editor
– Follow correct syntax for definitions
– OR
• Import from an existing data set or file set
– On DataStage Manager import > Table Definitions > Orchestrate
Schema Definitions
– Select checkbox for a file with .fs or .ds
Importing a Schema
Schema location can be
on the server or local
work station
Data Types
• Date
• Decimal
• Floating point
• Integer
• String
• Time
• Timestamp
• Vector
• Subrecord
• Raw
• Tagged
Runtime Column Propagation
• DataStage EE is flexible about meta data. It can cope with the situation where
meta data isn’t fully defined. You can define part of your schema and specify
that, if your job encounters extra columns that are not defined in the meta data
when it actually runs, it will adopt these extra columns and propagate them
through the rest of the job. This is known as runtime column propagation (RCP).
• RCP is always on at runtime.
• Design and compile time column mapping enforcement.
– RCP is off by default.
– Enable first at project level. (Administrator project properties)
– Enable at job level. (job properties General tab)
– Enable at Stage. (Link Output Column tab)
Enabling RCP at Project Level
Enabling RCP at Job Level
Enabling RCP at Stage Level
• Go to output link’s columns tab
• For transformer you can find the output links columns tab by first going to stage
properties
Using RCP with Sequential Stages
• To utilize runtime column propagation in the sequential stage you
must use the “use schema” option
• Stages with this restriction:
– Sequential
– File Set
– External Source
– External Target
Runtime Column Propagation
• When RCP is Disabled
– DataStage Designer will enforce Stage Input Column to Output
Column mappings.
– At job compile time modify operators are inserted on output links in the
generated osh.
Runtime Column Propagation
• When RCP is Enabled
– DataStage Designer will not enforce mapping rules.
– No Modify operator inserted at compile time.
– Danger of runtime error if incoming column names do not match
column names on the outgoing link – case sensitive.
Job Control Options
• Manually write job control
– Code generated in Basic
– Use the job control tab on the job properties page
– Generates basic code which you can modify
• Job Sequencer
– Build a controlling job much the same way you build other jobs
– Comprised of stages and links
– No basic coding
Job Sequencer
• Build like a regular job
• Type “Job Sequence”
• Has stages and links
• Job Activity stage represents
a DataStage job
• Links represent passing
control
Stages
Example
Job Activity
stage –
contains
conditional
triggers
Job Activity Properties
Job parameters
to be passed
Job to be executed –
select from dropdown
Job Activity Trigger
• Trigger appears as a link in the diagram
• Custom options let you define the code
Options
• Use custom option for conditionals
– Execute if job run successful or warnings only
• Can add “wait for file” to execute
• Add “execute command” stage to drop real tables and rename new
tables to current tables
Job Activity With Multiple Links
Different links
having different
triggers
Sequencer Stage
• Build job sequencer to control job for the collections application
Can be set to
all or any
Notification
Notification Stage
Notification Activity
Sample DataStage log from Mail Notification
• Sample DataStage log from Mail Notification
Notification Activity Message
• E-Mail Message
Environment Variables
Parallel Environment Variables
Environment Variables Stage Specific
Environment Variables
Environment Variables Compiler
The Director
Typical Job Log Messages:
• Environment variables
• Configuration File information
• Framework Info/Warning/Error messages
• Output from the Peek Stage
• Additional info with "Reporting" environments
• Tracing/Debug output
– Must compile job in trace mode
– Adds overhead
• Job Properties, from Menu Bar of Designer
• Director will
prompt you
before each
run
Job Level Environmental Variables
Troubleshooting
If you get an error during compile, check the following:
• Compilation problems
– If Transformer used, check C++ compiler, LD_LIBRARY_PATH
– If Buildop errors, try buildop from the command line
– Some stages may not support RCP – can cause column mismatch
– Use the Show Error and More buttons
– Examine Generated OSH
– Check environment variables settings
• Very little integrity checking during compile, should run validate from Director.
Highlights source of error
Generating Test Data
• Row Generator stage can be used
– Column definitions
– Data type dependent
• Row Generator plus lookup stages provides good way to create
robust test data from pattern files
Thank You !!!
For more information, click the link below:
Follow Us on:
http://vibranttechnologies.co.in/datastage-classes-in-mumbai.html
Day 1 Data Stage Administrator And Director 11.0kshanmug2
 
Datastage real time scenario
Datastage real time scenarioDatastage real time scenario
Datastage real time scenarioNaresh Bala
 
SQL select statement and functions
SQL select statement and functionsSQL select statement and functions
SQL select statement and functionsVikas Gupta
 
Curriculum Vitae - Dinesh Babu S V
Curriculum Vitae - Dinesh Babu S VCurriculum Vitae - Dinesh Babu S V
Curriculum Vitae - Dinesh Babu S VDinesh Babu S V
 
Datastage developer Resume
Datastage developer ResumeDatastage developer Resume
Datastage developer ResumeMallikarjuna P
 
SQL Joins and Query Optimization
SQL Joins and Query OptimizationSQL Joins and Query Optimization
SQL Joins and Query OptimizationBrian Gallagher
 
Capturing Data Requirements
Capturing Data RequirementsCapturing Data Requirements
Capturing Data Requirementsmcomtraining
 
Datastage free tutorial
Datastage free tutorialDatastage free tutorial
Datastage free tutorialtekslate1
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities台灣資料科學年會
 

Destaque (20)

Sql server select queries ppt 18
Sql server select queries ppt 18Sql server select queries ppt 18
Sql server select queries ppt 18
 
Datastage
DatastageDatastage
Datastage
 
Data stage scenario design 2 - job1
Data stage scenario   design 2 - job1Data stage scenario   design 2 - job1
Data stage scenario design 2 - job1
 
Data stage faqs datastage faqs
Data stage faqs  datastage faqsData stage faqs  datastage faqs
Data stage faqs datastage faqs
 
Day 2 Data Stage Manager 11.0
Day 2 Data Stage Manager 11.0Day 2 Data Stage Manager 11.0
Day 2 Data Stage Manager 11.0
 
Sql joins inner join self join outer joins
Sql joins inner join self join outer joinsSql joins inner join self join outer joins
Sql joins inner join self join outer joins
 
Day 1 Data Stage Administrator And Director 11.0
Day 1 Data Stage Administrator And Director 11.0Day 1 Data Stage Administrator And Director 11.0
Day 1 Data Stage Administrator And Director 11.0
 
Ibm info sphere datastage tutorial part 1 architecture examples
Ibm info sphere datastage tutorial part 1  architecture examplesIbm info sphere datastage tutorial part 1  architecture examples
Ibm info sphere datastage tutorial part 1 architecture examples
 
Datastage real time scenario
Datastage real time scenarioDatastage real time scenario
Datastage real time scenario
 
SQL select statement and functions
SQL select statement and functionsSQL select statement and functions
SQL select statement and functions
 
Resume_Sathish
Resume_SathishResume_Sathish
Resume_Sathish
 
Curriculum Vitae - Dinesh Babu S V
Curriculum Vitae - Dinesh Babu S VCurriculum Vitae - Dinesh Babu S V
Curriculum Vitae - Dinesh Babu S V
 
Datastage developer Resume
Datastage developer ResumeDatastage developer Resume
Datastage developer Resume
 
SQL Joins
SQL JoinsSQL Joins
SQL Joins
 
SQL Joins and Query Optimization
SQL Joins and Query OptimizationSQL Joins and Query Optimization
SQL Joins and Query Optimization
 
Capturing Data Requirements
Capturing Data RequirementsCapturing Data Requirements
Capturing Data Requirements
 
SQL JOIN
SQL JOINSQL JOIN
SQL JOIN
 
Datastage free tutorial
Datastage free tutorialDatastage free tutorial
Datastage free tutorial
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 

Semelhante a Datastage Introduction To Data Warehousing

Evolutionary database design
Evolutionary database designEvolutionary database design
Evolutionary database designSalehein Syed
 
Amit Kumar_Resume
Amit Kumar_ResumeAmit Kumar_Resume
Amit Kumar_ResumeAmit Kumar
 
How to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer CertificationHow to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer Certificationelephantscale
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfRob Winters
 
Remote DBA Experts SQL Server 2008 New Features
Remote DBA Experts SQL Server 2008 New FeaturesRemote DBA Experts SQL Server 2008 New Features
Remote DBA Experts SQL Server 2008 New FeaturesRemote DBA Experts
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...Amazon Web Services
 
FME Server Workspace Patterns - Continued
FME Server Workspace Patterns - ContinuedFME Server Workspace Patterns - Continued
FME Server Workspace Patterns - ContinuedSafe Software
 
Staged Patching Approach in Oracle E-Business Suite
Staged Patching Approach in Oracle E-Business SuiteStaged Patching Approach in Oracle E-Business Suite
Staged Patching Approach in Oracle E-Business Suitevasuballa
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1sqlserver.co.il
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Victor Holman
 
Sap bods Training in Hyderabad | Sap bods Online Training
Sap bods Training in Hyderabad | Sap bods  Online Training Sap bods Training in Hyderabad | Sap bods  Online Training
Sap bods Training in Hyderabad | Sap bods Online Training CHENNAKESHAVAKATAGAR
 
Sap bods training in hyderabad
Sap bods training in hyderabadSap bods training in hyderabad
Sap bods training in hyderabadRajitha D
 
FME World Tour 2015 - FME & Data Migration Simon McCabe
FME World Tour 2015 -  FME & Data Migration Simon McCabeFME World Tour 2015 -  FME & Data Migration Simon McCabe
FME World Tour 2015 - FME & Data Migration Simon McCabeIMGS
 
Performing successful migrations to the microsoft cloud
Performing successful migrations to the microsoft cloudPerforming successful migrations to the microsoft cloud
Performing successful migrations to the microsoft cloudAndries den Haan
 
SharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSPC Adriatics
 
(ATS6-PLAT07) Managing AEP in an enterprise environment
(ATS6-PLAT07) Managing AEP in an enterprise environment(ATS6-PLAT07) Managing AEP in an enterprise environment
(ATS6-PLAT07) Managing AEP in an enterprise environmentBIOVIA
 
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...Amazon Web Services
 

Semelhante a Datastage Introduction To Data Warehousing (20)

Evolutionary database design
Evolutionary database designEvolutionary database design
Evolutionary database design
 
Amit Kumar_Resume
Amit Kumar_ResumeAmit Kumar_Resume
Amit Kumar_Resume
 
How to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer CertificationHow to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer Certification
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the Bijenkorf
 
Remote DBA Experts SQL Server 2008 New Features
Remote DBA Experts SQL Server 2008 New FeaturesRemote DBA Experts SQL Server 2008 New Features
Remote DBA Experts SQL Server 2008 New Features
 
Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
 
Boobalan_Muthukumarasamy_Resume_DW_8_Yrs
Boobalan_Muthukumarasamy_Resume_DW_8_YrsBoobalan_Muthukumarasamy_Resume_DW_8_Yrs
Boobalan_Muthukumarasamy_Resume_DW_8_Yrs
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
 
FME Server Workspace Patterns - Continued
FME Server Workspace Patterns - ContinuedFME Server Workspace Patterns - Continued
FME Server Workspace Patterns - Continued
 
Staged Patching Approach in Oracle E-Business Suite
Staged Patching Approach in Oracle E-Business SuiteStaged Patching Approach in Oracle E-Business Suite
Staged Patching Approach in Oracle E-Business Suite
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
 
Data migration
Data migrationData migration
Data migration
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
 
Sap bods Training in Hyderabad | Sap bods Online Training
Sap bods Training in Hyderabad | Sap bods  Online Training Sap bods Training in Hyderabad | Sap bods  Online Training
Sap bods Training in Hyderabad | Sap bods Online Training
 
Sap bods training in hyderabad
Sap bods training in hyderabadSap bods training in hyderabad
Sap bods training in hyderabad
 
FME World Tour 2015 - FME & Data Migration Simon McCabe
FME World Tour 2015 -  FME & Data Migration Simon McCabeFME World Tour 2015 -  FME & Data Migration Simon McCabe
FME World Tour 2015 - FME & Data Migration Simon McCabe
 
Performing successful migrations to the microsoft cloud
Performing successful migrations to the microsoft cloudPerforming successful migrations to the microsoft cloud
Performing successful migrations to the microsoft cloud
 
SharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi Vončina
 
(ATS6-PLAT07) Managing AEP in an enterprise environment
(ATS6-PLAT07) Managing AEP in an enterprise environment(ATS6-PLAT07) Managing AEP in an enterprise environment
(ATS6-PLAT07) Managing AEP in an enterprise environment
 
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
 

Mais de Vibrant Technologies & Computers

Data ware housing - Introduction to data ware housing process.
Data ware housing - Introduction to data ware housing process.Data ware housing - Introduction to data ware housing process.
Data ware housing - Introduction to data ware housing process.Vibrant Technologies & Computers
 

Mais de Vibrant Technologies & Computers (20)

Buisness analyst business analysis overview ppt 5
Buisness analyst business analysis overview ppt 5Buisness analyst business analysis overview ppt 5
Buisness analyst business analysis overview ppt 5
 
SQL Introduction to displaying data from multiple tables
SQL Introduction to displaying data from multiple tables  SQL Introduction to displaying data from multiple tables
SQL Introduction to displaying data from multiple tables
 
SQL- Introduction to MySQL
SQL- Introduction to MySQLSQL- Introduction to MySQL
SQL- Introduction to MySQL
 
SQL- Introduction to SQL database
SQL- Introduction to SQL database SQL- Introduction to SQL database
SQL- Introduction to SQL database
 
ITIL - introduction to ITIL
ITIL - introduction to ITILITIL - introduction to ITIL
ITIL - introduction to ITIL
 
Salesforce - Introduction to Security & Access
Salesforce -  Introduction to Security & Access Salesforce -  Introduction to Security & Access
Salesforce - Introduction to Security & Access
 
Data ware housing- Introduction to olap .
Data ware housing- Introduction to  olap .Data ware housing- Introduction to  olap .
Data ware housing- Introduction to olap .
 
Data ware housing - Introduction to data ware housing process.
Data ware housing - Introduction to data ware housing process.Data ware housing - Introduction to data ware housing process.
Data ware housing - Introduction to data ware housing process.
 
Data ware housing- Introduction to data ware housing
Data ware housing- Introduction to data ware housingData ware housing- Introduction to data ware housing
Data ware housing- Introduction to data ware housing
 
Salesforce - classification of cloud computing
Salesforce - classification of cloud computingSalesforce - classification of cloud computing
Salesforce - classification of cloud computing
 
Salesforce - cloud computing fundamental
Salesforce - cloud computing fundamentalSalesforce - cloud computing fundamental
Salesforce - cloud computing fundamental
 
SQL- Introduction to PL/SQL
SQL- Introduction to  PL/SQLSQL- Introduction to  PL/SQL
SQL- Introduction to PL/SQL
 
SQL- Introduction to advanced sql concepts
SQL- Introduction to  advanced sql conceptsSQL- Introduction to  advanced sql concepts
SQL- Introduction to advanced sql concepts
 
SQL Inteoduction to SQL manipulating of data
SQL Inteoduction to SQL manipulating of data   SQL Inteoduction to SQL manipulating of data
SQL Inteoduction to SQL manipulating of data
 
SQL- Introduction to SQL Set Operations
SQL- Introduction to SQL Set OperationsSQL- Introduction to SQL Set Operations
SQL- Introduction to SQL Set Operations
 
Sas - Introduction to designing the data mart
Sas - Introduction to designing the data martSas - Introduction to designing the data mart
Sas - Introduction to designing the data mart
 
Sas - Introduction to working under change management
Sas - Introduction to working under change managementSas - Introduction to working under change management
Sas - Introduction to working under change management
 
SAS - overview of SAS
SAS - overview of SASSAS - overview of SAS
SAS - overview of SAS
 
Teradata - Architecture of Teradata
Teradata - Architecture of TeradataTeradata - Architecture of Teradata
Teradata - Architecture of Teradata
 
Teradata - Restoring Data
Teradata - Restoring Data Teradata - Restoring Data
Teradata - Restoring Data
 

Último

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Último (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Datastage Introduction To Data Warehousing

  • 45. Transformer Stage • Used to define constraints, derivations, and column mappings • A column mapping maps an input column to an output column • In this module we will define only column mappings (no derivations)
  • 50. Adding Job Parameters • Makes the job more flexible • Parameters can be: – Used in constraints and derivations – Used in directory and file names • Parameter values are determined at run time
  • 51. Adding Job Documentation • Job Properties – Short and long descriptions – Shows in Manager • Annotation stage – Is a stage on the tool palette – Shows on the job GUI (work area)
  • 53. Annotation Stage on the Palette
  • 55. Final Job Work Area with Documentation
  • 58. Prerequisite to Job Execution Result from Designer compile
  • 59. DataStage Director • Can schedule, validate, and run jobs • Can be invoked from DataStage Manager or Designer – Tools > Run Director
  • 61. Run Options – Parameters and Limits
  • 63. Message Details are Available
  • 64. Other Director Functions • Schedule job to run on a particular date/time • Clear job log • Set Director options – Row limits – Abort after x warnings
  • 65. Process Flow • Administrator – add/delete projects, set defaults • Manager – import meta data, backup projects • Designer – assemble jobs, compile, and execute • Director – execute jobs, examine job run logs
  • 67. Administrator – Project Creation/Removal Functions specific to a project.
  • 68. Administrator – Project Properties RCP for parallel jobs should be enabled Variables for parallel processing
  • 69. Administrator – Environment Variables Variables are category specific
  • 70. OSH is what is run by the EE Framework
  • 72. Export Objects to MetaStage Push meta data to MetaStage
  • 73. Designer Workspace Can execute the job from Designer
  • 74. DataStage Generated OSH The EE Framework runs OSH
  • 75. Director – Executing Jobs Messages from previous run in different color
  • 76. Stages • You can now customize the Designer’s palette: select the desired stages and drag them to Favorites
  • 78. Row Generator • Can build test data (set the Repeatable property and edit rows on the Columns tab)
  • 79. Peek • Displays field values – Will be displayed in job log or sent to a file – Skip records option – Can control number of records to be displayed • Can be used as stub stage for iterative development (more later)
  • 80. Why EE is so Effective • Parallel processing paradigm – More hardware, faster processing – Level of parallelization is determined by a configuration file read at runtime • Emphasis on memory – Data read into memory and lookups performed like hash table
  • 81. Parallel Processing Systems • DataStage EE enables parallel processing = executing your application on multiple CPUs simultaneously – If you add more resources (CPUs, RAM, and disks) you increase system performance • Example: a system containing 6 CPUs (or processing nodes) and disks
  • 82. Scalable Systems: Examples • Three main types of scalable systems – Symmetric Multiprocessors (SMP): shared memory and disk – Clusters: UNIX systems connected via networks – MPP: Massively Parallel Processing
  • 83. SMP: Shared Everything • Multiple CPUs with a single operating system • Programs communicate using shared memory • All CPUs share system resources (OS, memory with a single linear address space, disks, I/O) • When used with Enterprise Edition: data transport uses shared memory; simplified startup
  • 84. Traditional Batch Processing (diagram: Operational Data and Archived Data are transformed, cleaned, and loaded into the Data Warehouse, landing to disk between operations) • Write to disk and read from disk before each processing operation • Sub-optimal utilization of resources – a 10 GB stream leads to 70 GB of I/O – processing resources can sit idle during I/O • Very complex to manage (lots and lots of small jobs) • Becomes impractical with big data volumes
  • 85. Data Pipelining (Pipeline Multiprocessing) • Transform, clean, and load processes execute simultaneously on the same processor; rows move forward through the flow from source to target • Start a downstream process while an upstream process is still running • This eliminates intermediate storing to disk, which is critical for big data • This also keeps the processors busy • Still has limits on scalability
  • 86. Data Partitioning (Partition Parallelism; diagram: source data split across Node 1–Node 4 by key range A-F, G-M, N-T, U-Z, each node running the same Transform) • Break up big data into partitions • Run one partition on each processor • 4X faster on 4 processors – with data big enough: 100X faster on 100 processors • This is exactly how parallel databases work! • Data partitioning requires applying the same transform to all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform
  • 87. Putting It All Together: Parallel Dataflow – combining parallelism types (diagram: Source Data → Transform → Clean → Load → Data Warehouse, with pipelining and partitioning applied together)
  • 88. Putting It All Together: Parallel Dataflow with Repartitioning on the Fly, Without Landing to Disk! (diagram: data is repartitioned between Transform, Clean, and Load – e.g., by customer last name, customer zip code, then credit card number)
  • 89. EE Program Elements • Dataset: uniform set of rows in the Framework's internal representation - Three flavors: 1. file sets *.fs : stored on multiple Unix files as flat files 2. persistent: *.ds : stored on multiple Unix files in Framework format read and written using the DataSet Stage 3. virtual: *.v : links, in Framework format, NOT stored on disk - The Framework processes only datasets—hence possible need for Import - Different datasets typically have different schemas - Convention: "dataset" = Framework data set. • Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file). - All the partitions of a dataset follow the same schema: that of the dataset
  • 90. DataStage EE Architecture (diagram; example sequential dataflow: Import → Clean1 → Clean2 → Merge → Analyze) • Orchestrate Framework: provides application scalability – configuration file, centralized error handling and event logging, parallel access to data in files and in RDBMSs, inter-node communications, parallel pipelining, parallelization of operations, performance visualization • DataStage: provides the data integration platform • DataStage Enterprise Edition: best-of-breed scalable data integration platform – no limitations on data volumes or throughput
  • 91. Introduction to DataStage EE • DSEE: – Automatically scales to fit the machine – Handles data flow among multiple CPU’s and disks • With DSEE you can: – Create applications for SMP’s, clusters and MPP’s… Enterprise Edition is architecture-neutral – Access relational databases in parallel – Execute external applications in parallel – Store data across multiple disks and nodes
  • 92. Job Design vs. Execution • The developer assembles the data flow using the Designer … and gets parallel access, propagation, transformation, and load • The design is good for 1 node, 4 nodes, or N nodes – to change the number of nodes, just swap the configuration file • No need to modify or recompile the design
  • 93. Partitioners and Collectors • Partitioners distribute rows into partitions – implement data-partition parallelism • Collectors = inverse partitioners • Live on input links of stages running – in parallel (partitioners) – sequentially (collectors) • Use a choice of methods
  • 95. Types of Sequential Data Stages • Sequential – Fixed or variable length • File Set • Lookup File Set • Data Set
  • 96. Sequential Stage Introduction • The EE Framework processes only datasets • For files other than datasets, such as flat files, Enterprise Edition must perform import and export operations – this is performed by import and export OSH operators generated by Sequential or FileSet stages • During import or export DataStage performs format translations – into, or out of, the EE internal format • Data is described to the Framework in a schema
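  As an illustration of the schema format mentioned above (a minimal sketch that simply reuses the record syntax and example field names shown on later slides of this material):

    record (
      prov_num: int16;
      member_num: int8;
      custid: int32;
    )

  A schema like this tells the Framework how to interpret each imported record before the data enters the EE internal format.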
  • 97. How the Sequential Stage Works • Generates Import/Export operators, depending on whether stage is source or target • Performs direct C++ file I/O streams
  • 98. Using the Sequential File Stage: Importing/Exporting Data • Both import and export of general files (text, binary) are performed by the Sequential File stage – data import converts the file into the EE internal format; data export converts it back out of the EE internal format
  • 99. Working With Flat Files • Sequential File Stage – Normally will execute in sequential mode – Can be parallel if reading multiple files (file pattern option) – Can use multiple readers within a node – DSEE needs to know • How file is divided into rows • How row is divided into columns
  • 100. Processes Needed to Import Data • Recordization – Divides input stream into records – Set on the format tab • Columnization – Divides the record into columns – Default set on the format tab but can be overridden on the columns tab – Can be “incomplete” if using a schema or not even specified in the stage if using RCP
  • 101. File Format Example (diagram: each record consists of Field 1 … Last field, separated by a field delimiter – here a comma – with a final delimiter of either a comma or the end of the record, and a record delimiter of nl, i.e. a newline)
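  For example (values are illustrative only), a file matching the layout above, with a comma as the field delimiter and a newline (nl) as the record delimiter, might contain:

    101,Widget,12.50
    102,Gadget,7.25

  Each comma separates one field from the next, and each newline ends a record.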
  • 102. Sequential File Stage • To set the properties, use the stage editor – Pages (General, Input/Output) – Tabs (Format, Columns) • Sequential stage link rules – One input link – One output link (except for reject link definition) – One reject link • Will reject any records not matching the meta data in the column definitions
  • 103. Job Design Using Sequential Stages Stage categories
  • 104. General Tab – Sequential Source Multiple output links Show records
  • 105. Properties – Multiple Files Click to add more files having the same meta data.
  • 106. Properties - Multiple Readers Multiple readers option allows you to set number of readers
  • 107. Format Tab File into records Record into columns
  • 109. Reject Link • Reject mode = output • Source – All records not matching the meta data (the column definitions) • Target – All records that are rejected for any reason • Meta data – one column, datatype = raw
  • 110. File Set Stage • Can read or write file sets • Files suffixed by .fs • File set consists of: 1. Descriptor file – contains location of raw data files + meta data 2. Individual raw data files • Can be processed in parallel
  • 111. File Set Stage Example Descriptor file
  • 112. File Set Usage • Why use a file set? – 2 GB limit on some file systems – Need to distribute data among nodes to prevent overruns – If used in parallel, runs faster than a sequential file
  • 113. Lookup File Set Stage • Can create file sets • Usually used in conjunction with Lookup stages
  • 114. Lookup File Set > Properties Key column specified Key column dropped in descriptor file
  • 115. Data Set • Operating system (Framework) file • Suffixed by .ds • Referred to by a control file • Managed by Data Set Management utility from GUI (Manager, Designer, Director) • Represents persistent data • Key to good performance in set of linked jobs
  • 116. Persistent Datasets • Accessed from/to disk with the DataSet stage • Two parts: – Descriptor file: contains metadata and data location, but NOT the data itself – Data file(s): contain the data; multiple Unix files (one per node), accessible in parallel • (Example: input.ds points to node1:/local/disk1/… and node2:/local/disk2/…, with schema record ( partno: int32; description: string; ))
  • 117. Data Set Stage Is the data partitioned?
  • 118. Engine Data Translation • Occurs on import – From sequential files or file sets – From RDBMS • Occurs on export – From datasets to file sets or sequential files – From datasets to RDBMS • Engine most efficient when processing internally formatted records (I.e. data contained in datasets)
  • 119. Managing DataSets • GUI (Manager, Designer, Director) – tools > data set management • Alternative methods – Orchadmin • Unix command line utility • List records • Remove data sets (will remove all components) – Dsrecords • Lists number of records in a dataset
  • 121. Data Set Management From Unix • Alternative method of managing file sets and data sets – Dsrecords • Gives the record count • Unix command-line utility • $ dsrecords ds_name, e.g. $ dsrecords myDS.ds returns 156999 records – Orchadmin • Manages EE persistent data sets • Unix command-line utility, e.g. $ orchadmin rm myDataSet.ds
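  Putting the two utilities together as a minimal command-line sketch (the data set name is illustrative; the record count is the example output quoted above):

    $ dsrecords myDS.ds
    156999 records
    $ orchadmin rm myDS.ds

  dsrecords reports the number of records in the persistent data set; orchadmin rm then removes the data set together with all of its component files.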
  • 122. Job Presentation Document using the annotation stage
  • 123. Job Properties Documentation Description shows in DS Manager and MetaStage Organize jobs into categories
  • 124. Naming conventions • Stages named after the – Data they access – Function they perform – DO NOT leave defaulted stage names like Sequential_File_0 • Links named for the data they carry – DO NOT leave defaulted link names like DSLink3
  • 125. Stage and Link Names Stages and links renamed to data they handle
  • 126. Create Reusable Job Components • Use Enterprise Edition shared containers when feasible Container
  • 127. Use Iterative Job Design • Use copy or peek stage as stub • Test job in phases – small first, then increasing in complexity • Use Peek stage to examine records
  • 128. Copy or Peek Stage Stub Copy stage
  • 129. Transformer Stage Techniques • Suggestions: – Always include a reject link. – Always test for null values before using a column in a function. – Try to use RCP and only map columns that have a derivation other than a copy (more on RCP later). – Be aware of column and stage variable data types; users often do not pay attention to the stage variable type. – Avoid type conversions; try to maintain the data type as imported.
  • 130. The Copy Stage • With 1 link in and 1 link out, the Copy stage is the ultimate "no-op" (place-holder) • Partitioners, Sort / Remove Duplicates, Rename, Drop column … can be inserted on: – the input link (Partitioning): partitioners, Sort, Remove Duplicates – the output link (Mapping page): Rename, Drop • It can sometimes replace the Transformer: – Rename – Drop – Implicit type conversions
  • 131. Developing Jobs 1. Keep it simple • Jobs with many stages are hard to debug and maintain. 2. Start small and build to the final solution • Use view data, Copy, and Peek. • Start from the source and work outward. • Develop with a 1-node configuration file. 3. Solve the business problem before the performance problem. • Don’t worry too much about partitioning until the sequential flow works as expected. 4. If you have to write to disk, use a persistent data set.
  • 133. Good Things to Have in Each Job • Use job parameters • Some helpful environment variables to add to job parameters – $APT_DUMP_SCORE • Reports the OSH score to the message log – $APT_CONFIG_FILE • Establishes runtime parameters to the EE engine, e.g. the degree of parallelization
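  A minimal sketch of how these variables could be supplied from a Unix shell before a run (the configuration file path is an assumption; inside DataStage they would normally be added as job parameters as described above):

    $ export APT_CONFIG_FILE=/opt/dsee/configs/4node.apt
    $ export APT_DUMP_SCORE=True

  APT_CONFIG_FILE selects the runtime configuration (and therefore the degree of parallelization); APT_DUMP_SCORE=True asks the engine to write the job score report to the log.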
  • 134. Setting Job Parameters Click to add environment variables
  • 135. DUMP SCORE Output • Setting APT_DUMP_SCORE yields a report in the log showing the mapping of nodes to partitions and the partitioners and collectors used
  • 136. Use Multiple Configuration Files • Make a set for 1X, 2X,…. • Use different ones for test versus production • Include as a parameter in each job
  • 137. Parallel Database Connectivity (diagram: traditional client-server vs Enterprise Edition) • Traditional client-server: only the RDBMS runs in parallel; each application (client) has only one connection to the parallel RDBMS • Enterprise Edition: the parallel server runs the applications (e.g. Sort, Load), and the application has parallel connections to the parallel RDBMS
  • 138. RDBMS Access Supported Databases Enterprise Edition provides high performance / scalable interfaces for: • DB2 • Informix • Oracle • Teradata
  • 139. RDBMS Access • Automatically convert RDBMS table layouts to/from Enterprise Edition Table Definitions • RDBMS nulls converted to/from nullable field values • Support for standard SQL syntax for specifying: – field list for SELECT statement – filter for WHERE clause • Can write an explicit SQL query to access RDBMS • EE supplies additional information in the SQL query
  • 140. RDBMS Stages • DB2/UDB Enterprise • Informix Enterprise • Oracle Enterprise • Teradata Enterprise
  • 141. RDBMS Usage • As a source – Extract data from table (stream link) – Extract as table, generated SQL, or user-defined SQL – User-defined can perform joins, access views – Lookup (reference link) – Normal lookup is memory-based (all table data read into memory) – Can perform one lookup at a time in DBMS (sparse option) – Continue/drop/fail options • As a target – Inserts – Upserts (Inserts and updates) – Loader
  • 142. RDBMS Source – Stream Link Stream link
  • 143. DBMS Source - User-defined SQL Columns in SQL statement must match the meta data in columns tab
  • 144. DBMS Source – Reference Link Reject link
  • 145. Lookup Reject Link “Output” option automatically creates the reject link
  • 146. Null Handling • Must handle null condition if lookup record is not found and “continue” option is chosen • Can be done in a transformer stage
  • 148. Lookup Stage Properties (reference link) • Must have the same column name in the input and reference links • You will get the results of the lookup in the output column
  • 149. DBMS as a Target
  • 150. DBMS As Target • Write Methods – Delete – Load – Upsert – Write (DB2) • Write mode for load method – Truncate – Create – Replace – Append
  • 152. Checking for Nulls • Use Transformer stage to test for fields with null values (Use IsNull functions) • In Transformer, can reject or load default value
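  A sketch of the kind of derivation this implies (link and column names are hypothetical; the expression uses the IsNull function named above in the Transformer derivation syntax):

    If IsNull(lkpCustomer.Description) Then 'UNKNOWN' Else lkpCustomer.Description

  Here a failed lookup with the “continue” option leaves Description null, and the derivation substitutes a default value instead.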
  • 153. Concepts • The Enterprise Edition Platform – Script language: OSH (generated by the DataStage Parallel Canvas and run by DataStage Director) – Communication: conductor, section leaders, players – Configuration files (only one active at a time; describes the hardware) – Meta data: schemas/tables – Schema propagation: RCP – EE extensibility: Buildop, Wrapper – Datasets (data in the Framework's internal representation)
  • 154. DS-EE Stage Elements (diagram: an EE stage involves a series of processing steps – input interface, partitioner, business logic, output interface; example input and output data set schema: prov_num:int16; member_num:int8; custid:int32) • A piece of application logic running against individual records • Parallel or sequential
  • 155. DSEE Stage Execution • EE delivers parallelism in two ways – Pipeline – Partition • Block buffering between components – eliminates the need for program load balancing – maintains orderly data flow • Dual parallelism eliminates bottlenecks! (diagram: producer → consumer, pipelined and partitioned)
  • 156. Stages Control Partition Parallelism • Execution Mode (sequential/parallel) is controlled by Stage – default = parallel for most Ascential-supplied Stages – Developer can override default mode – Parallel Stage inserts the default partitioner (Auto) on its input links – Sequential Stage inserts the default collector (Auto) on its input links – Developer can override default • execution mode (parallel/sequential) of Stage > Advanced tab • choice of partitioner/collector on Input > Partitioning tab
  • 157. How Parallel Is It? • Degree of parallelism is determined by the configuration file – Total number of logical nodes in default pool, or a subset if using "constraints". • Constraints are assigned to specific pools as defined in configuration file and can be referenced in the stage
  • 158. OSH • DataStage EE GUI generates OSH scripts – Ability to view OSH turned on in Administrator – OSH can be viewed in Designer using job properties • The Framework executes OSH • What is OSH? – Orchestrate shell – Has a UNIX command-line interface
  • 159. OSH Script • An osh script is a quoted string which specifies: – The operators and connections of a single Orchestrate step – In its simplest form, it is: osh “op < in.ds > out.ds” • Where: – op is an Orchestrate operator – in.ds is the input data set – out.ds is the output data set
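  Building on the simplest form above, operators can also be chained so that the output of one becomes the input of the next as a virtual dataset (a sketch only; op1 and op2 stand for any Orchestrate operators and are not specific operator names):

    osh “op1 < in.ds | op2 > out.ds”

  The pipe connects the two operators directly, so no intermediate data set is landed to disk between them.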
  • 160. OSH Operators • OSH Operator is an instance of a C++ class inheriting from APT_Operator • Developers can create new operators • Examples of existing operators: – Import – Export – RemoveDups
  • 161. Enable Visible OSH in Administrator Will be enabled for all projects
  • 162. View OSH in Designer Schema Operator
  • 163. Elements of a Framework Program • Operators • Datasets: sets of rows processed by the Framework – Orchestrate data sets: persistent (terminal) *.ds and virtual (internal) *.v – Also: flat “file sets” *.fs • Schema: data description (metadata) for datasets and links
  • 164. Datasets • Consist of partitioned data and schema • Can be persistent (*.ds) or virtual (*.v, a link) • Overcome the 2 GB file limit (diagram: what you program – GUI or $ osh “operator_A > x.ds“ – versus what gets generated and processed: Operator A runs on Node 1 through Node 4, producing multiple data files of x.ds per partition, each file up to 2 GB or larger)
  • 165. Computing Architectures: Definition (diagram) • Uniprocessor: one CPU with dedicated disk and memory • SMP system (Symmetric Multiprocessor): multiple CPUs with shared memory and shared disk • Clusters and MPP systems: shared nothing – each CPU has its own memory and disk
  • 166. Job Execution: Orchestrate • Conductor – the initial DS/EE process – Step composer – Creates Section Leader processes (one per node) – Consolidates messages and outputs them – Manages orderly shutdown • Section Leader – Forks Player processes (one per stage) – Manages up/down communication • Players – The actual processes associated with stages – Combined players: one process only – Send stderr to the Section Leader – Establish connections to other players for data flow (diagram: a Conductor node with the Conductor process; processing nodes each with a Section Leader and its Players)
  • 167. Working with Configuration Files • You can easily switch between config files: – a '1-node' file for sequential execution and lighter reports (handy for testing) – a 'MedN-nodes' file that aims at a mix of pipeline and data-partitioned parallelism – a 'BigN-nodes' file that aims at full data-partitioned parallelism • Only one file is active while a step is running • The Framework queries (first) the environment variable $APT_CONFIG_FILE • The number of nodes declared in the config file need not match the number of CPUs • The same configuration file can be used on development and target machines
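  For reference, a minimal single-node configuration file might look like the following sketch (the node name, fastname, and resource paths are assumptions; a 'BigN-nodes' file would simply list more node blocks):

    {
      node "node1" {
        fastname "devserver"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
    }

  Pointing $APT_CONFIG_FILE at this file gives sequential execution; swapping in a multi-node file changes the degree of parallelism without touching the job design.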
  • 168. Scheduling Nodes, Processes, and CPUs • DS/EE does not: – know how many CPUs are available – schedule processes • Who knows what, and who does what? – The user declares the logical nodes in the configuration file – DS/EE (Orchestrate) knows the nodes and the operators and creates (Nodes × Ops) Unix processes – The O/S schedules these processes on the available CPUs • Where: Nodes = number of logical nodes declared in the config file; Ops = number of operators (approximately the number of stages in the visual flow); Processes = number of Unix processes = Nodes × Ops; CPUs = number of available CPUs
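  As a worked example of the Nodes × Ops rule: a configuration file declaring 4 logical nodes running a flow of 5 operators produces roughly 4 × 5 = 20 Unix player processes, and it is the operating system, not DS/EE, that schedules those 20 processes across however many CPUs are actually present.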
  • 169. Re-Partitioning • Parallel-to-parallel flow may incur reshuffling: records may jump between nodes (diagram: a partitioner sits between node 1 and node 2)
  • 170. Partitioning Methods • Auto • Hash • Entire • Range • Range Map
  • 171. Collectors • Collectors combine the partitions of a dataset into a single input stream to a sequential stage • Collectors do NOT synchronize data
  • 172. Partitioning and Repartitioning Are Visible On Job Design
  • 173. Partitioning and Collecting Icons Partitioner Collector
  • 174. Reading Messages in Director • Set APT_DUMP_SCORE to true • Can be specified as job parameter • Messages sent to Director log • If set, parallel job will produce a report showing the operators, processes, and datasets in the running job
  • 176. Transformed Data • Transformed data is: – An outgoing column whose derivation may, or may not, include incoming fields or parts of incoming fields – May be comprised of system variables • Frequently uses functions performed on something (i.e. incoming columns) – Functions are divided into categories, e.g.: • Date and time • Mathematical • Logical • Null handling • More
  • 177. Stages Review • Stages that can transform data – Transformer • Parallel • Basic (from Parallel palette) – Aggregator (discussed in later module) • Sample stages that do not transform data – Sequential – FileSet – DataSet – DBMS
  • 178. Transformer Stage Functions • Control data flow • Create derivations
  • 179. Flow Control • Separate records flow down links based on data condition – specified in Transformer stage constraints • Transformer stage can filter records • Other stages can filter records but do not exhibit advanced flow control – Sequential can send bad records down reject link – Lookup can reject records based on lookup failure – Filter can select records based on data value
  • 180. Rejecting Data • Reject option on sequential stage – Data does not agree with meta data – Output consists of one column with binary data type • Reject links (from Lookup stage) result from the drop option of the property “If Not Found” – Lookup “failed” – All columns on reject link (no column mapping option) • Reject constraints are controlled from the constraint editor of the transformer – Can control column mapping – Use the “Other/Log” checkbox
  • 181. Rejecting Data Example “If Not Found” property Constraint – Other/log option Property Reject Mode = Output
  • 183. Transformer Stage Variables • The first of the Transformer stage entities to execute • Execute in order from top to bottom – Can write a program by using one stage variable to point to the results of a previous stage variable • Multi-purpose – Counters – Hold values from previous rows for comparison – Hold derivations to be used in multiple field derivations – Can be used to control execution of constraints
  • 185. Transforming Data • Derivations – Using expressions – Using functions • Date/time • Transformer Stage Issues – Sometimes require sorting before the Transformer stage – E.g., using a stage variable as an accumulator and needing to break on a change of column value • Checking for nulls
  • 186. Checking for Nulls • Nulls can get introduced into the dataflow because of failed lookups and the way in which you chose to handle this condition • Can be handled in constraints, derivations, stage variables, or a combination of these
  • 187. Transformer - Handling Rejects Constraint Rejects – All expressions are false and reject row is checked
  • 188. Transformer: Execution Order • Derivations in stage variables are executed first • Constraints are executed before derivations • Column derivations in earlier links are executed before later links • Derivations in higher columns are executed before lower columns
  • 189. Parallel Palette - Two Transformers • All > Processing > Transformer – The non-Universe transformer – Has a specific set of functions – No DS routines available • Parallel > Processing > Basic Transformer – Makes server-style transforms available on the parallel palette – Can use DS routines • Program in Basic for both transformers
  • 190. Transformer Functions From Derivation Editor • Date & Time • Logical • Null Handling • Number • String • Type Conversion
  • 191. Sorting Data • Important because – Some stages require sorted input – Some stages may run faster – E.g., the Aggregator • Can be performed – As an option within stages (use the input > partitioning tab and set partitioning to anything other than Auto) – As a separate stage (for more complex sorts)
  • 192. Sorting Alternatives • Alternative representation of same flow:
  • 193. Sort Option on Stage Link
  • 195. Sort Stage - Outputs • Specifies how the output is derived
  • 196. Sort Specification Options • Input Link Property – Limited functionality – Max memory/partition is 20 MB, then spills to scratch • Sort Stage – Tunable to use more memory before spilling to scratch. • Note: Spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE
  • 197. Removing Duplicates • Can be done by Sort stage – Use unique option OR • Remove Duplicates stage – Has more sophisticated ways to remove duplicates
  • 198. Combining Data • There are two ways to combine data: – Horizontally: Several input links; one output link (+ optional rejects) made of columns from different input links. E.g., • Joins • Lookup • Merge – Vertically: One input link; one output link whose columns combine values from multiple input rows. E.g., • Aggregator
  • 199. Join, Lookup & Merge Stages • These "three Stages" combine two or more input links according to values of user-designated "key" column(s). • They differ mainly in: – Memory usage – Treatment of rows with unmatched key values – Input requirements (sorted, de-duplicated)
  • 200. Not All Links Are Created Equal • Enterprise Edition distinguishes between: – The Primary Input (Framework port 0) – Secondary inputs - in some cases "Reference" (other ports) • Naming convention: – Primary Input (port 0): Joins = Left, Lookup = Source, Merge = Master – Secondary Input(s) (ports 1, …): Joins = Right, Lookup = LU Table(s), Merge = Update(s) • Tip: Check the "Input Ordering" tab to make sure the intended Primary is listed first
  • 201. Join Stage Editor One of four variants: – Inner – Left Outer – Right Outer – Full Outer Several key columns allowed Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)
  • 202. 1. The Join Stage • 2 sorted input links, 1 output link – "left outer" on the primary input, "right outer" on the secondary input – Pre-sorting makes joins "lightweight": few rows need to be in RAM • Four types: – Inner – Left Outer – Right Outer – Full Outer
  • 203. 2. The Lookup Stage • Combines: – one source link with – one or more duplicate-free lookup table (LUT) links • No pre-sort necessary • Allows multiple keys • Flexible exception handling for source input rows with no match (diagram: the source input and one or more LUTs feed the Lookup stage, which produces an output link and a reject link)
  • 204. The Lookup Stage • Lookup tables should be small enough to fit into physical memory (otherwise, a performance hit due to paging) • On an MPP you should partition the lookup tables using the Entire partitioning method, or partition them the same way you partition the source link • On an SMP, no physical duplication of a lookup table occurs
  • 205. The Lookup Stage • Lookup File Set – Like a persistent data set, except that it also contains metadata about the key – Useful for staging lookup tables • RDBMS lookup – NORMAL • Loads the table into an in-memory hash table first – SPARSE • Issues a select for each incoming row • Might become a performance bottleneck
  • 206. 3. The Merge Stage • Combines – one sorted, duplicate-free master (primary) link with – one or more sorted update (secondary) links – Pre-sorting makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup) • Follows the Master-Update model: – A master row and one or more update rows are merged if they have the same value in the user-specified key column(s) – If a non-key column occurs in several inputs, the lowest input port number prevails (e.g., master over update; update values are ignored) – Unmatched ("bad") master rows can be either • kept • dropped – Unmatched ("bad") update rows in an input link can be captured in a "reject" link – Matched update rows are consumed
  • 207. The Merge Stage • Allows composite keys • Multiple update links • Matched update rows are consumed • Unmatched updates can be captured • Lightweight • Space/time tradeoff: presorts vs. in-RAM table (diagram: one master link and one or more update links feed the Merge stage, which produces one output link and one reject link per update link)
  • 208. Synopsis: Joins, Lookup, & Merge (in the original table, the comma separates values for primary vs. secondary input links, and out vs. reject links) – Model – Joins: RDBMS-style relational; Lookup: Source - in-RAM LU Table; Merge: Master - Update(s) – Memory usage – Joins: light; Lookup: heavy; Merge: light – # and names of inputs – Joins: exactly 2 (1 left, 1 right); Lookup: 1 Source, N LU Tables; Merge: 1 Master, N Update(s) – Mandatory input sort – Joins: both inputs; Lookup: no; Merge: all inputs – Duplicates in primary input – Joins: OK (x-product); Lookup: OK; Merge: Warning! – Duplicates in secondary input(s) – Joins: OK (x-product); Lookup: Warning!; Merge: OK only when N = 1 – Options on unmatched primary – Joins: NONE; Lookup: [fail] | continue | drop | reject; Merge: [keep] | drop – Options on unmatched secondary – Joins: NONE; Lookup: NONE; Merge: capture in reject set(s) – On match, secondary entries are – Joins: reusable; Lookup: reusable; Merge: consumed – # Outputs – Joins: 1; Lookup: 1 out (1 reject); Merge: 1 out (N rejects) – Captured in reject set(s) – Joins: nothing (N/A); Lookup: unmatched primary entries; Merge: unmatched secondary entries
  • 209. The Aggregator Stage • Purpose: Perform data aggregations • Specify: • Zero or more key columns that define the aggregation units (or groups) • Columns to be aggregated • Aggregation functions: count (nulls/non-nulls), sum, max/min/range • The grouping method (hash table or pre-sort) is a performance issue
  • 210. Grouping Methods • Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed – Doesn't require sorted data – Good when the number of unique groups is small; the running tally for each group's aggregate calculations needs to fit easily into memory; requires about 1 KB of RAM per group – Example: average family income by state requires roughly 50 groups x 1 KB = 0.05 MB of RAM • Sort: results for only a single aggregation group are kept in memory; when a new group is seen (the key value changes), the current group is written out – Requires input sorted by the grouping keys – Can handle unlimited numbers of groups – Example: average daily balance by credit card
  • 211. Aggregator Functions • Sum • Min, max • Mean • Missing value count • Non-missing value count • Percent coefficient of variation
  • 214. Containers • Two varieties – Local – Shared • Local – Simplifies a large, complex diagram • Shared – Creates reusable object that many jobs can include
  • 215. Creating a Container • Create a job • Select (loop) portions to containerize • Edit > Construct container > local or shared
  • 216. Configuration File Concepts • Determines the processing nodes and the disk space connected to each node • When the system changes, you need only change the configuration file – no need to recompile jobs • When a DataStage job runs, the platform reads the configuration file – The platform automatically scales the application to fit the system
  • 217. Processing Nodes Are • Locations on which the framework runs applications • Logical rather than physical construct • Do not necessarily correspond to the number of CPUs in your system – Typically one node for two CPUs • Can define one processing node for multiple physical nodes or multiple processing nodes for one physical node
  • 218. Optimizing Parallelism • Degree of parallelism determined by number of nodes defined • Parallelism should be optimized, not maximized – Increasing parallelism distributes work load but also increases Framework overhead • Hardware influences degree of parallelism possible • System hardware partially determines configuration
  • 219. More Factors to Consider • Communication amongst operators – Should be optimized by your configuration – Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or high-speed link • SMP – leave some processors for operating system • Desirable to equalize partitioning of data • Use an experimental approach – Start with small data sets – Try different parallelism while scaling up data set sizes
  • 220. Configuration File • Text file containing string data that is passed to the Framework – Sits on the server side – Can be displayed and edited • Name and location are found in the environment variable APT_CONFIG_FILE • Components – Node – Fast name – Pools – Resource
  • 221. Node Options • Node name – name of a processing node used by EE – Typically the network name – Use the command uname -n to obtain the network name • Fastname – Name of the node as referred to by the fastest network in the system – Operators use the physical node name to open connections – NOTE: on an SMP, all CPUs share a single connection to the network • Pools – Names of pools to which this node is assigned – Used to logically group nodes – Can also be used to group resources • Resource – Disk – Scratchdisk
  • 222. Sample Configuration File { node "Node1" { fastname "BlackHole" pools "" "node1" resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools "" } resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools "" } } }
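For comparison, a hedged sketch of what a two-node version of the same file might look like (the host name and paths are placeholders); each additional node entry gives parallel stages one more partition without recompiling any job:

    {
      node "node1" {
        fastname "server1"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
      node "node2" {
        fastname "server1"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
    }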
  • 223. Disk Pools • Disk pools allocate storage • By default, EE uses the default pool, specified by "" • A resource can also be assigned to a named pool, e.g., pool "bigdata"
  • 224. Sorting Requirements Resource pools can also be specified for sorting: • The Sort stage looks first for scratch disk resources in a “sort” pool, and then in the default disk pool
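A hedged config-file fragment illustrating the idea (the path is a placeholder): adding a scratchdisk resource to the "sort" pool, in addition to the default pool, gives the Sort stage a dedicated spill area:

    resource scratchdisk "/fastdisk/sort_scratch" {pools "" "sort"}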
  • 225. Resource Types • Disk • Scratchdisk • DB2 • Oracle • Saswork • Sortwork • Can exist in a pool – Groups resources together
  • 226. Using Different Configurations Lookup stage where DBMS is using a sparse lookup type
  • 227. Building a Configuration File • Scoping the hardware: – Is the hardware configuration SMP, Cluster, or MPP? – Define each node structure (an SMP would be single node): • Number of CPUs • CPU speed • Available memory • Available page/swap space • Connectivity (network/back-panel speed) – Is the machine dedicated to EE? If not, what other applications are running on it? – Get a breakdown of the resource usage (vmstat, mpstat, iostat) – Are there other configuration restrictions? E.g. DB only runs on certain nodes and ETL cannot run on them?
  • 228. Wrappers vs. Buildop vs. Custom • Wrappers are good if you cannot or do not want to modify the application and performance is not critical. • Buildops: good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces. • Custom (C++ coding using framework API): good if you need custom coding and need dynamic input and output interfaces.
  • 229. Building "Wrapped" Stages • You can "wrapper" a legacy executable: • Binary • Unix command • Shell script … and turn it into an Enterprise Edition stage capable, among other things, of parallel execution… • As long as the legacy executable is: • amenable to data-partition parallelism » no dependencies between rows • pipe-safe » can read rows sequentially » no random access to data
  • 230. Wrappers (Cont’d) Wrappers are treated as a black box • EE has no knowledge of contents • EE has no means of managing anything that occurs inside the wrapper • EE only knows how to export data to and import data from the wrapper • User must know at design time the intended behavior of the wrapper and its schema interface • If the wrappered application needs to see all records prior to processing, it cannot run in parallel.
  • 231. LS Example • Can this command be wrappered?
  • 232. Creating a Wrapper Used in this job --- To create the “ls” stage
  • 233. Creating Wrapped Stages • From Manager: right-click on Stage Type > New Parallel Stage > Wrapped • We will "wrapper" an existing Unix executable – the ls command – as our wrapper starting point
  • 234. Wrapper - General Page Unix command to be wrapped Name of stage
  • 235. Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others. The "Creator" Page
  • 236. Wrapper – Properties Page • If your stage will have properties, complete the Properties page • This will be the name of the property as it appears in your stage
  • 237. Wrapper - Wrapped Page Interfaces – input and output columns - these should first be entered into the table definitions meta data (DS Manager); let’s do that now.
  • 238. Interface Schemas • Layout interfaces describe what columns the stage: – Needs for its inputs (if any) – Creates for its outputs (if any) • These should be created as table definitions with columns in Manager
  • 239. Column Definition for Wrapper Interface
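As a hedged illustration (the column name and length are ours, not taken from the course files), the output interface for the ls wrapper can be a single string column that receives one file name per row from the command's stdout; expressed in Orchestrate schema form it would look roughly like:

    record
    (
      file_name: string[max=255];
    )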
  • 240. How Does the Wrapping Work? – Define the schema for export and import • Schemas become interface schemas of the operator and allow for by-name column access (diagram: input schema -> export -> stdin or named pipe -> UNIX executable -> stdout or named pipe -> import -> output schema)
  • 241. Update the Wrapper Interfaces • This wrapper will have no input interface – i.e. no input link. The location will come as a job parameter that will be passed to the appropriate stage property. Therefore, only the Output tab entry is needed.
  • 243. Job Run • Show file from Designer palette
  • 244. Wrapper Story: Cobol Application • Hardware Environment: – IBM SP2, 2 nodes with 4 CPU’s per node. • Software: – DB2/EEE, COBOL, EE • Original COBOL Application: – Extracted source table, performed lookup against table in DB2, and Loaded results to target table. – 4 hours 20 minutes sequential execution • Enterprise Edition Solution: – Used EE to perform Parallel DB2 Extracts and Loads – Used EE to execute COBOL application in Parallel – EE Framework handled data transfer between DB2/EEE and COBOL application – 30 minutes 8-way parallel execution
  • 245. Buildops • Buildop provides a simple means of extending beyond the functionality provided by EE, but, unlike the wrapper, does not use an existing executable • Reasons to use Buildop include: • Speed / Performance • Complex business logic that cannot be easily represented using existing stages – Lookups across a range of values – Surrogate key generation – Rolling aggregates • Build once and reuse everywhere within the project; no shared container necessary • Can combine functionality from different stages into one
  • 246. BuildOps – The DataStage programmer encapsulates the business logic – The Enterprise Edition interface called "buildop" automatically performs the tedious, error-prone tasks: it includes the needed header files and builds the necessary "plumbing" for correct and efficient parallel execution – Exploits the extensibility of the EE Framework
  • 247. From Manager (or Designer): Repository pane: Right-Click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped} • "Build" stages from within Enterprise Edition • "Wrapping” existing “Unix” executables BuildOp Process Overview
  • 248. General Page Identical to Wrappers, except: Under the Build Tab, your program!
  • 249. Logic Tab for Business Logic • Enter business C/C++ logic and arithmetic in four pages under the Logic tab • The main code section goes in the Per-Record page – it will be applied to all rows • NOTE: Code needs to be ANSI C/C++ compliant. If code does not compile outside of EE, it won't compile within EE either!
  • 250. Code Sections under the Logic Tab • Definitions page: temporary variables are declared [and initialized] here • Pre-Loop page: logic here is executed once BEFORE processing the FIRST row • Per-Record page: the main code, applied to every row (see the previous slide) • Post-Loop page: logic here is executed once AFTER processing the LAST row
  • 251. I/O and Transfer • Under the Interface tab: Input, Output & Transfer pages • Input page: 'Auto Read' reads the next row; points to an in-repository Table Definition • Output page: the first line is output 0; optional renaming of the output port from the default "out0"; 'Write row' writes the row • A 'False' setting is used so as not to interfere with the Transfer page
  • 252. I/O and Transfer • Transfer page: transfers all columns from input to output (first line: transfer of index 0) • If the page is left blank or Auto Transfer = "False" (and RCP = "False"), only the columns in the output Table Definition are written
  • 253. BuildOp Simple Example • Example - sumNoTransfer – Add input columns "a" and "b"; ignore other columns that might be present in the input – Produce a new "sum" column – Do not transfer input columns (diagram: sumNoTransfer stage; input columns a:int32, b:int32; output column sum:int32)
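A minimal sketch of what the Logic tab pages for sumNoTransfer might contain; "a", "b", and "sum" are the interface columns shown above, while the row counter is an illustrative extra, not part of the course example:

    // Definitions page: temporary C++ variables
    int rowCount;

    // Pre-Loop page: executed once before the first row
    rowCount = 0;

    // Per-Record page: executed for every row; interface columns are used by name
    sum = a + b;
    rowCount++;

    // Post-Loop page: executed once after the last row (nothing needed here)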
  • 254. NO TRANSFER - RCP set to "False" in stage definition and - Transfer page left blank, or Auto Transfer = "False" • Effects: - input columns "a" and "b" are not transferred - only new column "sum" is transferred From Peek: No Transfer
  • 255. Transfer TRANSFER - RCP set to "True" in stage definition or - Auto Transfer set to "True" • Effects: - new column "sum" is transferred, as well as - input columns "a" and "b" and - input column "ignored" (present in input, but not mentioned in stage)
  • 256. Columns vs. Temporary C++ Variables Columns • DS-EE type • Defined in Table Definitions • Value refreshed from row to row Temp C++ variables • C/C++ type • Need declaration (in Definitions or Pre-Loop page) • Value persistent throughout "loop" over rows, unless modified in code
  • 257. Custom Stage • Reasons for a custom stage: – Add EE operator not already in DataStage EE – Build your own Operator and add to DataStage EE • Use EE API • Use Custom Stage to add new operator to EE canvas
  • 258. Custom Stage DataStage Manager > select Stage Types branch > right click
  • 259. Custom Stage Name of Orchestrate operator to be used Number of input and output links allowed
  • 260. Custom Stage – Properties Tab
  • 262. Establishing Meta Data • Data definitions – Recordization and columnization – Fields have properties that can be set at the individual field level • Data types in the GUI are translated to types used by EE – Described as properties on the format/columns tab (Outputs or Inputs pages) OR – Using a schema file (can be full or partial) • Schemas – Can be imported into Manager – Can be pointed to by some job stages (e.g., Sequential)
  • 263. Data Formatting – Record Level • Format tab • Meta data described on a record basis • Record level properties
  • 264. Data Formatting – Column Level • Defaults for all columns
  • 265. Column Overrides • Edit row from within the columns tab • Set individual column properties
  • 267. Extended Properties – String Type • Note the ability to convert ASCII to EBCDIC
  • 269. Schema • Alternative way to specify column definitions for data used in EE jobs • Written in a plain text file • Can be written as a partial record definition • Can be imported into the DataStage repository
  • 270. Creating a Schema • Using a text editor – Follow correct syntax for definitions – OR • Import from an existing data set or file set – On DataStage Manager import > Table Definitions > Orchestrate Schema Definitions – Select checkbox for a file with .fs or .ds
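A hedged example of what such a plain-text schema file might contain (the field names and types are illustrative only); the syntax follows the record(...) form used by Orchestrate schemas:

    record
    (
      CustID: int32;
      Name: string[max=30];
      Balance: decimal[10,2];
      OpenDate: date;
    )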
  • 271. Importing a Schema Schema location can be on the server or local work station
  • 272. Data Types • Date • Decimal • Floating point • Integer • String • Time • Timestamp • Vector • Subrecord • Raw • Tagged
  • 273. Runtime Column Propagation • DataStage EE is flexible about meta data. It can cope with the situation where meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP). • RCP is always on at runtime. • At design and compile time, column mapping enforcement applies: – RCP is off by default – Enable it first at the project level (Administrator project properties) – Enable it at the job level (job properties General tab) – Enable it at the stage level (link Output Column tab)
  • 274. Enabling RCP at Project Level
  • 275. Enabling RCP at Job Level
  • 276. Enabling RCP at Stage Level • Go to output link’s columns tab • For transformer you can find the output links columns tab by first going to stage properties
  • 277. Using RCP with Sequential Stages • To utilize runtime column propagation in the sequential stage you must use the “use schema” option • Stages with this restriction: – Sequential – File Set – External Source – External Target
  • 278. Runtime Column Propagation • When RCP is Disabled – DataStage Designer will enforce Stage Input Column to Output Column mappings. – At job compile time modify operators are inserted on output links in the generated osh.
  • 279. Runtime Column Propagation • When RCP is enabled – DataStage Designer will not enforce mapping rules – No Modify operator is inserted at compile time – Danger of a runtime error if incoming column names do not match outgoing column names – column names are case sensitive
  • 280. Job Control Options • Manually write job control – Code is generated in BASIC – Use the Job control tab on the job properties page – Generates BASIC code which you can modify • Job Sequencer – Build a controlling job much the same way you build other jobs – Comprised of stages and links – No BASIC coding
  • 281. Job Sequencer • Built like a regular job • Type “Job Sequence” • Has stages and links • The Job Activity stage represents a DataStage job • Links represent passing control between stages
  • 283. Job Activity Properties Job parameters to be passed Job to be executed – select from dropdown
  • 284. Job Activity Trigger • Trigger appears as a link in the diagram • Custom options let you define the code
  • 285. Options • Use the custom option for conditionals – Execute if the job ran successfully or finished with warnings only • Can add a “wait for file” activity before executing • Add an “execute command” stage to drop real tables and rename new tables to the current tables
  • 286. Job Activity With Multiple Links Different links having different triggers
  • 287. Sequencer Stage • Build job sequencer to control job for the collections application Can be set to all or any
  • 290. Sample DataStage log from Mail Notification
  • 297. The Director Typical Job Log Messages: • Environment variables • Configuration File information • Framework Info/Warning/Error messages • Output from the Peek Stage • Additional info with "Reporting" environments • Tracing/Debug output – Must compile job in trace mode – Adds overhead
  • 298. Job-Level Environment Variables • Set in Job Properties, from the menu bar of Designer • Director will prompt you for them before each run
  • 299. Troubleshooting • If you get an error during compile, check the following: • Compilation problems – If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH – If there are Buildop errors, try buildop from the command line – Some stages may not support RCP – this can cause a column mismatch – Use the Show Error and More buttons – Examine the generated OSH – Check environment variable settings • Very little integrity checking is done during compile, so run Validate from Director, which highlights the source of the error
  • 300. Generating Test Data • Row Generator stage can be used – Column definitions – Data type dependent • Row Generator plus lookup stages provides good way to create robust test data from pattern files
  • 301. Thank You!!! For more information click the link below: http://vibranttechnologies.co.in/datastage-classes-in-mumbai.html

Editor's Notes

  1. DataStage is a comprehensive tool for the fast, easy creation and maintenance of data marts and data warehouses. It provides the tools you need to build, manage, and expand them. With DataStage, you can build solutions faster and give users access to the data and reports they need. With DataStage you can: ·        Design the jobs that extract, integrate, aggregate, load, and transform the data for your data warehouse or data mart. ·        Create and reuse metadata and job components. ·        Run, monitor, and schedule these jobs. ·        Administer your development and execution environments.
  2. The DataStage client components are: Administrator Administers DataStage projects and conducts housekeeping on the server Designer Creates DataStage jobs that are compiled into executable programs Director Used to run and monitor the DataStage jobs Manager Allows you to view and edit the contents of the repository
  3. Use the Administrator to specify general server defaults, add and delete projects, and to set project properties. The Administrator also provides a command interface to the UniVerse repository. ·        Use the Administrator Project Properties window to: ·        Set job monitoring limits and other Director defaults on the General tab. ·        Set user group privileges on the Permissions tab. ·        Enable or disable server-side tracing on the Tracing tab. ·        Specify a user name and password for scheduling jobs on the Schedule tab. ·        Specify hashed file stage read and write cache sizes on the Tunables tab.
  4. Use the Manager to store and manage reusable metadata for the jobs you define in the Designer. This metadata includes table and file layouts and routines for transforming extracted data. Manager is also the primary interface to the DataStage repository. In addition to table and file layouts, it displays the routines, transforms, and jobs that are defined in the project. Custom routines and transforms can also be created in Manager.
  5. The DataStage Designer allows you to use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating and loading data into warehouse tables. The Designer provides a “visual data flow” method to easily interconnect and configure reusable components.
  6. Use the Director to validate, run, schedule, and monitor your DataStage jobs. You can also gather statistics as the job runs.
  7. ·        Define your project’s properties: Administrator ·        Open (attach to) your project ·        Import metadata that defines the format of data stores your jobs will read from or write to: Manager ·        Design the job: Designer -        Define data extractions (reads) -        Define data flows -        Define data integration -        Define data transformations -        Define data constraints -        Define data loads (writes) -        Define data aggregations ·        Compile and debug the job: Designer ·        Run and monitor the job: Director
  8. All your work is done in a DataStage project. Before you can do anything, other than some general administration, you must open (attach to) a project. Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator. A project is associated with a directory. The project directory is used by DataStage to store your jobs and other DataStage objects and metadata. You must open (attach to) a project before you can do any work in it. Projects are self-contained. Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them. Multiple users can be working in the same project at the same time. However, DataStage will prevent multiple users from accessing the same job at the same time.
  9. Recall from module 1: In DataStage all development work is done within a project. Projects are created during installation and after installation using Administrator. Each project is associated with a directory. The directory stores the objects (jobs, metadata, custom routines, etc.) created in the project. Before you can work in a project you must attach to it (open it). You can set the default properties of a project using DataStage Administrator.
  10. The logon screen for Administrator does not provide the option to select a specific project (unlike the other DataStage clients).
  11. The Licensing Tab is used to change DataStage license information.
  12. Click Properties on the DataStage Administration window to open the Project Properties window. There are nine tabs. (The Mainframe tab is only enabled if your license supports mainframe jobs.) The default is the General tab. If you select the Enable job administration in Director box, you can perform some administrative functions in Director without opening Administrator. When a job is run in Director, events are logged describing the progress of the job. For example, events are logged when a job starts, when it stops, and when it aborts. The number of logged events can grow very large. The Auto-purge of job log box tab allows you to specify conditions for purging these events. You can limit the logged events either by number of days or number of job runs.
  13. Use this page to set user group permissions for accessing and using DataStage. All DataStage users must belong to a recognized user role before they can log on to DataStage. This helps to prevent unauthorized access to DataStage projects. There are three roles of DataStage user: · DataStage Developer, who has full access to all areas of a DataStage project. · DataStage Operator, who can run and manage released DataStage jobs. · <None>, who does not have permission to log on to DataStage. UNIX note: In UNIX, the groups displayed are defined in /etc/group.
  14. This tab is used to enable and disable server-side tracing. The default is for server-side tracing to be disabled. When you enable it, information about server activity is recorded for any clients that subsequently attach to the project. This information is written to trace files. Users with in-depth knowledge of the system software can use it to help identify the cause of a client problem. If tracing is enabled, users receive a warning message whenever they invoke a DataStage client. Warning: Tracing causes a lot of server system overhead. This should only be used to diagnose serious problems.
  15. On the Tunables tab, you can specify the sizes of the memory caches used when reading rows in hashed files and when writing rows to hashed files. Hashed files are mainly used for lookups and are discussed in a later module.
  16. You should enable OSH for viewing – OSH is generated when you compile a job.
  17. Metadata is “data about data” that describes the formats of sources and targets. This includes general format information such as whether the record columns are delimited and, if so, the delimiting character. It also includes the specific column definitions.
  18. DataStage Manager is a graphical tool for managing the contents of your DataStage project repository, which contains metadata and other DataStage components such as jobs and routines. The left pane contains the project tree. There are seven main branches, but you can create subfolders under each. Select a folder in the project tree to display its contents.
  19. Any set of DataStage objects, including whole projects, which are stored in the Manager Repository, can be exported to a file. This export file can then be imported back into DataStage. Import and export can be used for many purposes, including: ·        Backing up jobs and projects. ·        Maintaining different versions of a job or project. ·        Moving DataStage objects from one project to another. Just export the objects, move to the other project, then re-import them into the new project. ·        Sharing jobs and projects between developers. The export files, when zipped, are small and can be easily emailed from one developer to another.
  20. Click Export > DataStage Components in Manager to begin the export process. Any object in Manager can be exported to a file. Use this procedure to backup your work or to move DataStage objects from one project to another. Select the types of components to export. You can select either the whole project or select a portion of the objects in the project. Specify the name and path of the file to export to. By default, objects are exported to a text file in a special format. By default, the extension is dsx. Alternatively, you can export the objects to an XML document. The directory you export to is on the DataStage client, not the server.
  21. To import DataStage components, click Import > DataStage Components. Select the file to import. Click Import all to begin the import process or Import selected to view a list of the objects in the import file. You can import selected objects from the list. Select the Overwrite without query button to overwrite objects with the same name without warning.
  22. Table definitions define the formats of a variety of data files and tables. These definitions can then be used and reused in your jobs to specify the formats of data stores. For example, you can import the format and column definitions of the Customers.txt file. You can then load this into the sequential source stage of a job that extracts data from the Customers.txt file. You can load this same metadata into other stages that access data with the same format. In this sense the metadata is reusable. It can be used with any file or data store with the same format. If the column definitions are similar to what you need you can modify the definitions and save the table definition under a new name. You can import and define several different kinds of table definitions including: Sequential files and ODBC data sources.
  23. To start the import, click Import > Table Definitions > Sequential File Definitions. The Import Meta Data (Sequential) window is displayed. Select the directory containing the sequential files. The Files box is then populated with the files you can import. Select the file to import. Select or specify a category (folder) to import into. · The format is: <Category>\<Sub-category> · <Category> is the first-level sub-folder under Table Definitions. · <Sub-category> is (or becomes) a sub-folder under the type.
  24. In Manager, select the category (folder) that contains the table definition. Double-click the table definition to open the Table Definition window. Click the Columns tab to view and modify any column definitions. Select the Format tab to edit the file format specification.
  25. A job is an executable DataStage program. In DataStage, you can design and run jobs that perform many useful data integration tasks, including data extraction, data conversion, data aggregation, data loading, etc. DataStage jobs are: ·        Designed and built in Designer. ·        Scheduled, invoked, and monitored in Director. ·        Executed under the control of DataStage.
  26. In this module, you will go through the whole process with a simple job, except for the first bullet. In this module you will manually define the metadata.
  27. The appearance of the designer work space is configurable; the graphic shown here is only one example of how you might arrange components. In the right center is the Designer canvas, where you create stages and links. On the left is the Repository window, which displays the branches in Manager. Items in Manager, such as jobs and table definitions, can be dragged to the canvas area. Click View > Repository to display the Repository window.
  28. The tool palette contains icons that represent the components you can add to your job design. You can also install additional stages called plug-ins for special purposes.
  29. Several types of DataStage jobs: Server – not covered in this course. However, you can create server jobs, convert them to a container, then use this container in a parallel job. However, this has negative performance implications. Shared container (parallel or server) – contains reusable components that can be used by other jobs. Mainframe – DataStage 390, which generates Cobol code Parallel – this course will concentrate on parallel jobs. Job Sequence – used to create jobs that control execution of other jobs.
  30. The tools palette may be shown as a floating dock or placed along a border. Alternatively, it may be hidden and the developer may choose to pull needed stages from the repository onto the design work area.
  31. Meta data may be dragged from the repository and dropped on a link.
  32. Any required properties that are not completed will appear in red. You are defining the format of the data flowing out of the stage, that is, to the output link. Define the output link listed in the Output name box. You are defining the file from which the job will read. If the file doesn’t exist, you will get an error at run time. On the Format tab, you specify a format for the source file. You will be able to view its data using the View data button. Think of a link as like a pipe. What flows in one end flows out the other end (at the transformer stage).
  33. Defining a sequential target stage is similar to defining a sequential source stage. You are defining the format of the data flowing into the stage, that is, from the input links. Define each input link listed in the Input name box. You are defining the file the job will write to. If the file doesn’t exist, it will be created. Specify whether to overwrite or append the data in the Update action set of buttons. On the Format tab, you can specify a different format for the target file than you specified for the source file. If the target file doesn’t exist, you will not (of course!) be able to view its data until after the job runs. If you click the View data button, DataStage will return a “Failed to open …” error. The column definitions you defined in the source stage for a given (output) link will appear already defined in the target stage for the corresponding (input) link. Think of a link as like a pipe. What flows in one end flows out the other end. The format going in is the same as the format going out.
  34. In the Transformer stage you can specify: ·        Column mappings ·        Derivations ·        Constraints A column mapping maps an input column to an output column. Values are passed directly from the input column to the output column. Derivations calculate the values to go into output columns based on values in zero or more input columns. Constraints specify the conditions under which incoming rows will be written to output links.
  35. There are two: transformer and basic transformer. Both look the same but access different routines and functions. Notice the following elements of the transformer: The top, left pane displays the columns of the input links. The top, right pane displays the contents of the stage variables. The lower, right pane displays the contents of the output link. Unresolved column mapping will show the output in red. For now, ignore the Stage Variables window in the top, right pane. This will be discussed in a later module. The bottom area shows the column definitions (metadata) for the input and output links.
  36. Stage variables are used for a variety of purposes: Counters Temporary registers for derivations Controls for constraints
  37. Two versions of the annotation stage are available: Annotation Annotation description The difference will be evident on the following slides.
  38. You can type in whatever you want; the default text comes from the short description of the jobs properties you entered, if any. Add one or more Annotation stages to the canvas to document your job. An Annotation stage works like a text box with various formatting options. You can optionally show or hide the Annotation stages by pressing a button on the toolbar. There are two Annotation stages. The Description Annotation stage is discussed in a later slide.
  39. Before you can run your job, you must compile it. To compile it, click File > Compile or click the Compile button on the toolbar. The Compile Job window displays the status of the compile. A compile will generate OSH.
  40. If an error occurs: Click Show Error to identify the stage where the error occurred. This will highlight the stage in error. Click More to retrieve more information about the error. This can be lengthy for parallel jobs.
  41. As you know, you run your jobs in Director. You can open Director from within Designer by clicking Tools > Run Director. In a similar way, you can move between Director, Manager, and Designer. There are two methods for running a job: · Run it immediately. · Schedule it to run at a later time or date. To run a job immediately: · Select the job in the Job Status view. The job must have been compiled. · Click Job > Run Now or click the Run Now button in the toolbar. The Job Run Options window is displayed.
  42. This shows the Director Status view. To run a job, select it and then click Job > Run Now. Better yet: shift to the log view from the main Director screen, then click the green arrow to execute the job.
  43. The Job Run Options window is displayed when you click Job > Run Now. This window allows you to stop the job after: · A certain number of rows. · A certain number of warning messages. You can validate your job before you run it. Validation performs some checks that are necessary in order for your job to run successfully. These include: · Verifying that connections to data sources can be made. · Verifying that files can be opened. · Verifying that SQL statements used to select data can be prepared. Click Run to run the job after it is validated. The Status column displays the status of the job run.
  44. Click the Log button in the toolbar to view the job log. The job log records events that occur during the execution of a job. These events include control events, such as the starting, finishing, and aborting of a job; informational messages; warning messages; error messages; and program-generated messages.
  45. A typical DataStage workflow consists of: Setting up the project in Administrator Including metadata via Manager Building and assembling the job in Designer Executing and testing the job in Director.
  46. Change licensing, if appropriate. Timeout period should be set to large number or choose “do not timeout” option.
  47. Available functions: Add or delete projects. Set project defaults (properties button). Cleanup – perform repository functions. Command – perform queries against the repository.
  48. Recommendations: Check enable job administration in Director Check enable runtime column propagation May check auto purge of jobs to manage messages in director log
  49. You will see different environment variables depending on which category is selected.
  50. Reading OSH will be covered in a later module. Since DataStage Enterprise Edition writes OSH, you will want to check this option.
  51. To attach to the DataStage Manager client, one first enters through the logon screen. Logons can be either by DNS name or IP address. Once logged onto Manager, users can import meta data; export all or portions of the project, or import components from another project’s export. Functions: Backup project Export Import Import meta data Table definitions Sequential file definitions Can be imported from metabrokers Register/create new stages
  52. DataStage objects can now be pushed from DataStage to MetaStage.
  53. Job design process: Determine data flow Import supporting meta data Use designer workspace to create visual representation of job Define properties for all stages Compile Execute
  54. The DataStage GUI now generates OSH when a job is compiled. This OSH is then executed by the Enterprise Edition engine.
  55. Messages from previous runs are kept in a different color from the current run.
  56. In Designer, View > Customize palette. This window will allow you to move icons into your Favorites folder plus many other customization features.
  57. The row generator and peek stages are especially useful during development to generate test data and display data in the message log.
  58. Depending on the type of data, you can set values for each column in the row generator.
  59. The peek stage will display column values in a job's output messages log.
  60. EE takes advantage of the machine's hardware architecture -- this can be changed at runtime.
  61. DataStage Enterprise Edition can take advantage of multiple processing nodes to instantiate multiple instances of a DataStage job.
  62. You can describe an MPP as a bunch of connected SMPs.
  63. A typical SMP machine has multiple CPUs that share both disks and memory.
  64. The traditional data processing paradigm involves dropping data to disk many times throughout a processing run.
  65. On the other hand, the parallel processing paradigm rarely drops data to disk unless necessary for business reasons -- such as backup and recovery.
  66. Data may actually be partitioned in several ways -- range partitioning is only one example. We will explore others later.
  67. Pipelining and partitioning can be combined together to provide a powerful parallel processing paradigm.
  68. In addition, data can change partitioning from stage to stage. This can either happen explicitly at the desire of a programmer or be performed implicitly by the engine.
  69. Enterprise Edition deals with several different types of data: file sets and data sets -- both in persistent and non-persistent forms.
  70. The Enterprise Edition engine was derived from DataStage and Orchestrate.
  71. Enterprise Edition is architecturally neutral -- it can run on SMPs, clusters, and MPPs. The configuration file determines how Enterprise Edition will treat the hardware.
  72. Much of the parallel processing paradigm is hidden from the programmer -- they simply designate process flow as shown in the upper portion of this diagram. Enterprise Edition, using the definitions in that configuration file, will actually execute UNIX processes that are partitioned and parallelized.
  73. Partitioners and collectors work in opposite directions -- however, they frequently appear together in job designs.
  74. Partitioners and collectors have no stage nor icons of their own. They live on input links of stages running in parallel (resp. sequentially). Link markings indicate their presence. S----------------->S (no marking) S----(fan out)--->P (partitioner) P----(fan in)---->S (collector) P----(box)------->P (no reshuffling: partitioner using "SAME" method) P----(bow tie)--->P (reshuffling: partitioner using another method) Collectors = inverse partitioners; they recollect rows from partitions into a single input stream to a sequential stage. They are responsible for some surprising behavior: the default (Auto) is "eager" to output rows and typically causes non-determinism: row order may vary from run to run with identical input.
  75. Several stages handle sequential data. Each stage has both advantages and differences from the other stages that handle sequential data. Sequential data can come in a variety of types -- including both fixed length and variable length.
  76. The DataStage sequential stage writes OSH – specifically the import and export Orchestrate operators. Q: Why import data into an Orchestrate data set? A: Partitioning works only with data sets. You must use data sets to distribute data to the multiple processing nodes of a parallel system. Every Orchestrate program has to perform some type of import operation, from: a flat file, COBOL data, an RDBMS, or a SAS data set. This section describes how to get your data into Orchestrate. Also talk about getting your data back out. Some people will be happy to leave data in Orchestrate data sets, while others require their results in a different format.
  77. Behind each parallel stage is one or more Orchestrate operators. Import and Export are both operators that deal with sequential data.
  78. When data is imported, the import operator translates that data into the Enterprise Edition internal format. The export operator performs the reverse action.
  79. Both export and import operators are generated by the sequential stage -- which one you get depends on whether the sequential stage is used as source or target.
  80. These two processes must work together to correctly interpret data -- that is, to break a data string down into records and columns.
  81. Fields (columns) are defined by delimiters. Similarly, records are defined by terminating characters.
  82. The DataStage GUI allows you to determine properties that will be used to read and write sequential files.
  83. Source stage Multiple output links - however, note that one of the links is represented by a broken line. This is a reject link, not to be confused with a stream link or a reference link. Target One input link
  84. If multiple links are present you'll need to down-click to see each.
  85. If specified individually, you can make a list of files that are unrelated in name. If you select “read method” and choose file pattern, you effectively select an undetermined number of files.
  86. To use multiple readers on a sequential file, the file must contain fixed-length records.
  87. DSEE needs to know: how a file is divided into rows, and how a row is divided into columns. Column properties set on this tab are defaults for each column; they can be overridden at the column level (from the columns tab).
  88. The sequential stage can have a single reject link. This is typically used when you are writing to a file and provides a location where records that have failed to be written to a file for some reason can be sent. When you are reading files, you can use a reject link as a destination for rows that do not match the expected column definitions.
  89. Number of raw data files depends on: the configuration file – more on configuration files later.
  90. The descriptor file shows both a record's metadata and the file's location. The location is determined by the configuration file.
  91. File sets, while yielding faster access than simple text files, are not in the Enterprise Edition internal format.
  92. The lookup file set is similar to the file set but also contains information about the key columns. These keys will be used later in lookups.
  93. Data sets represent persistent data maintained in the internal format.
  94. Accessed from/to disk with the DataSet Stage. Two parts: Descriptor file User-specified name Contains table definition ("unformatted core" schema) Here is the icon of the DataSet Stage, used to access persistent datasets Descriptor file, e.g., "input.ds" contains: Paths of data files Metadata: unformatted table definition, no formats (unformatted schema) Config file used to store the data Data file(s) Contain the data itself System-generated long file names, to avoid naming conflicts.
  95. In both cases the answer is Yes.
  96. Both dsrecords and orchadmin are Unix command-line utilities. The DataStage Designer GUI provides a mechanism to view and manage data sets.
  97. The data set management screen is available from Manager, Designer, and Director.
  98. RDBMS access is relatively easy because Orchestrate extracts the schema definition for the imported data set. Little or no work is required from the user.
  99. All columns from the input link will be placed on the rejects link. Therefore, no column tab is available for the rejects link.
  100. The mapping tab will show all columns from the input link and the reference link (less the column used for key lookup).
  101. If the lookup results in a non-match and the action was set to continue, the output column will be null.
  102. Each stage has input and output interface schemas, a partitioner and business logic. Interface schemas define the names and data types of the required fields of the component’s input and output Data Sets. The component’s input interface schema requires that an input Data Set have the named fields and data types exactly compatible with those specified by the interface schema for the input Data Set to be accepted by the component. A component ignores any extra fields in a Data Set, which allows the component to be used with any data set that has at least the input interface schema of the component. This property makes it possible to add and delete fields from a relational database table or from the Orchestrate Data Set without having to rewrite code inside the component. In the example shown here, Component has an interface schema that requires three fields with named fields and data types as shown in the example. In this example, the output schema for the component is the same as the input schema. This does not always have to be the case. The partitioner is key to Orchestrate’s ability to deliver parallelism and unlimited scalability. We’ll discuss exactly how the partitioners work in a few slides, but here it’s important to point out that partitioners are an integral part of Orchestrate components.
  103. There is one more point to be made about the DSEE execution model. DSEE achieves parallelism in two ways. We have already talked about partitioning the records and running multiple instances of each component to speed up program execution. In addition to this partition parallelism, Orchestrate is also executing pipeline parallelism. As shown in the picture on the left, as the Orchestrate program is executing, a producer component is feeding records to a consumer component without first writing the records to disk. Orchestrate is pipelining the records forward in the flow as they are being processed by each component. This means that the consumer component is processing records fed to it by the producer component before the producer has finished processing all of the records. Orchestrate provides block buffering between components so that producers cannot produce records faster than consumers can consume those records. This pipelining of records eliminates the need to store intermediate results to disk, which can provide significant performance advantages, particularly when operating against large volumes of data.
  104. Operators are the basic functional units of an Orchestrate application. Operators read records from input data sets, perform actions on the input records, and write results to output data sets. An operator may perform an action as simple as copying records from an input data set to an output data set without modification. Alternatively, an operator may modify a record by adding, removing, or modifying fields during execution.
  105. Partitioners and collectors have no stage nor icons of their own. They live on input links of stages running in parallel (resp. sequentially). Link markings indicate their presence. S----------------->S (no marking) S----(fan out)--->P (partitioner) P----(fan in)---->S (collector) P----(box)------->P (no reshuffling: partitioner using "SAME" method) P----(bow tie)--->P (reshuffling: partitioner using another method) Collectors = inverse partitioners; they recollect rows from partitions into a single input stream to a sequential stage
  106. Link naming conventions are important because they identify the appropriate links in the stage properties screen shown above. The screen has four quadrants: the incoming data link (one only); the outgoing links (there can be several); metadata for the incoming link; and metadata for all outgoing links, which may have multiple tabs if there are multiple outgoing links. Note the constraints bar: double-clicking anywhere on it opens the screen for defining constraints for all outgoing links.
  107. If you perform a lookup from a lookup stage and choose the continue option for a failed lookup, you have the possibility of nulls entering your data flow.
  108. There is no longer a need to use shared containers to get Universe functionality on the parallel palette. The BASIC Transformer is slow because records need to be exported by the framework to Universe functions and then imported back.
  109. One of the nation's largest direct marketing outfits has been using this simple program in DS-EE (and its previous instantiations) for years. Householding yields enormous savings by avoiding mailing the same material (in particular expensive catalogs) to the same household.
  110. A stable sort will not rearrange records that are already in a properly sorted data set. If Stable is set to False, no prior ordering of records is guaranteed to be preserved by the sorting operation.
  111. The Framework concept of a port number is represented in the GUI as Primary/Reference.
  112. Join follows the RDBMS-style relational model: the Join and Load operations commute, just as they do in an RDBMS. Duplicate key values produce cross-products, matching entries are reusable, and there is no fail/reject/drop option for missed matches.
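For example (hypothetical data, not from the slides): if the left input has two rows with key 10 and the right input has three rows with key 10, an inner join produces 2 x 3 = 6 output rows for that key, exactly as an RDBMS join would.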
  113. Contrary to Join, Lookup and Merge deal with missing rows. Obviously, a missing row cannot be captured, since it is missing; the closest thing one can capture is the corresponding unmatched row. Lookup can capture unmatched rows from the primary input (source) on a reject link. That is why it has only one reject link (there is only one primary). We'll see that the reject option is exactly the opposite with the Merge stage.
  114. Contrary to Lookup, Merge captures unmatched secondary (update) rows. Since there may be several update links, there may be several reject links.
  115. This table contains everything one needs to know to use the three stages.
  116. WARNING! Here, Hash has nothing to do with the Hash partitioner: it means that one hash table per group must be held in RAM. Likewise, Sort has nothing to do with the Sort stage: it simply means the stage expects sorted input.
  117. The hardware that makes up your system partially determines configuration. For example, applications with large memory requirements, such as sort operations, are best assigned to machines with a lot of memory. Applications that will access an RDBMS must run on its server nodes; operators using other proprietary software, such as SAS or SyncSort, must run on nodes with licenses for that software.
  118. Set of reserved node pool names: DB2, Oracle, Informix, SAS, Sort, SyncSort.
  119. For a single-node system, the node name is usually set to the value returned by the UNIX command uname -n. The fastname attribute is the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET. The fast name is the physical node name that operators use to open connections for high-volume data transfers; typically this is the principal node name as returned by uname -n.
  120. Recommendations: Each logical node defined in the configuration file that will run sorting operations should have its own sort disk. Each logical node's sorting disk should be a distinct disk drive or, if it is shared among nodes, a striped disk. In large sorting operations, each node that performs sorting should have multiple disks, where a sort disk is a scratch disk available for sorting that resides in either the sort or default disk pool.
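Pulling the node name, fastname, and scratch-disk attributes together, a minimal configuration file for a single-node system might look roughly like the sketch below; the host name, paths, and pool names are assumptions for illustration:
    {
      node "node1"
      {
        fastname "devhost"
        pools ""
        resource disk "/data/ds" {pools ""}
        resource scratchdisk "/scratch/ds" {pools "" "sort"}
      }
    }
A parallel configuration simply declares additional node entries in the same file; APT_CONFIG_FILE (discussed later) tells the job which file to use.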
  121. In this instance, since a sparse lookup is viewed as the bottleneck, the stage has been set to execute on multiple nodes.
  122. Example: wrapping the Unix ls command. Running ls /opt/AdvTrain/dnich would yield a list of files and subdirectories. The wrapper thus comprises the command and a parameter that contains a disk location.
  123. The Unix ls command can take several arguments, but we will use it in its simplest form: ls location, where location will be passed into the stage through a job parameter.
  124. The Interfaces > input and output describe the metadata for how you will communicate with the wrapped application.
  125. Answer: You must first EXIT the DS-EE environment to access the vanilla Unix environment. Then you must reenter the DS-EE environment.
  126. The four tabs are Definitions, Pre-Loop, Per-Record, and Post-Loop. The main action is in Per-Record. Definitions is used to declare and initialize variables; Pre-Loop holds code to be executed before the first row is processed; Post-Loop holds code to be executed after the last row is processed.
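As a minimal sketch of how the tabs work together (the variable name count and the output column index are assumptions used for illustration, chosen to match the quiz discussed a few slides later):
    Definitions:  int count;
    Pre-Loop:     count = 0;
    Per-Record:   index = count++;   // index is an output column; each row receives the next sequence number
    Post-Loop:    (nothing needed in this example)
Because the Per-Record code runs once for every input row, count increments across rows and index carries a running row number into the output.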
  127. This is the Output page. The Input page is the same, except it has an "Auto Read" column instead of the "Auto Write" column. The input/output interface TDs must be prepared in advance and put in the repository.
  128. The role of Transfer will be made clearer soon with examples.
  129. Only the column(s) explicitly listed in the output TD survive.
  130. All the columns in the input link are transferred, irrespective of what the Input/Output TDs say.
  131. ANSWER TO QUIZ: Replacing index = count++; with index++; would result in index = 1 throughout. See the bottom bullet in the left column.
  132. To view documentation on each of these properties, open a stage > Input or Output > Format. Now hover your cursor over the property in question and help text will appear.
  133. The format of each line describing a column is: column_name:[nullability]datatype
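For example, with hypothetical column names, schema lines following this format might read:
    custid:int32;
    name:nullable string[20];
where nullable is the optional nullability keyword and int32 and string[20] are the data types.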
  134. Raw: a collection of untyped bytes. Vector: an elemental array. Subrecord: a record within a record (elements of a group level). Tagged: a column linked to another column that defines its data type.
  135. What is runtime column propagation? Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can define only the columns you are interested in using in a job and ask DataStage to propagate the other columns through the various stages, so those columns can be extracted from the data source and end up in your data target without explicitly being operated on in between. Sequential files, unlike most other data sources, do not have inherent column definitions, so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema that describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. The stages that require a schema file are: Sequential File, File Set, External Source, External Target, Column Import, and Column Export.
  136. Modify operators can add or change columns in a data flow.
  137. Environment variables fall into broad categories, listed in the left pane. We'll see these categories one by one. All environment values listed in the ADMINISTRATOR are the project-wide defaults; they can be modified in DESIGNER per job, and again in DIRECTOR per run. The default values are reasonable ones, so there is no need for the beginning user to modify them, or even to know much about them, with one possible exception: APT_CONFIG_FILE (see next slide).
  138. Highlighted: APT_CONFIG_FILE, which contains the path (on the server) of the active configuration file. The main aspect of a given configuration file is the number of nodes it declares. In the labs we used two files: one with one node declared, for use in sequential execution, and one with two nodes declared, for use in parallel execution.
  139. The correct settings for these should be set at install. If you need to modify them, first check with your DBA.
  140. These are for the user to play with. They are easy: they take only TRUE/FALSE values and control the verbosity of the log file; the defaults are set for minimal verbosity. The top one, APT_DUMP_SCORE, is an old favorite: it tracks data sets, nodes, partitions, and combinations (all to be discussed soon). APT_RECORD_COUNTS helps you detect load imbalance. APT_PRINT_SCHEMAS shows the textual representation of the unformatted metadata at all stages. Online descriptions are available with the "Help" button.
  141. You need to have these right to use the Transformer and the Custom stages. Only these stages invoke the C++ compiler. The correct values are listed in the Release Notes.
  142. Project-wide environment values set in ADMINISTRATOR can be modified on a per-job basis in DESIGNER's Job Properties, and on a per-run basis in DIRECTOR. This provides great flexibility.