Introduction
To DataStage
What is DataStage?
• Design jobs for Extraction, Transformation, and Loading (ETL)
• Ideal tool for data integration projects – such as, data warehouses,
data marts, and system migrations
• Import, export, create, and manage metadata for use within jobs
• Schedule, run, and monitor jobs all within DataStage
• Administer your DataStage development and execution
environments
DataStage Server and Clients
DataStage Administrator
Client Logon
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
• Define global and project properties in Administrator
• Import meta data into Manager
• Build job in Designer
• Compile the job in Designer
• Validate, run, and monitor in Director
DataStage Projects
Project Properties
• Projects can be created and deleted in Administrator
• Project properties and defaults are set in Administrator
Setting Project Properties
• To set project properties, log onto Administrator, select your project, and then
click “Properties”
Licensing Tab
Projects General Tab
Environment Variables
Permissions Tab
Tracing Tab
Tunables Tab
Parallel Tab
What Is Metadata?
Data
TargetSource Transform
Meta Data
Repository
Meta
Data
Meta
Data
DataStage Manager
Import and Export
• Any object in Manager can be exported to a file
• Can export whole projects
• Use for backup
• Sometimes used for version control
• Can be used to move DataStage objects from one project to another
• Use to share DataStage jobs and projects with other developers
Export Procedure
• In Manager, click “Export>DataStage Components”
• Select DataStage objects for export
• Specify type of export: DSX, XML
• Specify file path on client machine
Exporting DataStage Objects
Exporting DataStage Objects
Import Procedure
• In Manager, click “Import>DataStage Components”
• Select DataStage objects for import
Importing DataStage Objects
Import Options
Metadata Import
• Import format and column definitions from sequential files
• Import relational table column definitions
• Imported as “Table Definitions”
• Table definitions can be loaded into job stages
Sequential File Import Procedure
• In Manager, click Import>Table Definitions>Sequential File
Definitions
• Select directory containing sequential file and then the file
• Select Manager category
• Examine format and column definitions and edit as necessary
Manager Table Definition
Importing Sequential Metadata
What Is a Job?
• Executable DataStage program
• Created in DataStage Designer, but can use components from
Manager
• Built using a graphical user interface
• Compiles into Orchestrate shell language (OSH)
Job Development Overview
• In Manager, import metadata defining sources and targets
• In Designer, add stages defining data extractions and loads
• Add Transformers and other stages to define data transformations
• Add links defining the flow of data from sources to targets
• Compile the job
• In Director, validate, run, and monitor your job
Designer Work Area
Designer Toolbar
Provides quick access to the main functions of Designer
Job properties
Compile
Show/hide metadata markers
Tools Palette
Adding Stages and Links
• Stages can be dragged from the tools palette or from the stage type
branch of the repository view
• Links can be drawn from the tools palette or by right clicking and
dragging from one stage to another
Designer - Create New Job
Drag Stages and Links Using Palette
Assign Meta Data
Editing a Sequential Source Stage
Editing a Sequential Target
Transformer Stage
• Used to define constraints, derivations, and column mappings
• A column mapping maps an input column to an output column
• In this module we will just define column mappings (no derivations)
Transformer Stage Elements
Create Column Mappings
Creating Stage Variables
Result
Adding Job Parameters
• Makes the job more flexible
• Parameters can be:
– Used in constraints and derivations
– Used in directory and file names
• Parameter values are determined at run time
Adding Job Documentation
• Job Properties
– Short and long descriptions
– Shows in Manager
• Annotation stage
– Is a stage on the tool palette
– Shows on the job GUI (work area)
Job Properties Documentation
Annotation Stage on the Palette
Annotation Stage Properties
Final Job Work Area with Documentation
Compiling a Job
Errors or Successful Message
Prerequisite to Job Execution
Result from Designer compile
DataStage Director
• Can schedule, validate, and run jobs
• Can be invoked from DataStage Manager or Designer
– Tools > Run Director
Running Your Job
Run Options – Parameters and Limits
Director Log View
Message Details are Available
Other Director Functions
• Schedule job to run on a particular date/time
• Clear job log
• Set Director options
– Row limits
– Abort after x warnings
Process Flow
• Administrator – add/delete projects, set defaults
• Manager – import meta data, backup projects
• Designer – assemble jobs, compile, and execute
• Director – execute jobs, examine job run logs
Administrator – Licensing and Timeout
Administrator – Project Creation/Removal
Functions
specific to a
project.
Administrator – Project Properties
RCP for parallel
jobs should be
enabled
Variables for
parallel
processing
Administrator – Environment Variables
Variables are
category
specific
OSH is what is
run by the EE
Framework
DataStage Manager
Export Objects to MetaStage
Push meta data
to MetaStage
Designer Workspace
Can execute the
job from Designer
DataStage Generated OSH
The EE
Framework
runs OSH
Director – Executing Jobs
Messages from
previous run in
different color
Stages
Can now customize the Designer’s palette
Select desired stages and drag
to favorites
Popular Developer Stages
Row
generator
Peek
Row Generator
• Can build test data
Repeatable
property
Edit row in
column tab
Peek
• Displays field values
– Will be displayed in job log or sent to a file
– Skip records option
– Can control number of records to be displayed
• Can be used as stub stage for iterative development (more later)
Why EE is so Effective
• Parallel processing paradigm
– More hardware, faster processing
– Level of parallelization is determined by a configuration file read
at runtime
• Emphasis on memory
– Data read into memory and lookups performed like hash table
Parallel Processing Systems
• DataStage EE enables parallel processing = executing your application on multiple
CPUs simultaneously
– If you add more resources (CPUs, RAM, and disks), you increase system performance
• Example system containing 6 CPUs (or processing nodes) and disks
Scaleable Systems: Examples
Three main types of scalable systems
• Symmetric Multiprocessors (SMP): shared memory and disk
• Clusters: UNIX systems connected via networks
• MPP: Massively Parallel Processing
SMP: Shared Everything
• Multiple CPUs with a single operating system
• Programs communicate using shared memory
• All CPUs share system resources
(OS, memory with single linear address space, disks, I/O)
When used with Enterprise Edition:
• Data transport uses shared memory
• Simplified startup
Source
Transform
Target
Data
Warehouse
Operational Data
Archived Data
Clean Load
Disk Disk Disk
Traditional approach to batch processing:
• Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
• a 10 GB stream leads to 70 GB of I/O
• processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
Traditional Batch Processing
Data Pipelining
• Transform, clean and load processes are executing simultaneously on the same processor
• rows are moving forward through the flow
Source
Target
Data
Warehouse
Operational Data
Transform
Archived Data
Clean Load
• Start a downstream process while an upstream process is still running.
• This eliminates intermediate storing to disk, which is critical for big data.
• This also keeps the processors busy.
• Still has limits on scalability
Pipeline Multiprocessing
Data Partitioning
Transform
Source
Data
Transform
Transform
Transform
Node 1
Node 2
Node 3
Node 4
A-F
G- M
N-T
U-Z
• Break up big data into partitions
• Run one partition on each processor
• 4X faster on 4 processors;
with data big enough:
100X faster on 100 processors
• This is exactly how the parallel
databases work!
• Data Partitioning requires the
same transform to all partitions:
Aaron Abbott and Zygmund Zorn
undergo the same transform
Partition Parallelism
Putting It All Together: Parallel Dataflow
Source
Target
Transform Clean Load
Pipelining
Partitioning
Source
Data
Data
Warehouse
Combining Parallelism Types
Putting It All Together: Parallel Dataflow
with Repartitioning on-the-fly
Without Landing To Disk!
Source
Target
Transform Clean Load
Pipelining
Source
Data Data
Warehouse
Partitioning
Repartitioning
A-F
G- M
N-T
U-Z
Customer last name Customer zip code Credit card number
Repartitioning
Repartitioning
EE Program Elements
• Dataset: uniform set of rows in the Framework's internal representation
- Three flavors:
1. file sets *.fs : stored on multiple Unix files as flat files
2. persistent: *.ds : stored on multiple Unix files in Framework format
read and written using the DataSet Stage
3. virtual: *.v : links, in Framework format, NOT stored on disk
- The Framework processes only datasets—hence possible need for Import
- Different datasets typically have different schemas
- Convention: "dataset" = Framework data set.
• Partition: subset of rows in a dataset earmarked for processing by the same node
(virtual CPU, declared in a configuration file).
- All the partitions of a dataset follow the same schema: that of the dataset
Orchestrate Program
(sequential dataflow)
Orchestrate Application Framework and Runtime System
Import
Clean 1
Clean 2
Merge Analyze
Configuration File
Centralized Error Handling and Event Logging
Parallel access to data in files
Parallel access to data in RDBMS
Inter-node communications
Parallel pipelining
Parallelization of operations
Import
Clean1
Merge Analyze
Clean2
Relational Data
Performance Visualization
Flat Files
Orchestrate Framework:
Provides application scalability
DataStage Enterprise Edition:
Best-of-breed scalable data integration platform
No limitations on data volumes or throughput
DataStage EE Architecture
DataStage:
Provides data integration platform
Introduction to DataStage EE
• DSEE:
– Automatically scales to fit the machine
– Handles data flow among multiple CPU’s and disks
• With DSEE you can:
– Create applications for SMP’s, clusters and MPP’s…
Enterprise Edition is architecture-neutral
– Access relational databases in parallel
– Execute external applications in parallel
– Store data across multiple disks and nodes
Developer assembles data flow using the Designer
Job Design VS. Execution
…and gets: parallel access, propagation, transformation, and load.
The design is good for 1 node, 4 nodes,
or N nodes. To change # nodes, just swap configuration file.
No need to modify or recompile the design
Partitioners and Collectors
• Partitioners distribute rows into partitions
– implement data-partition parallelism
• Collectors = inverse partitioners
• Live on input links of stages running
– in parallel (partitioners)
– sequentially (collectors)
• Use a choice of methods
Example Partitioning Icons
partitioner
Types of Sequential Data Stages
• Sequential
– Fixed or variable length
• File Set
• Lookup File Set
• Data Set
Sequential Stage Introduction
• The EE Framework processes only datasets
• For files other than datasets, such as flat files, Enterprise Edition
must perform import and export operations – this is performed by
import and export OSH operators generated by Sequential or FileSet
stages
• During import or export DataStage performs format translations –
into, or out of, the EE internal format
• Data is described to the Framework in a schema
How the Sequential Stage Works
• Generates Import/Export operators, depending on whether stage is
source or target
• Performs direct C++ file I/O streams
Using the Sequential File Stage
Importing/Exporting Data
Both import and export of general files (text, binary) are performed by the
SequentialFile Stage.
– Data import: into EE internal format
– Data export: out of EE internal format
Working With Flat Files
• Sequential File Stage
– Normally will execute in sequential mode
– Can be parallel if reading multiple files (file pattern option)
– Can use multiple readers within a node
– DSEE needs to know
• How file is divided into rows
• How row is divided into columns
Processes Needed to Import Data
• Recordization
– Divides input stream into records
– Set on the format tab
• Columnization
– Divides the record into columns
– Default set on the format tab but can be overridden on the
columns tab
– Can be “incomplete” if using a schema or not even specified in
the stage if using RCP
File Format Example
Field1 , Field1 , … , Last field nl        (Final Delimiter = end)
Field1 , Field1 , … , Last field , nl      (Final Delimiter = comma)
Field Delimiter = comma
Record delimiter = nl
Sequential File Stage
• To set the properties, use stage editor
– Page (general, input/output)
– Tabs (format, columns)
• Sequential stage link rules
– One input link
– One output link (except for reject link definition)
– One reject link
• Will reject any records not matching meta data in the
column definitions
Job Design Using Sequential Stages
Stage categories
General Tab – Sequential Source
Multiple output
links
Show records
Properties – Multiple Files
Click to add more files
having the same meta
data.
Properties - Multiple Readers
Multiple readers option
allows you to set number
of readers
Format Tab
File into records
Record into columns
Read Methods
Reject Link
• Reject mode = output
• Source
– All records not matching the meta data (the column definitions)
• Target
– All records that are rejected for any reason
• Meta data – one column, datatype = raw
File Set Stage
• Can read or write file sets
• Files suffixed by .fs
• File set consists of:
1. Descriptor file – contains location of raw data files + meta data
2. Individual raw data files
• Can be processed in parallel
File Set Stage Example
Descriptor file
File Set Usage
• Why use a file set?
– 2G limit on some file systems
– Need to distribute data among nodes to prevent overruns
– If used in parallel, runs faster than a sequential file
Lookup File Set Stage
• Can create file sets
• Usually used in conjunction with Lookup stages
Lookup File Set > Properties
Key column
specified
Key column
dropped in
descriptor file
Data Set
• Operating system (Framework) file
• Suffixed by .ds
• Referred to by a control file
• Managed by Data Set Management utility from GUI (Manager,
Designer, Director)
• Represents persistent data
• Key to good performance in set of linked jobs
Persistent Datasets
• Accessed from/to disk with DataSet Stage.
• Two parts:
– Descriptor file:
• contains metadata, data location, but NOT the data itself
– Data file(s)
• contain the data
• multiple Unix files (one per node), accessible in parallel
input.ds
node1:/local/disk1/…
node2:/local/disk2/…
record (
partno: int32;
description: string;
)
Data Set Stage
Is the data partitioned?
Engine Data Translation
• Occurs on import
– From sequential files or file sets
– From RDBMS
• Occurs on export
– From datasets to file sets or sequential files
– From datasets to RDBMS
• Engine is most efficient when processing internally formatted records
(i.e., data contained in datasets)
Managing DataSets
• GUI (Manager, Designer, Director) – tools > data set management
• Alternative methods
– Orchadmin
• Unix command line utility
• List records
• Remove data sets (will remove all components)
– Dsrecords
• Lists number of records in a dataset
Data Set Management
Display data
Schema
Data Set Management From Unix
• Alternative method of managing file sets and data sets
– Dsrecords
• Gives record count
– Unix command-line utility
– $ dsrecords ds_name
e.g., $ dsrecords myDS.ds
156999 records
– Orchadmin
• Manages EE persistent data sets
– Unix command-line utility
e.g., $ orchadmin rm myDataSet.ds
Job Presentation
Document using the
annotation stage
Job Properties Documentation
Description shows in DS Manager
and MetaStage
Organize jobs into
categories
Naming conventions
• Stages named after the
– Data they access
– Function they perform
– DO NOT leave defaulted stage names like Sequential_File_0
• Links named for the data they carry
– DO NOT leave defaulted link names like DSLink3
Stage and Link Names
Stages and links
renamed to data
they handle
Create Reusable Job Components
• Use Enterprise Edition shared containers
when feasible
Container
Use Iterative Job Design
• Use copy or peek stage as stub
• Test job in phases – small first, then increasing in complexity
• Use Peek stage to examine records
Copy or Peek Stage Stub
Copy stage
Transformer Stage Techniques
• Suggestions -
– Always include reject link.
– Always test for null value before using a column in a function.
– Try to use RCP and only map columns that have a derivation other than a
copy. More on RCP later.
– Be aware of Column and Stage variable Data Types.
• Often user does not pay attention to Stage Variable type.
– Avoid type conversions.
• Try to maintain the data type as imported.
The Copy Stage
With 1 link in, 1 link out:
the Copy Stage is the ultimate "no-op" (place-holder):
– Partitioners
– Sort / Remove Duplicates
– Rename, Drop column
… can be inserted on:
– input link (Partitioning): Partitioners, Sort, Remove Duplicates)
– output link (Mapping page): Rename, Drop.
Sometimes replace the transformer:
– Rename,
– Drop,
– Implicit type Conversions
Developing Jobs
1. Keep it simple
• Jobs with many stages are hard to debug and maintain.
2. Start small and build to the final solution
• Use view data, copy, and peek.
• Start from the source and work out.
• Develop with a 1 node configuration file.
3. Solve the business problem before the performance problem.
• Don’t worry too much about partitioning until the sequential flow works as
expected.
4. If you have to write to disk, use a persistent data set.
Final Result
Good Things to Have in each Job
• Use job parameters
• Some helpful environmental variables to add to job parameters
– $APT_DUMP_SCORE
• Report OSH to message log
– $APT_CONFIG_FILE
• Establishes runtime parameters to the EE engine; e.g., degree of
parallelization
Setting Job Parameters
Click to add
environment
variables
DUMP SCORE Output
Double-click
Mapping: Node --> partition
Setting APT_DUMP_SCORE yields:
Partitioner and Collector
Use Multiple Configuration Files
• Make a set for 1X, 2X,….
• Use different ones for test versus production
• Include as a parameter in each job
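As a hedged sketch (the directory and the file names 1node.apt and 4node.apt are illustrative assumptions, not shipped defaults), switching configurations from the command line is just a matter of re-pointing $APT_CONFIG_FILE; inside jobs the same value is usually supplied through the $APT_CONFIG_FILE job parameter:
$ export APT_CONFIG_FILE=/usr/dsadm/Ascential/DataStage/Configurations/1node.apt   # testing
$ export APT_CONFIG_FILE=/usr/dsadm/Ascential/DataStage/Configurations/4node.apt   # production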
Parallel Database Connectivity
Traditional Client-Server vs. Enterprise Edition
Traditional Client-Server:
• Clients (Sort, Load, etc.) each connect to the parallel RDBMS
• Only the RDBMS is running in parallel
• Each application has only one connection
Enterprise Edition:
• Parallel server runs APPLICATIONS
• Application has parallel connections to RDBMS
RDBMS Access
Supported Databases
Enterprise Edition provides high performance /
scalable interfaces for:
• DB2
• Informix
• Oracle
• Teradata
RDBMS Access
• Automatically convert RDBMS table layouts to/from Enterprise Edition
Table Definitions
• RDBMS nulls converted to/from nullable field values
• Support for standard SQL syntax for specifying:
– field list for SELECT statement
– filter for WHERE clause
• Can write an explicit SQL query to access RDBMS
• EE supplies additional information in the SQL query
RDBMS Stages
• DB2/UDB Enterprise
• Informix Enterprise
• Oracle Enterprise
• Teradata Enterprise
RDBMS Usage
• As a source
– Extract data from table (stream link)
– Extract as table, generated SQL, or user-defined SQL
– User-defined can perform joins, access views
– Lookup (reference link)
– Normal lookup is memory-based (all table data read into memory)
– Can perform one lookup at a time in DBMS (sparse option)
– Continue/drop/fail options
• As a target
– Inserts
– Upserts (Inserts and updates)
– Loader
RDBMS Source – Stream Link
Stream link
DBMS Source - User-defined SQL
Columns in SQL
statement must
match the meta data
in columns tab
DBMS Source – Reference Link
Reject link
Lookup Reject Link
“Output” option
automatically creates
the reject link
Null Handling
• Must handle null condition if lookup record is not found and
“continue” option is chosen
• Can be done in a transformer stage
Lookup Stage Mapping
Link name
Lookup Stage Properties
Reference link
Must have same column name in input and reference links. You will get the results
of the lookup in the output column.
DBMS as a Target
DBMS As Target
• Write Methods
– Delete
– Load
– Upsert
– Write (DB2)
• Write mode for load method
– Truncate
– Create
– Replace
– Append
Target Properties
Upsert mode determines options
Generated code can be copied
Checking for Nulls
• Use Transformer stage to test for fields with null values (Use IsNull
functions)
• In Transformer, can reject or load default value
Concepts
• The Enterprise Edition Platform
– Script language - OSH (generated by DataStage Parallel Canvas, and run by
DataStage Director)
– Communication - conductor, section leaders, players
– Configuration files (only one active at a time, describes H/W)
– Meta data - schemas/tables
– Schema propagation - RCP
– EE extensibility - Buildop, Wrapper
– Datasets (data in Framework's internal representation)
Output Data Set schema:
prov_num:int16;
member_num:int8;
custid:int32;
Input Data Set schema:
prov_num:int16;
member_num:int8;
custid:int32;
EE Stages Involve A Series Of Processing Steps
Input
Interface
Partitioner
Business
Logic
Output
Interface
EE Stage
• Piece of Application
Logic Running Against
Individual Records
• Parallel or Sequential
DS-EE Stage Elements
• EE Delivers Parallelism in
Two Ways
– Pipeline
– Partition
• Block Buffering Between
Components
– Eliminates Need for Program
Load Balancing
– Maintains Orderly Data Flow
Dual Parallelism Eliminates Bottlenecks!
Pipeline
Partition
Producer
Consumer
DSEE Stage Execution
Stages Control Partition Parallelism
• Execution Mode (sequential/parallel) is controlled by Stage
– default = parallel for most Ascential-supplied Stages
– Developer can override default mode
– Parallel Stage inserts the default partitioner (Auto) on its input links
– Sequential Stage inserts the default collector (Auto) on its input links
– Developer can override default
• execution mode (parallel/sequential) of Stage > Advanced tab
• choice of partitioner/collector on Input > Partitioning tab
How Parallel Is It?
• Degree of parallelism is determined by the configuration
file
– Total number of logical nodes in default pool, or a
subset if using "constraints".
• Constraints are assigned to specific pools as defined in
configuration file and can be referenced in the stage
OSH
• DataStage EE GUI generates OSH scripts
– Ability to view OSH turned on in Administrator
– OSH can be viewed in Designer using job properties
• The Framework executes OSH
• What is OSH?
– Orchestrate shell
– Has a UNIX command-line interface
OSH Script
• An osh script is a quoted string which specifies:
– The operators and connections of a single Orchestrate step
– In its simplest form, it is:
osh “op < in.ds > out.ds”
• Where:
– op is an Orchestrate operator
– in.ds is the input data set
– out.ds is the output data set
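As a hedged illustration (op1 and op2 are generic placeholder operator names, not taken from any specific job), two operators in the same step can be connected with a Unix-style pipe; the pipe corresponds to a virtual dataset (a link) carrying rows from one operator to the next without landing to disk:
osh “op1 < in.ds | op2 > out.ds”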
OSH Operators
• OSH Operator is an instance of a C++ class inheriting from
APT_Operator
• Developers can create new operators
• Examples of existing operators:
– Import
– Export
– RemoveDups
Enable Visible OSH in Administrator
Will be enabled
for all projects
View OSH in Designer
Schema
Operator
• Operators
• Datasets: set of rows processed by Framework
– Orchestrate data sets:
– persistent (terminal) *.ds, and
– virtual (internal) *.v.
– Also: flat “file sets” *.fs
• Schema: data description (metadata) for datasets and links.
Elements of a Framework Program
• Consist of Partitioned Data and Schema
• Can be Persistent (*.ds) or Virtual (*.v, Link)
• Overcome 2 GB File Limit
=
What you program: What gets processed:
.
.
.
Multiple files per partition
Each file up to 2GBytes (or larger)
Operator
A
Operator
A
Operator
A
Operator
A
Node 1 Node 2 Node 3 Node 4
data files
of x.ds
$ osh “operator_A > x.ds“
GUI
OSH
Datasets
What gets generated:
Operator
A
Computing Architectures: Definition
Clusters and MPP Systems
Shared Disk Shared Nothing
Uniprocessor
Dedicated Disk
Shared Memory
SMP System
(Symmetric Multiprocessor)
DiskDisk
CPU
Memory
CPU CPU CPU CPU
CPU
Disk
Memory
CPU
Disk
Memory
CPU
Disk
Memory
CPU
Disk
Memory
Job Execution: Orchestrate
Processing Node
Processing Node
• Conductor - initial DS/EE process
– Step Composer
– Creates Section Leader processes (one per node)
– Consolidates messages, outputs them
– Manages orderly shutdown.
• Section Leader
– Forks Player processes (one per Stage)
– Manages up/down communication.
• Players
– The actual processes associated with Stages
– Combined players: one process only
– Send stderr to SL
– Establish connections to other players for data flow
Conductor Node
C
SL
PP P
SL
PP P
Working with Configuration Files
• You can easily switch between config files:
• '1-node' file - for sequential execution, lighter reports—handy for testing
• 'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism
• 'BigN-nodes' file - aims at full data-partitioned parallelism
• Only one file is active while a step is running
• The Framework queries (first) the environment variable:
$APT_CONFIG_FILE
# nodes declared in the config file need not match # CPUs
• Same configuration file can be used in development and target
machines
Scheduling Nodes, Processes, and CPUs
• DS/EE does not:
– know how many CPUs are available
– schedule
• Who knows what?
• Who does what?
– DS/EE creates (Nodes*Ops) Unix processes
– The O/S schedules these processes on the CPUs
Nodes = # logical nodes declared in config. file
Ops = # ops. (approx. # blue boxes in V.O.)
Processes = # Unix processes
CPUs = # available CPUs
              Nodes   Ops   Processes     CPUs
User            Y      N
Orchestrate     Y      Y    Nodes * Ops    N
O/S                             "          Y
For example, with 4 nodes declared and 3 operators in the flow, Orchestrate creates 4 * 3 = 12 Unix processes, which the O/S then schedules across the available CPUs.
Parallel to parallel flow may incur reshuffling:
Records may jump between nodes
node 1
node 2
partitioner
Re-Partitioning
Partitioning Methods
• Auto
• Hash
• Entire
• Range
• Range Map
• Collectors combine partitions of a dataset into a single
input stream to a sequential Stage
data partitions
collector
sequential Stage
...
–Collectors do NOT synchronize data
Collectors
Partitioning and Repartitioning Are Visible On Job
Design
Partitioning and Collecting Icons
Partitioner Collector
Reading Messages in Director
• Set APT_DUMP_SCORE to true
• Can be specified as job parameter
• Messages sent to Director log
• If set, parallel job will produce a report showing the operators,
processes, and datasets in the running job
Messages With APT_DUMP_SCORE = True
Transformed Data
• Transformed data is:
– Outgoing column is a derivation that may, or may not, include incoming
fields or parts of incoming fields
– May be comprised of system variables
• Frequently uses functions performed on something (i.e., incoming columns)
– Divided into categories, e.g.:
• Date and time
• Mathematical
• Logical
• Null handling
• More
Stages Review
• Stages that can transform data
– Transformer
• Parallel
• Basic (from Parallel palette)
– Aggregator (discussed in later module)
• Sample stages that do not transform data
– Sequential
– FileSet
– DataSet
– DBMS
Transformer Stage Functions
• Control data flow
• Create derivations
Flow Control
• Separate records flow down links based on data condition –
specified in Transformer stage constraints
• Transformer stage can filter records
• Other stages can filter records but do not exhibit advanced flow
control
– Sequential can send bad records down reject link
– Lookup can reject records based on lookup failure
– Filter can select records based on data value
Rejecting Data
• Reject option on sequential stage
– Data does not agree with meta data
– Output consists of one column with binary data type
• Reject links (from Lookup stage) result from the drop option of the property “If
Not Found”
– Lookup “failed”
– All columns on reject link (no column mapping option)
• Reject constraints are controlled from the constraint editor of the transformer
– Can control column mapping
– Use the “Other/Log” checkbox
Rejecting Data Example
“If Not Found” property
Constraint – Other/log option
Property Reject Mode = Output
Transformer Stage Properties
Transformer Stage Variables
• First of transformer stage entities to execute
• Execute in order from top to bottom
– Can write a program by using one stage variable to point to the
results of a previous stage variable
• Multi-purpose
– Counters
– Hold values for previous rows to make comparison
– Hold derivations to be used in multiple field derivations
– Can be used to control execution of constraints
Stage Variables
Show/Hide
button
Transforming Data
• Derivations
– Using expressions
– Using functions
• Date/time
• Transformer Stage Issues
– Sometimes require sorting before the transformer stage – e.g.,
using a stage variable as an accumulator and needing to break on
change of a column value
• Checking for nulls
Checking for Nulls
• Nulls can get introduced into the dataflow because of failed lookups
and the way in which you chose to handle this condition
• Can be handled in constraints, derivations, stage variables, or a
combination of these
Transformer - Handling Rejects
Constraint Rejects
– All expressions are false and
reject row is checked
Transformer: Execution Order
• Derivations in stage variables are executed first
• Constraints are executed before derivations
• Column derivations in earlier links are executed before later links
• Derivations in higher columns are executed before lower columns
Parallel Palette - Two Transformers
• All > Processing >
• Transformer
• Is the non-Universe transformer
• Has a specific set of functions
• No DS routines available
• Parallel > Processing
• Basic Transformer
• Makes server style transforms
available on the parallel palette
• Can use DS routines
• Program in Basic for both transformers
Transformer Functions From Derivation Editor
• Date & Time
• Logical
• Null Handling
• Number
• String
• Type Conversion
Sorting Data
• Important because
– Some stages require sorted input
– Some stages may run faster, e.g., Aggregator
• Can be performed
– Option within stages (use input > partitioning tab and set
partitioning to anything other than auto)
– As a separate stage (more complex sorts)
Sorting Alternatives
• Alternative representation of same flow:
Sort Option on Stage Link
Sort Stage
Sort Stage - Outputs
• Specifies how the output is derived
Sort Specification Options
• Input Link Property
– Limited functionality
– Max memory/partition is 20 MB, then spills to scratch
• Sort Stage
– Tunable to use more memory before spilling to scratch.
• Note: Spread I/O by adding more scratch file systems to each node
of the APT_CONFIG_FILE
Removing Duplicates
• Can be done by Sort stage
– Use unique option
OR
• Remove Duplicates stage
– Has more sophisticated ways to remove duplicates
Combining Data
• There are two ways to combine data:
– Horizontally:
Several input links; one output link (+ optional rejects) made of columns
from different input links. E.g.,
• Joins
• Lookup
• Merge
– Vertically:
One input link, one output link with column combining values from all input
rows. E.g.,
• Aggregator
Join, Lookup & Merge Stages
• These "three Stages" combine two or more input links according to
values of user-designated "key" column(s).
• They differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)
Not all Links are Created Equal
                               Joins     Lookup         Merge
Primary Input: port 0          Left      Source         Master
Secondary Input(s): ports 1,…  Right     LU Table(s)    Update(s)
• Enterprise Edition distinguishes between:
- The Primary Input (Framework port 0)
- Secondary - in some cases "Reference" (other ports)
• Naming convention:
Tip:
Check "Input Ordering" tab to make sure intended Primary is listed first
Join Stage Editor
One of four variants:
– Inner
– Left Outer
– Right Outer
– Full Outer
Several key columns
allowed
Link Order
immaterial for Inner
and Full Outer Joins
(but VERY important
for Left/Right Outer
and Lookup and
Merge)
1. The Join Stage
Four types:
• 2 sorted input links, 1 output link
– "left outer" on primary input, "right outer" on secondary input
– Pre-sort make joins "lightweight": few rows need to be in RAM
• Inner
• Left Outer
• Right Outer
• Full Outer
2. The Lookup Stage
Combines:
– one source link with
– one or more duplicate-free table links
no pre-sort necessary
allows multiple-key LUTs
flexible exception handling for
source input rows with no match
Source
input
One or more
tables (LUTs)
Output Reject
Lookup
0
1
2
0
1
The Lookup Stage
• Lookup Tables should be small enough to fit into physical
memory (otherwise, performance hit due to paging)
• On an MPP you should partition the lookup tables using entire
partitioning method, or partition them the same way you
partition the source link
• On an SMP, no physical duplication of a Lookup Table occurs
The Lookup Stage
• Lookup File Set
– Like a persistent data set only it contains
metadata about the key.
– Useful for staging lookup tables
• RDBMS LOOKUP
– NORMAL
• Loads to an in memory hash table first
– SPARSE
• Select for each row.
• Might become a performance
bottleneck.
3. The Merge Stage
• Combines
– one sorted, duplicate-free master (primary) link with
– one or more sorted update (secondary) links.
– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but
opposite to lookup).
• Follows the Master-Update model:
– Master row and one or more update rows are merged if they have the same value in
user-specified key column(s).
– A non-key column occurs in several inputs? The lowest input port number prevails (e.g.,
master over update; update values are ignored)
– Unmatched ("Bad") master rows can be either
• kept
• dropped
– Unmatched ("Bad") update rows in input link can be captured in a "reject" link
– Matched update rows are consumed.
The Merge Stage
Allows composite keys
Multiple update links
Matched update rows are consumed
Unmatched updates can be captured
Lightweight
Space/time tradeoff: presorts vs. in-RAM table
Master (port 0), one or more updates (ports 1, 2, …)
Output (port 0), rejects (ports 1, 2, …)
Merge
Synopsis: Joins, Lookup, & Merge
In this table, the comma separates primary and secondary input links (and output and reject links).
                                   Joins                         Lookup                              Merge
Model                              RDBMS-style relational        Source - in RAM LU Table            Master - Update(s)
Memory usage                       light                         heavy                               light
# and names of inputs              exactly 2: 1 left, 1 right    1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort               both inputs                   no                                  all inputs
Duplicates in primary input        OK (x-product)                OK                                  Warning!
Duplicates in secondary input(s)   OK (x-product)                Warning!                            OK only when N = 1
Options on unmatched primary       NONE                          [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary     NONE                          NONE                                capture in reject set(s)
On match, secondary entries are    reusable                      reusable                            consumed
# Outputs                          1                             1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)          Nothing (N/A)                 unmatched primary entries           unmatched secondary entries
The Aggregator Stage
Purpose: Perform data aggregations
Specify:
• Zero or more key columns that define the aggregation units (or
groups)
• Columns to be aggregated
• Aggregation functions:
count (nulls/non-nulls) sum
max/min/range
• The grouping method (hash table or pre-sort) is a performance
issue
Grouping Methods
• Hash: results for each aggregation group are stored in a hash table, and the table is
written out after all input has been processed
– doesn’t require sorted data
– good when number of unique groups is small. Running tally for each group’s
aggregate calculations need to fit easily into memory. Require about 1KB/group
of RAM.
– Example: average family income by state requires about 50 groups × 1 KB ≈ 0.05 MB of RAM
• Sort: results for only a single aggregation group are kept in memory; when new group
is seen (key value changes), current group written out.
– requires input sorted by grouping keys
– can handle unlimited numbers of groups
– Example: average daily balance by credit card
Aggregator Functions
• Sum
• Min, max
• Mean
• Missing value count
• Non-missing value count
• Percent coefficient of variation
Aggregator Properties
Aggregation Types
Aggregation types
Containers
• Two varieties
– Local
– Shared
• Local
– Simplifies a large, complex diagram
• Shared
– Creates reusable object that many jobs can include
Creating a Container
• Create a job
• Select (loop) portions to containerize
• Edit > Construct container > local or shared
Configuration File Concepts
• Determine the processing nodes and disk space connected to each
node
• When system changes, need only change the configuration file – no
need to recompile jobs
• When DataStage job runs, platform reads configuration file
– Platform automatically scales the application to fit the system
Processing Nodes Are
• Locations on which the framework runs applications
• Logical rather than physical construct
• Do not necessarily correspond to the number of CPUs in your
system
– Typically one node for two CPUs
• Can define one processing node for multiple physical nodes or
multiple processing nodes for one physical node
Optimizing Parallelism
• Degree of parallelism determined by number of nodes defined
• Parallelism should be optimized, not maximized
– Increasing parallelism distributes work load but also increases
Framework overhead
• Hardware influences degree of parallelism possible
• System hardware partially determines configuration
More Factors to Consider
• Communication amongst operators
– Should be optimized by your configuration
– Operators exchanging large amounts of data should be assigned
to nodes communicating by shared memory or high-speed link
• SMP – leave some processors for operating system
• Desirable to equalize partitioning of data
• Use an experimental approach
– Start with small data sets
– Try different parallelism while scaling up data set sizes
Configuration File
• Text file containing string data that is passed to the Framework
– Sits on server side
– Can be displayed and edited
• Name and location found in environmental variable
APT_CONFIG_FILE
• Components
– Node
– Fast name
– Pools
– Resource
Node Options
• Node name – name of a processing node used by EE
– Typically the network name
– Use the command uname -n to obtain the network name (see the example after this list)
• Fastname –
– Name of node as referred to by fastest network in the system
– Operators use physical node name to open connections
– NOTE: for SMP, all CPUs share single connection to network
• Pools
– Names of pools to which this node is assigned
– Used to logically group nodes
– Can also be used to group resources
• Resource
– Disk
– Scratchdisk
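For example, running uname -n on the host returns the value to use for the node and fastname entries (the host name shown is the illustrative "BlackHole" used in the sample configuration file on the next slide):
$ uname -n
BlackHole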
Sample Configuration File
{
node "Node1"
{
fastname "BlackHole"
pools "" "node1"
resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools "" }
resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools "" }
}
}
Disk Pools
• Disk pools allocate storage
• By default, EE uses the default
pool, specified by “”
pool "bigdata"
Sorting Requirements
Resource pools can also be specified for sorting:
• The Sort stage looks first for scratch disk resources in a
“sort” pool, and then in the default disk pool
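A hedged sketch of a node entry that adds a scratch disk to the "sort" pool, mirroring the sample configuration file syntax shown earlier (the /scratch/sort0 path is an illustrative assumption):
node "Node1"
{
fastname "BlackHole"
pools "" "node1"
resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools "" }
resource scratchdisk "/scratch/sort0" {pools "sort" }
resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools "" }
}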
Resource Types
• Disk
• Scratchdisk
• DB2
• Oracle
• Saswork
• Sortwork
• Can exist in a pool
– Groups resources together
Using Different Configurations
Lookup stage where DBMS is using a sparse lookup type
Building a Configuration File
• Scoping the hardware:
– Is the hardware configuration SMP, Cluster, or MPP?
– Define each node structure (an SMP would be single node):
• Number of CPUs
• CPU speed
• Available memory
• Available page/swap space
• Connectivity (network/back-panel speed)
– Is the machine dedicated to EE? If not, what other applications are running on
it?
– Get a breakdown of the resource usage (vmstat, mpstat, iostat)
– Are there other configuration restrictions? E.g. DB only runs on certain nodes
and ETL cannot run on them?
Wrappers vs. Buildop vs. Custom
• Wrappers are good if you cannot or do not want to modify the
application and performance is not critical.
• Buildops: good if you need custom coding but do not need
dynamic (runtime-based) input and output interfaces.
• Custom (C++ coding using framework API): good if you need custom
coding and need dynamic input and output interfaces.
Building “Wrapped” Stages
You can “wrapper” a legacy executable:
• Binary
• Unix command
• Shell script
… and turn it into a Enterprise Edition stage capable, among other things, of
parallel execution…
As long as the legacy executable is:
• amenable to data-partition parallelism
» no dependencies between rows
• pipe-safe
» can read rows sequentially
» no random access to data
Wrappers (Cont’d)
Wrappers are treated as a black box
• EE has no knowledge of contents
• EE has no means of managing anything that occurs inside the wrapper
• EE only knows how to export data to and import data from the wrapper
• User must know at design time the intended behavior of the wrapper and its
schema interface
• If the wrappered application needs to see all records prior to processing, it cannot
run in parallel.
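For instance (an illustrative assumption, separate from the course's ls example that follows), a simple pipe-safe Unix filter qualifies for wrapping because it reads rows from stdin sequentially, writes rows to stdout, and has no dependencies between rows:
# reads each input row in turn and writes one output row per input row
$ tr '[:lower:]' '[:upper:]' < names.txt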
LS Example
• Can this command be wrappered?
Creating a Wrapper
Used in this job ---
To create the “ls” stage
Creating Wrapped Stages
From Manager:
Right-Click on Stage Type
> New Parallel Stage > Wrapped
We will "Wrapper” an existing
Unix executables – the ls
command
Wrapper Starting Point
Wrapper - General Page
Unix command to be wrapped
Name of stage
Conscientiously
maintaining the Creator
page for all your wrapped
stages will eventually earn
you the thanks of others.
The "Creator" Page
Wrapper – Properties Page
• If your stage will have properties, complete the Properties
page
This will be the name of
the property as it
appears in your stage
Wrapper - Wrapped Page
Interfaces – input and output columns -
these should first be entered into the
table definitions meta data (DS
Manager); let’s do that now.
• Layout interfaces describe what columns the stage:
– Needs for its inputs (if any)
– Creates for its outputs (if any)
– Should be created as tables with columns in Manager
Interface schemas
Column Definition for Wrapper Interface
How Does the Wrapping Work?
– Define the schema for export and
import
• Schemas become interface
schemas of the operator and allow
for by-name column access
import
export
stdout or
named pipe
stdin or
named pipe
UNIX executable
output schema
input schema
Update the Wrapper Interfaces
• This wrapper will have no input interface – i.e. no input link. The
location will come as a job parameter that will be passed to the
appropriate stage property. Therefore, only the Output tab entry is
needed.
Resulting Job
Wrapped stage
Job Run
• Show file from Designer palette
Wrapper Story: Cobol Application
• Hardware Environment:
– IBM SP2, 2 nodes with 4 CPU’s per node.
• Software:
– DB2/EEE, COBOL, EE
• Original COBOL Application:
– Extracted source table, performed lookup against table in DB2, and Loaded results to
target table.
– 4 hours 20 minutes sequential execution
• Enterprise Edition Solution:
– Used EE to perform Parallel DB2 Extracts and Loads
– Used EE to execute COBOL application in Parallel
– EE Framework handled data transfer between
DB2/EEE and COBOL application
– 30 minutes 8-way parallel execution
Buildops
Buildop provides a simple means of extending beyond the functionality provided by EE,
but does not use an existing executable (like the wrapper)
Reasons to use Buildop include:
• Speed / Performance
• Complex business logic that cannot be easily represented
using existing stages
– Lookups across a range of values
– Surrogate key generation
– Rolling aggregates
• Build once and reusable everywhere within project, no
shared container necessary
• Can combine functionality from different stages into one
BuildOps
– The DataStage programmer encapsulates the business logic
– The Enterprise Edition interface called “buildop” automatically
performs the tedious, error-prone tasks: invoke needed header files,
build the necessary “plumbing” for a correct and efficient parallel
execution.
– Exploits extensibility of EE Framework
From Manager (or Designer):
Repository pane:
Right-Click on Stage Type
> New Parallel Stage > {Custom | Build | Wrapped}
• "Build" stages
from within Enterprise Edition
• "Wrapping” existing “Unix”
executables
BuildOp Process Overview
General Page
Identical
to Wrappers,
except:
Under the Build
Tab, your program!
Logic Tab for Business Logic
Enter Business C/C++
logic and arithmetic in four
pages under the Logic tab
Main code section goes in
Per-Record page- it will be
applied to all rows
NOTE: Code will need to
be Ansi C/C++ compliant.
If code does not compile
outside of EE, it won’t
compile within EE either!
Code Sections under Logic Tab
Temporary
variables
declared [and
initialized] here
Logic here is executed
once BEFORE
processing the FIRST
row
Logic here is executed
once AFTER
processing the LAST
row
I/O and Transfer
Under Interface tab: Input, Output & Transfer pages
First line:
output 0
Optional renaming
of
output port from
default "out0"
Write row
Input page: 'Auto Read'
Read next row
In-Repository
Table Definition
'False' setting,
not to interfere
with Transfer page
I/O and Transfer
• Transfer all columns from input to output.
• If page left blank or Auto Transfer = "False" (and RCP = "False")
Only columns in output Table Definition are written
First line:
Transfer of index 0
BuildOp Simple Example
• Example - sumNoTransfer
– Add input columns "a" and "b"; ignore other columns
that might be present in input
– Produce a new "sum" column
– Do not transfer input columns
sumNoTransfer
a:int32; b:int32
sum:int32
NO TRANSFER
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"
• Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred
From Peek:
No Transfer
Transfer
TRANSFER
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"
• Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)
Columns vs. Temporary C++ Variables
Columns
• DS-EE type
• Defined in Table Definitions
• Value refreshed from row
to row
Temp C++ variables
• C/C++ type
• Need declaration (in
Definitions or Pre-Loop
page)
• Value persistent
throughout "loop" over
rows, unless modified in
code
Custom Stage
• Reasons for a custom stage:
– Add EE operator not already in DataStage EE
– Build your own Operator and add to DataStage EE
• Use EE API
• Use Custom Stage to add new operator to EE canvas
Custom Stage
DataStage Manager > select Stage Types branch > right click
Custom Stage
Name of Orchestrate
operator to be used
Number of input and
output links allowed
Custom Stage – Properties Tab
The Result
Establishing Meta Data
• Data definitions
– Recordization and columnization
– Fields have properties that can be set at individual field level
• Data types in GUI are translated to types used by EE
– Described as properties on the format/columns tab (outputs or inputs pages)
OR
– Using a schema file (can be full or partial)
• Schemas
– Can be imported into Manager
– Can be pointed to by some job stages (i.e. Sequential)
Data Formatting – Record Level
• Format tab
• Meta data described on a record basis
• Record level properties
Data Formatting – Column Level
• Defaults for all columns
Column Overrides
• Edit row from within the columns tab
• Set individual column properties
Extended Column Properties
Field
and
string
settings
Extended Properties – String Type
• Note the ability to convert ASCII to EBCDIC
Editing Columns
Properties depend on
the data type
Schema
• Alternative way to specify column definitions for data used in EE jobs
• Written in a plain text file
• Can be written as a partial record definition
• Can be imported into the DataStage repository
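For illustration, a small schema file written in the record() syntax shown earlier for persistent datasets (the column names and the decimal[precision,scale] form are assumptions made for this example):
record (
partno: int32;
price: decimal[6,2];
description: string;
)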
Creating a Schema
• Using a text editor
– Follow correct syntax for definitions
– OR
• Import from an existing data set or file set
– On DataStage Manager import > Table Definitions > Orchestrate
Schema Definitions
– Select checkbox for a file with .fs or .ds
Importing a Schema
Schema location can be
on the server or local
work station
Data Types
• Date
• Decimal
• Floating point
• Integer
• String
• Time
• Timestamp
• Vector
• Subrecord
• Raw
• Tagged
Runtime Column Propagation
• DataStage EE is flexible about meta data. It can cope with the situation where
meta data isn’t fully defined. You can define part of your schema and specify
that, if your job encounters extra columns that are not defined in the meta data
when it actually runs, it will adopt these extra columns and propagate them
through the rest of the job. This is known as runtime column propagation (RCP).
• RCP is always on at runtime.
• Design and compile time column mapping enforcement.
– RCP is off by default.
– Enable first at project level. (Administrator project properties)
– Enable at job level. (job properties General tab)
– Enable at Stage. (Link Output Column tab)
Enabling RCP at Project Level
Enabling RCP at Job Level
Enabling RCP at Stage Level
• Go to output link’s columns tab
• For transformer you can find the output links columns tab by first going to stage
properties
Using RCP with Sequential Stages
• To utilize runtime column propagation in the sequential stage you
must use the “use schema” option
• Stages with this restriction:
– Sequential
– File Set
– External Source
– External Target
Runtime Column Propagation
• When RCP is Disabled
– DataStage Designer will enforce Stage Input Column to Output
Column mappings.
– At job compile time modify operators are inserted on output links in the
generated osh.
Runtime Column Propagation
• When RCP is Enabled
– DataStage Designer will not enforce mapping rules.
– No Modify operator inserted at compile time.
– Danger of runtime error if incoming column names do not match
column names on the outgoing link – case sensitive.
Job Control Options
• Manually write job control
– Code generated in Basic
– Use the job control tab on the job properties page
– Generates basic code which you can modify
• Job Sequencer
– Build a controlling job much the same way you build other jobs
– Comprised of stages and links
– No basic coding
Job Sequencer
• Build like a regular job
• Type “Job Sequence”
• Has stages and links
• Job Activity stage represents
a DataStage job
• Links represent passing
control
Stages
Example
Job Activity
stage –
contains
conditional
triggers
Job Activity Properties
Job parameters
to be passed
Job to be executed –
select from dropdown
Job Activity Trigger
• Trigger appears as a link in the diagram
• Custom options let you define the code
Options
• Use custom option for conditionals
– Execute if job run successful or warnings only
• Can add “wait for file” to execute
• Add “execute command” stage to drop real tables and rename new
tables to current tables
Job Activity With Multiple Links
Different links
having different
triggers
Sequencer Stage
• Build job sequencer to control job for the collections application
Can be set to
all or any
Notification
Notification Stage
Notification Activity
Sample DataStage log from Mail Notification
• Sample DataStage log from Mail Notification
Notification Activity Message
• E-Mail Message
Environment Variables
Parallel Environment Variables
Environment Variables Stage Specific
Environment Variables
Environment Variables Compiler
The Director
Typical Job Log Messages:
• Environment variables
• Configuration File information
• Framework Info/Warning/Error messages
• Output from the Peek Stage
• Additional info with "Reporting" environments
• Tracing/Debug output
– Must compile job in trace mode
– Adds overhead
• Job Properties, from Menu Bar of Designer
• Director will
prompt you
before each
run
Job Level Environmental Variables
Troubleshooting
If you get an error during compile, check the following:
• Compilation problems
– If Transformer used, check C++ compiler, LD_LIBRARY_PATH
– If Buildop errors, try buildop from the command line
– Some stages may not support RCP – can cause column mismatch
– Use the Show Error and More buttons
– Examine Generated OSH
– Check environment variables settings
• Very little integrity checking during compile, should run validate from Director.
Highlights source of error
Generating Test Data
• Row Generator stage can be used
– Column definitions
– Data type dependent
• Row Generator plus lookup stages provides good way to create
robust test data from pattern files
Thank You !!!
For more information, click the link below:
Follow Us on:
http://vibranttechnologies.co.in/datastage-classes-in-mumbai.html
Day 1 Data Stage Administrator And Director 11.0kshanmug2
 
Datastage real time scenario
Datastage real time scenarioDatastage real time scenario
Datastage real time scenarioNaresh Bala
 
SQL select statement and functions
SQL select statement and functionsSQL select statement and functions
SQL select statement and functionsVikas Gupta
 
Curriculum Vitae - Dinesh Babu S V
Curriculum Vitae - Dinesh Babu S VCurriculum Vitae - Dinesh Babu S V
Curriculum Vitae - Dinesh Babu S VDinesh Babu S V
 
Datastage developer Resume
Datastage developer ResumeDatastage developer Resume
Datastage developer ResumeMallikarjuna P
 
SQL Joins and Query Optimization
SQL Joins and Query OptimizationSQL Joins and Query Optimization
SQL Joins and Query OptimizationBrian Gallagher
 
Capturing Data Requirements
Capturing Data RequirementsCapturing Data Requirements
Capturing Data Requirementsmcomtraining
 
Datastage free tutorial
Datastage free tutorialDatastage free tutorial
Datastage free tutorialtekslate1
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities台灣資料科學年會
 

Destaque (20)

Sql server select queries ppt 18
Sql server select queries ppt 18Sql server select queries ppt 18
Sql server select queries ppt 18
 
Datastage
DatastageDatastage
Datastage
 
Data stage scenario design 2 - job1
Data stage scenario   design 2 - job1Data stage scenario   design 2 - job1
Data stage scenario design 2 - job1
 
Data stage faqs datastage faqs
Data stage faqs  datastage faqsData stage faqs  datastage faqs
Data stage faqs datastage faqs
 
Day 2 Data Stage Manager 11.0
Day 2 Data Stage Manager 11.0Day 2 Data Stage Manager 11.0
Day 2 Data Stage Manager 11.0
 
Sql joins inner join self join outer joins
Sql joins inner join self join outer joinsSql joins inner join self join outer joins
Sql joins inner join self join outer joins
 
Day 1 Data Stage Administrator And Director 11.0
Day 1 Data Stage Administrator And Director 11.0Day 1 Data Stage Administrator And Director 11.0
Day 1 Data Stage Administrator And Director 11.0
 
Ibm info sphere datastage tutorial part 1 architecture examples
Ibm info sphere datastage tutorial part 1  architecture examplesIbm info sphere datastage tutorial part 1  architecture examples
Ibm info sphere datastage tutorial part 1 architecture examples
 
Datastage real time scenario
Datastage real time scenarioDatastage real time scenario
Datastage real time scenario
 
SQL select statement and functions
SQL select statement and functionsSQL select statement and functions
SQL select statement and functions
 
Resume_Sathish
Resume_SathishResume_Sathish
Resume_Sathish
 
Curriculum Vitae - Dinesh Babu S V
Curriculum Vitae - Dinesh Babu S VCurriculum Vitae - Dinesh Babu S V
Curriculum Vitae - Dinesh Babu S V
 
Datastage developer Resume
Datastage developer ResumeDatastage developer Resume
Datastage developer Resume
 
SQL Joins
SQL JoinsSQL Joins
SQL Joins
 
SQL Joins and Query Optimization
SQL Joins and Query OptimizationSQL Joins and Query Optimization
SQL Joins and Query Optimization
 
Capturing Data Requirements
Capturing Data RequirementsCapturing Data Requirements
Capturing Data Requirements
 
SQL JOIN
SQL JOINSQL JOIN
SQL JOIN
 
Datastage free tutorial
Datastage free tutorialDatastage free tutorial
Datastage free tutorial
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 

Semelhante a Datastage Introduction To Data Warehousing

Evolutionary database design
Evolutionary database designEvolutionary database design
Evolutionary database designSalehein Syed
 
Amit Kumar_Resume
Amit Kumar_ResumeAmit Kumar_Resume
Amit Kumar_ResumeAmit Kumar
 
How to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer CertificationHow to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer Certificationelephantscale
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfRob Winters
 
Remote DBA Experts SQL Server 2008 New Features
Remote DBA Experts SQL Server 2008 New FeaturesRemote DBA Experts SQL Server 2008 New Features
Remote DBA Experts SQL Server 2008 New FeaturesRemote DBA Experts
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...Amazon Web Services
 
FME Server Workspace Patterns - Continued
FME Server Workspace Patterns - ContinuedFME Server Workspace Patterns - Continued
FME Server Workspace Patterns - ContinuedSafe Software
 
Staged Patching Approach in Oracle E-Business Suite
Staged Patching Approach in Oracle E-Business SuiteStaged Patching Approach in Oracle E-Business Suite
Staged Patching Approach in Oracle E-Business Suitevasuballa
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1sqlserver.co.il
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Victor Holman
 
Sap bods Training in Hyderabad | Sap bods Online Training
Sap bods Training in Hyderabad | Sap bods  Online Training Sap bods Training in Hyderabad | Sap bods  Online Training
Sap bods Training in Hyderabad | Sap bods Online Training CHENNAKESHAVAKATAGAR
 
Sap bods training in hyderabad
Sap bods training in hyderabadSap bods training in hyderabad
Sap bods training in hyderabadRajitha D
 
FME World Tour 2015 - FME & Data Migration Simon McCabe
FME World Tour 2015 -  FME & Data Migration Simon McCabeFME World Tour 2015 -  FME & Data Migration Simon McCabe
FME World Tour 2015 - FME & Data Migration Simon McCabeIMGS
 
Performing successful migrations to the microsoft cloud
Performing successful migrations to the microsoft cloudPerforming successful migrations to the microsoft cloud
Performing successful migrations to the microsoft cloudAndries den Haan
 
SharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSPC Adriatics
 
(ATS6-PLAT07) Managing AEP in an enterprise environment
(ATS6-PLAT07) Managing AEP in an enterprise environment(ATS6-PLAT07) Managing AEP in an enterprise environment
(ATS6-PLAT07) Managing AEP in an enterprise environmentBIOVIA
 
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...Amazon Web Services
 

Semelhante a Datastage Introduction To Data Warehousing (20)

Evolutionary database design
Evolutionary database designEvolutionary database design
Evolutionary database design
 
Amit Kumar_Resume
Amit Kumar_ResumeAmit Kumar_Resume
Amit Kumar_Resume
 
How to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer CertificationHow to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer Certification
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the Bijenkorf
 
Remote DBA Experts SQL Server 2008 New Features
Remote DBA Experts SQL Server 2008 New FeaturesRemote DBA Experts SQL Server 2008 New Features
Remote DBA Experts SQL Server 2008 New Features
 
Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
 
Boobalan_Muthukumarasamy_Resume_DW_8_Yrs
Boobalan_Muthukumarasamy_Resume_DW_8_YrsBoobalan_Muthukumarasamy_Resume_DW_8_Yrs
Boobalan_Muthukumarasamy_Resume_DW_8_Yrs
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
 
FME Server Workspace Patterns - Continued
FME Server Workspace Patterns - ContinuedFME Server Workspace Patterns - Continued
FME Server Workspace Patterns - Continued
 
Staged Patching Approach in Oracle E-Business Suite
Staged Patching Approach in Oracle E-Business SuiteStaged Patching Approach in Oracle E-Business Suite
Staged Patching Approach in Oracle E-Business Suite
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
 
Data migration
Data migrationData migration
Data migration
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
 
Sap bods Training in Hyderabad | Sap bods Online Training
Sap bods Training in Hyderabad | Sap bods  Online Training Sap bods Training in Hyderabad | Sap bods  Online Training
Sap bods Training in Hyderabad | Sap bods Online Training
 
Sap bods training in hyderabad
Sap bods training in hyderabadSap bods training in hyderabad
Sap bods training in hyderabad
 
FME World Tour 2015 - FME & Data Migration Simon McCabe
FME World Tour 2015 -  FME & Data Migration Simon McCabeFME World Tour 2015 -  FME & Data Migration Simon McCabe
FME World Tour 2015 - FME & Data Migration Simon McCabe
 
Performing successful migrations to the microsoft cloud
Performing successful migrations to the microsoft cloudPerforming successful migrations to the microsoft cloud
Performing successful migrations to the microsoft cloud
 
SharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi Vončina
 
(ATS6-PLAT07) Managing AEP in an enterprise environment
(ATS6-PLAT07) Managing AEP in an enterprise environment(ATS6-PLAT07) Managing AEP in an enterprise environment
(ATS6-PLAT07) Managing AEP in an enterprise environment
 
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
 

Mais de Vibrant Technologies & Computers

Data ware housing - Introduction to data ware housing process.
Data ware housing - Introduction to data ware housing process.Data ware housing - Introduction to data ware housing process.
Data ware housing - Introduction to data ware housing process.Vibrant Technologies & Computers
 

Mais de Vibrant Technologies & Computers (20)

Buisness analyst business analysis overview ppt 5
Buisness analyst business analysis overview ppt 5Buisness analyst business analysis overview ppt 5
Buisness analyst business analysis overview ppt 5
 
SQL Introduction to displaying data from multiple tables
SQL Introduction to displaying data from multiple tables  SQL Introduction to displaying data from multiple tables
SQL Introduction to displaying data from multiple tables
 
SQL- Introduction to MySQL
SQL- Introduction to MySQLSQL- Introduction to MySQL
SQL- Introduction to MySQL
 
SQL- Introduction to SQL database
SQL- Introduction to SQL database SQL- Introduction to SQL database
SQL- Introduction to SQL database
 
ITIL - introduction to ITIL
ITIL - introduction to ITILITIL - introduction to ITIL
ITIL - introduction to ITIL
 
Salesforce - Introduction to Security & Access
Salesforce -  Introduction to Security & Access Salesforce -  Introduction to Security & Access
Salesforce - Introduction to Security & Access
 
Data ware housing- Introduction to olap .
Data ware housing- Introduction to  olap .Data ware housing- Introduction to  olap .
Data ware housing- Introduction to olap .
 
Data ware housing - Introduction to data ware housing process.
Data ware housing - Introduction to data ware housing process.Data ware housing - Introduction to data ware housing process.
Data ware housing - Introduction to data ware housing process.
 
Data ware housing- Introduction to data ware housing
Data ware housing- Introduction to data ware housingData ware housing- Introduction to data ware housing
Data ware housing- Introduction to data ware housing
 
Salesforce - classification of cloud computing
Salesforce - classification of cloud computingSalesforce - classification of cloud computing
Salesforce - classification of cloud computing
 
Salesforce - cloud computing fundamental
Salesforce - cloud computing fundamentalSalesforce - cloud computing fundamental
Salesforce - cloud computing fundamental
 
SQL- Introduction to PL/SQL
SQL- Introduction to  PL/SQLSQL- Introduction to  PL/SQL
SQL- Introduction to PL/SQL
 
SQL- Introduction to advanced sql concepts
SQL- Introduction to  advanced sql conceptsSQL- Introduction to  advanced sql concepts
SQL- Introduction to advanced sql concepts
 
SQL Inteoduction to SQL manipulating of data
SQL Inteoduction to SQL manipulating of data   SQL Inteoduction to SQL manipulating of data
SQL Inteoduction to SQL manipulating of data
 
SQL- Introduction to SQL Set Operations
SQL- Introduction to SQL Set OperationsSQL- Introduction to SQL Set Operations
SQL- Introduction to SQL Set Operations
 
Sas - Introduction to designing the data mart
Sas - Introduction to designing the data martSas - Introduction to designing the data mart
Sas - Introduction to designing the data mart
 
Sas - Introduction to working under change management
Sas - Introduction to working under change managementSas - Introduction to working under change management
Sas - Introduction to working under change management
 
SAS - overview of SAS
SAS - overview of SASSAS - overview of SAS
SAS - overview of SAS
 
Teradata - Architecture of Teradata
Teradata - Architecture of TeradataTeradata - Architecture of Teradata
Teradata - Architecture of Teradata
 
Teradata - Restoring Data
Teradata - Restoring Data Teradata - Restoring Data
Teradata - Restoring Data
 

Último

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Último (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Datastage Introduction To Data Warehousing

  • 45. Transformer Stage • Used to define constraints, derivations, and column mappings • A column mapping maps an input column to an output column • In this module we will define only column mappings (no derivations)
  • 50. Adding Job Parameters • Makes the job more flexible • Parameters can be: – Used in constraints and derivations – Used in directory and file names • Parameter values are determined at run time
  • 51. Adding Job Documentation • Job Properties – Short and long descriptions – Shows in Manager • Annotation stage – Is a stage on the tool palette – Shows on the job GUI (work area)
  • 53. Annotation Stage on the Palette
  • 55. Final Job Work Area with Documentation
  • 58. Prerequisite to Job Execution Result from Designer compile
  • 59. DataStage Director • Can schedule, validate, and run jobs • Can be invoked from DataStage Manager or Designer – Tools > Run Director
  • 61. Run Options – Parameters and Limits
  • 63. Message Details are Available
  • 64. Other Director Functions • Schedule job to run on a particular date/time • Clear job log • Set Director options – Row limits – Abort after x warnings
  • 65. Process Flow • Administrator – add/delete projects, set defaults • Manager – import meta data, backup projects • Designer – assemble jobs, compile, and execute • Director – execute jobs, examine job run logs
  • 67. Administrator – Project Creation/Removal Functions specific to a project.
  • 68. Administrator – Project Properties RCP for parallel jobs should be enabled Variables for parallel processing
  • 69. Administrator – Environment Variables Variables are category specific
  • 70. OSH is what is run by the EE Framework
  • 72. Export Objects to MetaStage Push meta data to MetaStage
  • 73. Designer Workspace Can execute the job from Designer
  • 74. DataStage Generated OSH The EE Framework runs OSH
  • 75. Director – Executing Jobs Messages from previous run in different color
  • 76. Stages • You can now customize the Designer’s palette: select the desired stages and drag them to Favorites
  • 78. Row Generator • Can build test data (set the Repeatable property and edit rows on the Columns tab)
  • 79. Peek • Displays field values – Will be displayed in job log or sent to a file – Skip records option – Can control number of records to be displayed • Can be used as stub stage for iterative development (more later)
  • 80. Why EE is so Effective • Parallel processing paradigm – More hardware, faster processing – Level of parallelization is determined by a configuration file read at runtime • Emphasis on memory – Data read into memory and lookups performed like hash table
  • 81. Parallel Processing Systems • DataStage EE enables parallel processing = executing your application on multiple CPUs simultaneously – If you add more resources (CPUs, RAM, and disks) you increase system performance • Example: a system containing 6 CPUs (or processing nodes) and disks
  • 82. Scalable Systems: Examples • Three main types of scalable systems – Symmetric Multiprocessors (SMP): shared memory and disk – Clusters: UNIX systems connected via networks – MPP: Massively Parallel Processing
  • 83. SMP: Shared Everything • Multiple CPUs with a single operating system • Programs communicate using shared memory • All CPUs share system resources (OS, memory with a single linear address space, disks, I/O) • When used with Enterprise Edition: data transport uses shared memory; simplified startup
  • 84. Traditional Batch Processing (diagram: Operational Data and Archived Data are transformed, cleaned, and loaded into the Data Warehouse, landing to disk between operations) • Write to disk and read from disk before each processing operation • Sub-optimal utilization of resources – a 10 GB stream leads to 70 GB of I/O – processing resources can sit idle during I/O • Very complex to manage (lots and lots of small jobs) • Becomes impractical with big data volumes
  • 85. Data Pipelining (Pipeline Multiprocessing) • Transform, clean, and load processes execute simultaneously on the same processor; rows move forward through the flow from source to target • Start a downstream process while an upstream process is still running • This eliminates intermediate storing to disk, which is critical for big data • This also keeps the processors busy • Still has limits on scalability
  • 86. Data Partitioning (Partition Parallelism; diagram: source data split across Node 1–Node 4 by key range A-F, G-M, N-T, U-Z, each node running the same Transform) • Break up big data into partitions • Run one partition on each processor • 4X faster on 4 processors – with data big enough: 100X faster on 100 processors • This is exactly how parallel databases work! • Data partitioning requires applying the same transform to all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform
  • 87. Putting It All Together: Parallel Dataflow – combining parallelism types (diagram: Source Data → Transform → Clean → Load → Data Warehouse, with pipelining and partitioning applied together)
  • 88. Putting It All Together: Parallel Dataflow with Repartitioning on the Fly, Without Landing to Disk! (diagram: data is repartitioned between Transform, Clean, and Load – e.g., by customer last name, customer zip code, then credit card number)
  • 89. EE Program Elements • Dataset: uniform set of rows in the Framework's internal representation - Three flavors: 1. file sets *.fs : stored on multiple Unix files as flat files 2. persistent: *.ds : stored on multiple Unix files in Framework format read and written using the DataSet Stage 3. virtual: *.v : links, in Framework format, NOT stored on disk - The Framework processes only datasets—hence possible need for Import - Different datasets typically have different schemas - Convention: "dataset" = Framework data set. • Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file). - All the partitions of a dataset follow the same schema: that of the dataset
  • 90. DataStage EE Architecture (diagram; example sequential dataflow: Import → Clean1 → Clean2 → Merge → Analyze) • Orchestrate Framework: provides application scalability – configuration file, centralized error handling and event logging, parallel access to data in files and in RDBMSs, inter-node communications, parallel pipelining, parallelization of operations, performance visualization • DataStage: provides the data integration platform • DataStage Enterprise Edition: best-of-breed scalable data integration platform – no limitations on data volumes or throughput
  • 91. Introduction to DataStage EE • DSEE: – Automatically scales to fit the machine – Handles data flow among multiple CPU’s and disks • With DSEE you can: – Create applications for SMP’s, clusters and MPP’s… Enterprise Edition is architecture-neutral – Access relational databases in parallel – Execute external applications in parallel – Store data across multiple disks and nodes
  • 92. Job Design vs. Execution • The developer assembles the data flow using the Designer … and gets parallel access, propagation, transformation, and load • The design is good for 1 node, 4 nodes, or N nodes – to change the number of nodes, just swap the configuration file • No need to modify or recompile the design
  • 93. Partitioners and Collectors • Partitioners distribute rows into partitions – implement data-partition parallelism • Collectors = inverse partitioners • Live on input links of stages running – in parallel (partitioners) – sequentially (collectors) • Use a choice of methods
  • 95. Types of Sequential Data Stages • Sequential – Fixed or variable length • File Set • Lookup File Set • Data Set
  • 96. Sequential Stage Introduction • The EE Framework processes only datasets • For files other than datasets, such as flat files, Enterprise Edition must perform import and export operations – this is performed by import and export OSH operators generated by Sequential or FileSet stages • During import or export DataStage performs format translations – into, or out of, the EE internal format • Data is described to the Framework in a schema
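  As an illustration of the schema format mentioned above (a minimal sketch that simply reuses the record syntax and example field names shown on later slides of this material):

    record (
      prov_num: int16;
      member_num: int8;
      custid: int32;
    )

  A schema like this tells the Framework how to interpret each imported record before the data enters the EE internal format.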
  • 97. How the Sequential Stage Works • Generates Import/Export operators, depending on whether stage is source or target • Performs direct C++ file I/O streams
  • 98. Using the Sequential File Stage: Importing/Exporting Data • Both import and export of general files (text, binary) are performed by the Sequential File stage – data import converts the file into the EE internal format; data export converts it back out of the EE internal format
  • 99. Working With Flat Files • Sequential File Stage – Normally will execute in sequential mode – Can be parallel if reading multiple files (file pattern option) – Can use multiple readers within a node – DSEE needs to know • How file is divided into rows • How row is divided into columns
  • 100. Processes Needed to Import Data • Recordization – Divides input stream into records – Set on the format tab • Columnization – Divides the record into columns – Default set on the format tab but can be overridden on the columns tab – Can be “incomplete” if using a schema or not even specified in the stage if using RCP
  • 101. File Format Example (diagram: each record consists of Field 1 … Last field, separated by a field delimiter – here a comma – with a final delimiter of either a comma or the end of the record, and a record delimiter of nl, i.e. a newline)
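  For example (values are illustrative only), a file matching the layout above, with a comma as the field delimiter and a newline (nl) as the record delimiter, might contain:

    101,Widget,12.50
    102,Gadget,7.25

  Each comma separates one field from the next, and each newline ends a record.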
  • 102. Sequential File Stage • To set the properties, use the stage editor – Pages (General, Input/Output) – Tabs (Format, Columns) • Sequential stage link rules – One input link – One output link (except for reject link definition) – One reject link • Will reject any records not matching the meta data in the column definitions
  • 103. Job Design Using Sequential Stages Stage categories
  • 104. General Tab – Sequential Source Multiple output links Show records
  • 105. Properties – Multiple Files Click to add more files having the same meta data.
  • 106. Properties - Multiple Readers Multiple readers option allows you to set number of readers
  • 107. Format Tab File into records Record into columns
  • 109. Reject Link • Reject mode = output • Source – All records not matching the meta data (the column definitions) • Target – All records that are rejected for any reason • Meta data – one column, datatype = raw
  • 110. File Set Stage • Can read or write file sets • Files suffixed by .fs • File set consists of: 1. Descriptor file – contains location of raw data files + meta data 2. Individual raw data files • Can be processed in parallel
  • 111. File Set Stage Example Descriptor file
  • 112. File Set Usage • Why use a file set? – 2 GB limit on some file systems – Need to distribute data among nodes to prevent overruns – If used in parallel, runs faster than a sequential file
  • 113. Lookup File Set Stage • Can create file sets • Usually used in conjunction with Lookup stages
  • 114. Lookup File Set > Properties Key column specified Key column dropped in descriptor file
  • 115. Data Set • Operating system (Framework) file • Suffixed by .ds • Referred to by a control file • Managed by Data Set Management utility from GUI (Manager, Designer, Director) • Represents persistent data • Key to good performance in set of linked jobs
  • 116. Persistent Datasets • Accessed from/to disk with the DataSet stage • Two parts: – Descriptor file: contains metadata and data location, but NOT the data itself – Data file(s): contain the data; multiple Unix files (one per node), accessible in parallel • (Example: input.ds points to node1:/local/disk1/… and node2:/local/disk2/…, with schema record ( partno: int32; description: string; ))
  • 117. Data Set Stage Is the data partitioned?
  • 118. Engine Data Translation • Occurs on import – From sequential files or file sets – From RDBMS • Occurs on export – From datasets to file sets or sequential files – From datasets to RDBMS • Engine most efficient when processing internally formatted records (I.e. data contained in datasets)
  • 119. Managing DataSets • GUI (Manager, Designer, Director) – tools > data set management • Alternative methods – Orchadmin • Unix command line utility • List records • Remove data sets (will remove all components) – Dsrecords • Lists number of records in a dataset
  • 121. Data Set Management From Unix • Alternative method of managing file sets and data sets – Dsrecords • Gives the record count • Unix command-line utility • $ dsrecords ds_name, e.g. $ dsrecords myDS.ds returns 156999 records – Orchadmin • Manages EE persistent data sets • Unix command-line utility, e.g. $ orchadmin rm myDataSet.ds
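  Putting the two utilities together as a minimal command-line sketch (the data set name is illustrative; the record count is the example output quoted above):

    $ dsrecords myDS.ds
    156999 records
    $ orchadmin rm myDS.ds

  dsrecords reports the number of records in the persistent data set; orchadmin rm then removes the data set together with all of its component files.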
  • 122. Job Presentation Document using the annotation stage
  • 123. Job Properties Documentation Description shows in DS Manager and MetaStage Organize jobs into categories
  • 124. Naming conventions • Stages named after the – Data they access – Function they perform – DO NOT leave defaulted stage names like Sequential_File_0 • Links named for the data they carry – DO NOT leave defaulted link names like DSLink3
  • 125. Stage and Link Names Stages and links renamed to data they handle
  • 126. Create Reusable Job Components • Use Enterprise Edition shared containers when feasible Container
  • 127. Use Iterative Job Design • Use copy or peek stage as stub • Test job in phases – small first, then increasing in complexity • Use Peek stage to examine records
  • 128. Copy or Peek Stage Stub Copy stage
  • 129. Transformer Stage Techniques • Suggestions: – Always include a reject link. – Always test for null values before using a column in a function. – Try to use RCP and only map columns that have a derivation other than a copy (more on RCP later). – Be aware of column and stage variable data types; users often do not pay attention to the stage variable type. – Avoid type conversions; try to maintain the data type as imported.
  • 130. The Copy Stage • With 1 link in and 1 link out, the Copy stage is the ultimate "no-op" (place-holder) • Partitioners, Sort / Remove Duplicates, Rename, Drop column … can be inserted on: – the input link (Partitioning): partitioners, Sort, Remove Duplicates – the output link (Mapping page): Rename, Drop • It can sometimes replace the Transformer: – Rename – Drop – Implicit type conversions
  • 131. Developing Jobs 1. Keep it simple • Jobs with many stages are hard to debug and maintain. 2. Start small and build to the final solution • Use view data, Copy, and Peek. • Start from the source and work outward. • Develop with a 1-node configuration file. 3. Solve the business problem before the performance problem. • Don’t worry too much about partitioning until the sequential flow works as expected. 4. If you have to write to disk, use a persistent data set.
  • 133. Good Things to Have in Each Job • Use job parameters • Some helpful environment variables to add to job parameters – $APT_DUMP_SCORE • Reports the OSH score to the message log – $APT_CONFIG_FILE • Establishes runtime parameters to the EE engine, e.g. the degree of parallelization
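  A minimal sketch of how these variables could be supplied from a Unix shell before a run (the configuration file path is an assumption; inside DataStage they would normally be added as job parameters as described above):

    $ export APT_CONFIG_FILE=/opt/dsee/configs/4node.apt
    $ export APT_DUMP_SCORE=True

  APT_CONFIG_FILE selects the runtime configuration (and therefore the degree of parallelization); APT_DUMP_SCORE=True asks the engine to write the job score report to the log.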
  • 134. Setting Job Parameters Click to add environment variables
  • 135. DUMP SCORE Output • Setting APT_DUMP_SCORE yields a report in the log showing the mapping of nodes to partitions and the partitioners and collectors used
  • 136. Use Multiple Configuration Files • Make a set for 1X, 2X,…. • Use different ones for test versus production • Include as a parameter in each job
  • 137. Parallel Database Connectivity (diagram: traditional client-server vs Enterprise Edition) • Traditional client-server: only the RDBMS runs in parallel; each application (client) has only one connection to the parallel RDBMS • Enterprise Edition: the parallel server runs the applications (e.g. Sort, Load), and the application has parallel connections to the parallel RDBMS
  • 138. RDBMS Access Supported Databases Enterprise Edition provides high performance / scalable interfaces for: • DB2 • Informix • Oracle • Teradata
  • 139. RDBMS Access • Automatically convert RDBMS table layouts to/from Enterprise Edition Table Definitions • RDBMS nulls converted to/from nullable field values • Support for standard SQL syntax for specifying: – field list for SELECT statement – filter for WHERE clause • Can write an explicit SQL query to access RDBMS • EE supplies additional information in the SQL query
  • 140. RDBMS Stages • DB2/UDB Enterprise • Informix Enterprise • Oracle Enterprise • Teradata Enterprise
  • 141. RDBMS Usage • As a source – Extract data from table (stream link) – Extract as table, generated SQL, or user-defined SQL – User-defined can perform joins, access views – Lookup (reference link) – Normal lookup is memory-based (all table data read into memory) – Can perform one lookup at a time in DBMS (sparse option) – Continue/drop/fail options • As a target – Inserts – Upserts (Inserts and updates) – Loader
  • 142. RDBMS Source – Stream Link Stream link
  • 143. DBMS Source - User-defined SQL Columns in SQL statement must match the meta data in columns tab
  • 144. DBMS Source – Reference Link Reject link
  • 145. Lookup Reject Link “Output” option automatically creates the reject link
  • 146. Null Handling • Must handle null condition if lookup record is not found and “continue” option is chosen • Can be done in a transformer stage
  • 148. Lookup Stage Properties (reference link) • Must have the same column name in the input and reference links • You will get the results of the lookup in the output column
  • 149. DBMS as a Target
  • 150. DBMS As Target • Write Methods – Delete – Load – Upsert – Write (DB2) • Write mode for load method – Truncate – Create – Replace – Append
  • 152. Checking for Nulls • Use Transformer stage to test for fields with null values (Use IsNull functions) • In Transformer, can reject or load default value
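  A sketch of the kind of derivation this implies (link and column names are hypothetical; the expression uses the IsNull function named above in the Transformer derivation syntax):

    If IsNull(lkpCustomer.Description) Then 'UNKNOWN' Else lkpCustomer.Description

  Here a failed lookup with the “continue” option leaves Description null, and the derivation substitutes a default value instead.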
  • 153. Concepts • The Enterprise Edition Platform – Script language: OSH (generated by the DataStage Parallel Canvas and run by DataStage Director) – Communication: conductor, section leaders, players – Configuration files (only one active at a time; describes the hardware) – Meta data: schemas/tables – Schema propagation: RCP – EE extensibility: Buildop, Wrapper – Datasets (data in the Framework's internal representation)
  • 154. DS-EE Stage Elements (diagram: an EE stage involves a series of processing steps – input interface, partitioner, business logic, output interface; example input and output data set schema: prov_num:int16; member_num:int8; custid:int32) • A piece of application logic running against individual records • Parallel or sequential
  • 155. DSEE Stage Execution • EE delivers parallelism in two ways – Pipeline – Partition • Block buffering between components – eliminates the need for program load balancing – maintains orderly data flow • Dual parallelism eliminates bottlenecks! (diagram: producer → consumer, pipelined and partitioned)
  • 156. Stages Control Partition Parallelism • Execution Mode (sequential/parallel) is controlled by Stage – default = parallel for most Ascential-supplied Stages – Developer can override default mode – Parallel Stage inserts the default partitioner (Auto) on its input links – Sequential Stage inserts the default collector (Auto) on its input links – Developer can override default • execution mode (parallel/sequential) of Stage > Advanced tab • choice of partitioner/collector on Input > Partitioning tab
  • 157. How Parallel Is It? • Degree of parallelism is determined by the configuration file – Total number of logical nodes in default pool, or a subset if using "constraints". • Constraints are assigned to specific pools as defined in configuration file and can be referenced in the stage
  • 158. OSH • DataStage EE GUI generates OSH scripts – Ability to view OSH turned on in Administrator – OSH can be viewed in Designer using job properties • The Framework executes OSH • What is OSH? – Orchestrate shell – Has a UNIX command-line interface
  • 159. OSH Script • An osh script is a quoted string which specifies: – The operators and connections of a single Orchestrate step – In its simplest form, it is: osh “op < in.ds > out.ds” • Where: – op is an Orchestrate operator – in.ds is the input data set – out.ds is the output data set
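  Building on the simplest form above, operators can also be chained so that the output of one becomes the input of the next as a virtual dataset (a sketch only; op1 and op2 stand for any Orchestrate operators and are not specific operator names):

    osh “op1 < in.ds | op2 > out.ds”

  The pipe connects the two operators directly, so no intermediate data set is landed to disk between them.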
  • 160. OSH Operators • OSH Operator is an instance of a C++ class inheriting from APT_Operator • Developers can create new operators • Examples of existing operators: – Import – Export – RemoveDups
  • 161. Enable Visible OSH in Administrator Will be enabled for all projects
  • 162. View OSH in Designer Schema Operator
  • 163. Elements of a Framework Program • Operators • Datasets: sets of rows processed by the Framework – Orchestrate data sets: persistent (terminal) *.ds and virtual (internal) *.v – Also: flat “file sets” *.fs • Schema: data description (metadata) for datasets and links
  • 164. Datasets • Consist of partitioned data and schema • Can be persistent (*.ds) or virtual (*.v, a link) • Overcome the 2 GB file limit (diagram: what you program – GUI or $ osh “operator_A > x.ds“ – versus what gets generated and processed: Operator A runs on Node 1 through Node 4, producing multiple data files of x.ds per partition, each file up to 2 GB or larger)
  • 165. Computing Architectures: Definition (diagram) • Uniprocessor: one CPU with dedicated disk and memory • SMP system (Symmetric Multiprocessor): multiple CPUs with shared memory and shared disk • Clusters and MPP systems: shared nothing – each CPU has its own memory and disk
  • 166. Job Execution: Orchestrate • Conductor – the initial DS/EE process – Step composer – Creates Section Leader processes (one per node) – Consolidates messages and outputs them – Manages orderly shutdown • Section Leader – Forks Player processes (one per stage) – Manages up/down communication • Players – The actual processes associated with stages – Combined players: one process only – Send stderr to the Section Leader – Establish connections to other players for data flow (diagram: a Conductor node with the Conductor process; processing nodes each with a Section Leader and its Players)
  • 167. Working with Configuration Files • You can easily switch between config files: – a '1-node' file for sequential execution and lighter reports (handy for testing) – a 'MedN-nodes' file that aims at a mix of pipeline and data-partitioned parallelism – a 'BigN-nodes' file that aims at full data-partitioned parallelism • Only one file is active while a step is running • The Framework queries (first) the environment variable $APT_CONFIG_FILE • The number of nodes declared in the config file need not match the number of CPUs • The same configuration file can be used on development and target machines
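  For reference, a minimal single-node configuration file might look like the following sketch (the node name, fastname, and resource paths are assumptions; a 'BigN-nodes' file would simply list more node blocks):

    {
      node "node1" {
        fastname "devserver"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
    }

  Pointing $APT_CONFIG_FILE at this file gives sequential execution; swapping in a multi-node file changes the degree of parallelism without touching the job design.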
  • 168. Scheduling Nodes, Processes, and CPUs • DS/EE does not: – know how many CPUs are available – schedule processes • Who knows what, and who does what? – The user declares the logical nodes in the configuration file – DS/EE (Orchestrate) knows the nodes and the operators and creates (Nodes × Ops) Unix processes – The O/S schedules these processes on the available CPUs • Where: Nodes = number of logical nodes declared in the config file; Ops = number of operators (approximately the number of stages in the visual flow); Processes = number of Unix processes = Nodes × Ops; CPUs = number of available CPUs
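  As a worked example of the Nodes × Ops rule: a configuration file declaring 4 logical nodes running a flow of 5 operators produces roughly 4 × 5 = 20 Unix player processes, and it is the operating system, not DS/EE, that schedules those 20 processes across however many CPUs are actually present.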
  • 169. Re-Partitioning • Parallel-to-parallel flow may incur reshuffling: records may jump between nodes (diagram: a partitioner sits between node 1 and node 2)
  • 170. Partitioning Methods • Auto • Hash • Entire • Range • Range Map
  • 171. Collectors • Collectors combine the partitions of a dataset into a single input stream to a sequential stage • Collectors do NOT synchronize data
  • 172. Partitioning and Repartitioning Are Visible On Job Design
  • 173. Partitioning and Collecting Icons Partitioner Collector
  • 174. Reading Messages in Director • Set APT_DUMP_SCORE to true • Can be specified as job parameter • Messages sent to Director log • If set, parallel job will produce a report showing the operators, processes, and datasets in the running job
  • 176. Transformed Data • Transformed data is: – An outgoing column whose derivation may, or may not, include incoming fields or parts of incoming fields – May be comprised of system variables • Frequently uses functions performed on something (i.e. incoming columns) – Functions are divided into categories, e.g.: • Date and time • Mathematical • Logical • Null handling • More
  • 177. Stages Review • Stages that can transform data – Transformer • Parallel • Basic (from Parallel palette) – Aggregator (discussed in later module) • Sample stages that do not transform data – Sequential – FileSet – DataSet – DBMS
  • 178. Transformer Stage Functions • Control data flow • Create derivations
  • 179. Flow Control • Separate records flow down links based on data condition – specified in Transformer stage constraints • Transformer stage can filter records • Other stages can filter records but do not exhibit advanced flow control – Sequential can send bad records down reject link – Lookup can reject records based on lookup failure – Filter can select records based on data value
  • 180. Rejecting Data • Reject option on sequential stage – Data does not agree with meta data – Output consists of one column with binary data type • Reject links (from Lookup stage) result from the drop option of the property “If Not Found” – Lookup “failed” – All columns on reject link (no column mapping option) • Reject constraints are controlled from the constraint editor of the transformer – Can control column mapping – Use the “Other/Log” checkbox
  • 181. Rejecting Data Example “If Not Found” property Constraint – Other/log option Property Reject Mode = Output
  • 183. Transformer Stage Variables • The first of the Transformer stage entities to execute • Execute in order from top to bottom – Can write a program by using one stage variable to point to the results of a previous stage variable • Multi-purpose – Counters – Hold values from previous rows for comparison – Hold derivations to be used in multiple field derivations – Can be used to control execution of constraints
  • 185. Transforming Data • Derivations – Using expressions – Using functions • Date/time • Transformer Stage Issues – Sometimes require sorting before the Transformer stage – E.g., using a stage variable as an accumulator and needing to break on a change of column value • Checking for nulls
  • 186. Checking for Nulls • Nulls can get introduced into the dataflow because of failed lookups and the way in which you chose to handle this condition • Can be handled in constraints, derivations, stage variables, or a combination of these
  • 187. Transformer - Handling Rejects Constraint Rejects – All expressions are false and reject row is checked
  • 188. Transformer: Execution Order • Derivations in stage variables are executed first • Constraints are executed before derivations • Column derivations in earlier links are executed before later links • Derivations in higher columns are executed before lower columns
  • 189. Parallel Palette - Two Transformers • All > Processing > Transformer – The non-Universe transformer – Has a specific set of functions – No DS routines available • Parallel > Processing > Basic Transformer – Makes server-style transforms available on the parallel palette – Can use DS routines • Program in Basic for both transformers
  • 190. Transformer Functions From Derivation Editor • Date & Time • Logical • Null Handling • Number • String • Type Conversion
  • 191. Sorting Data • Important because – Some stages require sorted input – Some stages may run faster – E.g., the Aggregator • Can be performed – As an option within stages (use the input > partitioning tab and set partitioning to anything other than Auto) – As a separate stage (for more complex sorts)
  • 192. Sorting Alternatives • Alternative representation of same flow:
  • 193. Sort Option on Stage Link
  • 195. Sort Stage - Outputs • Specifies how the output is derived
  • 196. Sort Specification Options • Input Link Property – Limited functionality – Max memory/partition is 20 MB, then spills to scratch • Sort Stage – Tunable to use more memory before spilling to scratch. • Note: Spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE
  • 197. Removing Duplicates • Can be done by Sort stage – Use unique option OR • Remove Duplicates stage – Has more sophisticated ways to remove duplicates
  • 198. Combining Data • There are two ways to combine data: – Horizontally: Several input links; one output link (+ optional rejects) made of columns from different input links. E.g., • Joins • Lookup • Merge – Vertically: One input link; one output link whose columns combine values from multiple input rows. E.g., • Aggregator
  • 199. Join, Lookup & Merge Stages • These "three Stages" combine two or more input links according to values of user-designated "key" column(s). • They differ mainly in: – Memory usage – Treatment of rows with unmatched key values – Input requirements (sorted, de-duplicated)
  • 200. Not All Links Are Created Equal • Enterprise Edition distinguishes between: – The Primary Input (Framework port 0) – Secondary inputs - in some cases "Reference" (other ports) • Naming convention: – Primary Input (port 0): Joins = Left, Lookup = Source, Merge = Master – Secondary Input(s) (ports 1, …): Joins = Right, Lookup = LU Table(s), Merge = Update(s) • Tip: Check the "Input Ordering" tab to make sure the intended Primary is listed first
  • 201. Join Stage Editor One of four variants: – Inner – Left Outer – Right Outer – Full Outer Several key columns allowed Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)
  • 202. 1. The Join Stage • 2 sorted input links, 1 output link – "left outer" on the primary input, "right outer" on the secondary input – Pre-sorting makes joins "lightweight": few rows need to be in RAM • Four types: – Inner – Left Outer – Right Outer – Full Outer
  • 203. 2. The Lookup Stage • Combines: – one source link with – one or more duplicate-free lookup table (LUT) links • No pre-sort necessary • Allows multiple keys • Flexible exception handling for source input rows with no match (diagram: the source input and one or more LUTs feed the Lookup stage, which produces an output link and a reject link)
  • 204. The Lookup Stage • Lookup tables should be small enough to fit into physical memory (otherwise, a performance hit due to paging) • On an MPP you should partition the lookup tables using the Entire partitioning method, or partition them the same way you partition the source link • On an SMP, no physical duplication of a lookup table occurs
  • 205. The Lookup Stage • Lookup File Set – Like a persistent data set, except that it also contains metadata about the key – Useful for staging lookup tables • RDBMS lookup – NORMAL • Loads the table into an in-memory hash table first – SPARSE • Issues a select for each incoming row • Might become a performance bottleneck
  • 206. 3. The Merge Stage • Combines – one sorted, duplicate-free master (primary) link with – one or more sorted update (secondary) links – Pre-sorting makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup) • Follows the Master-Update model: – A master row and one or more update rows are merged if they have the same value in the user-specified key column(s) – If a non-key column occurs in several inputs, the lowest input port number prevails (e.g., master over update; update values are ignored) – Unmatched ("bad") master rows can be either • kept • dropped – Unmatched ("bad") update rows in an input link can be captured in a "reject" link – Matched update rows are consumed
  • 207. The Merge Stage • Allows composite keys • Multiple update links • Matched update rows are consumed • Unmatched updates can be captured • Lightweight • Space/time tradeoff: presorts vs. in-RAM table (diagram: one master link and one or more update links feed the Merge stage, which produces one output link and one reject link per update link)
  • 208. Synopsis: Joins, Lookup, & Merge (in the original table, the comma separates values for primary vs. secondary input links, and out vs. reject links) – Model – Joins: RDBMS-style relational; Lookup: Source - in-RAM LU Table; Merge: Master - Update(s) – Memory usage – Joins: light; Lookup: heavy; Merge: light – # and names of inputs – Joins: exactly 2 (1 left, 1 right); Lookup: 1 Source, N LU Tables; Merge: 1 Master, N Update(s) – Mandatory input sort – Joins: both inputs; Lookup: no; Merge: all inputs – Duplicates in primary input – Joins: OK (x-product); Lookup: OK; Merge: Warning! – Duplicates in secondary input(s) – Joins: OK (x-product); Lookup: Warning!; Merge: OK only when N = 1 – Options on unmatched primary – Joins: NONE; Lookup: [fail] | continue | drop | reject; Merge: [keep] | drop – Options on unmatched secondary – Joins: NONE; Lookup: NONE; Merge: capture in reject set(s) – On match, secondary entries are – Joins: reusable; Lookup: reusable; Merge: consumed – # Outputs – Joins: 1; Lookup: 1 out (1 reject); Merge: 1 out (N rejects) – Captured in reject set(s) – Joins: nothing (N/A); Lookup: unmatched primary entries; Merge: unmatched secondary entries
  • 209. The Aggregator Stage • Purpose: Perform data aggregations • Specify: • Zero or more key columns that define the aggregation units (or groups) • Columns to be aggregated • Aggregation functions: count (nulls/non-nulls), sum, max/min/range • The grouping method (hash table or pre-sort) is a performance issue
  • 210. Grouping Methods • Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed – Doesn't require sorted data – Good when the number of unique groups is small; the running tally for each group's aggregate calculations needs to fit easily into memory; requires about 1 KB of RAM per group – Example: average family income by state requires roughly 50 groups x 1 KB = 0.05 MB of RAM • Sort: results for only a single aggregation group are kept in memory; when a new group is seen (the key value changes), the current group is written out – Requires input sorted by the grouping keys – Can handle unlimited numbers of groups – Example: average daily balance by credit card
  • 211. Aggregator Functions • Sum • Min, max • Mean • Missing value count • Non-missing value count • Percent coefficient of variation
  • 214. Containers • Two varieties – Local – Shared • Local – Simplifies a large, complex diagram • Shared – Creates reusable object that many jobs can include
  • 215. Creating a Container • Create a job • Select (loop) portions to containerize • Edit > Construct container > local or shared
  • 216. Configuration File Concepts • Determines the processing nodes and the disk space connected to each node • When the system changes, you need only change the configuration file – no need to recompile jobs • When a DataStage job runs, the platform reads the configuration file – The platform automatically scales the application to fit the system
  • 217. Processing Nodes Are • Locations on which the framework runs applications • Logical rather than physical construct • Do not necessarily correspond to the number of CPUs in your system – Typically one node for two CPUs • Can define one processing node for multiple physical nodes or multiple processing nodes for one physical node
  • 218. Optimizing Parallelism • Degree of parallelism determined by number of nodes defined • Parallelism should be optimized, not maximized – Increasing parallelism distributes work load but also increases Framework overhead • Hardware influences degree of parallelism possible • System hardware partially determines configuration
  • 219. More Factors to Consider • Communication amongst operators – Should be optimized by your configuration – Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or high-speed link • SMP – leave some processors for operating system • Desirable to equalize partitioning of data • Use an experimental approach – Start with small data sets – Try different parallelism while scaling up data set sizes
  • 220. Configuration File • Text file containing string data that is passed to the Framework – Sits on the server side – Can be displayed and edited • Name and location are found in the environment variable APT_CONFIG_FILE • Components – Node – Fast name – Pools – Resource
  • 221. Node Options • Node name – name of a processing node used by EE – Typically the network name – Use the command uname -n to obtain the network name • Fastname – Name of the node as referred to by the fastest network in the system – Operators use the physical node name to open connections – NOTE: on an SMP, all CPUs share a single connection to the network • Pools – Names of pools to which this node is assigned – Used to logically group nodes – Can also be used to group resources • Resource – Disk – Scratchdisk
  • 222. Sample Configuration File { node "Node1" { fastname "BlackHole" pools "" "node1" resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools "" } resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools "" } } }
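For comparison, a hedged sketch of what a two-node version of the same file might look like (the host name and paths are placeholders); each additional node entry gives parallel stages one more partition without recompiling any job:

    {
      node "node1" {
        fastname "server1"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
      node "node2" {
        fastname "server1"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
    }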
  • 223. Disk Pools • Disk pools allocate storage • By default, EE uses the default pool, specified by "" • A resource can also be assigned to a named pool, e.g., pool "bigdata"
  • 224. Sorting Requirements Resource pools can also be specified for sorting: • The Sort stage looks first for scratch disk resources in a “sort” pool, and then in the default disk pool
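A hedged config-file fragment illustrating the idea (the path is a placeholder): adding a scratchdisk resource to the "sort" pool, in addition to the default pool, gives the Sort stage a dedicated spill area:

    resource scratchdisk "/fastdisk/sort_scratch" {pools "" "sort"}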
  • 225. Resource Types • Disk • Scratchdisk • DB2 • Oracle • Saswork • Sortwork • Can exist in a pool – Groups resources together
  • 226. Using Different Configurations Lookup stage where DBMS is using a sparse lookup type
  • 227. Building a Configuration File • Scoping the hardware: – Is the hardware configuration SMP, Cluster, or MPP? – Define each node structure (an SMP would be single node): • Number of CPUs • CPU speed • Available memory • Available page/swap space • Connectivity (network/back-panel speed) – Is the machine dedicated to EE? If not, what other applications are running on it? – Get a breakdown of the resource usage (vmstat, mpstat, iostat) – Are there other configuration restrictions? E.g. DB only runs on certain nodes and ETL cannot run on them?
  • 228. Wrappers vs. Buildop vs. Custom • Wrappers are good if you cannot or do not want to modify the application and performance is not critical. • Buildops: good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces. • Custom (C++ coding using framework API): good if you need custom coding and need dynamic input and output interfaces.
  • 229. Building "Wrapped" Stages • You can "wrapper" a legacy executable: • Binary • Unix command • Shell script … and turn it into an Enterprise Edition stage capable, among other things, of parallel execution… • As long as the legacy executable is: • amenable to data-partition parallelism » no dependencies between rows • pipe-safe » can read rows sequentially » no random access to data
  • 230. Wrappers (Cont’d) Wrappers are treated as a black box • EE has no knowledge of contents • EE has no means of managing anything that occurs inside the wrapper • EE only knows how to export data to and import data from the wrapper • User must know at design time the intended behavior of the wrapper and its schema interface • If the wrappered application needs to see all records prior to processing, it cannot run in parallel.
  • 231. LS Example • Can this command be wrappered?
  • 232. Creating a Wrapper Used in this job --- To create the “ls” stage
  • 233. Creating Wrapped Stages • From Manager: right-click on Stage Type > New Parallel Stage > Wrapped • We will "wrapper" an existing Unix executable – the ls command – as our wrapper starting point
  • 234. Wrapper - General Page Unix command to be wrapped Name of stage
  • 235. Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others. The "Creator" Page
  • 236. Wrapper – Properties Page • If your stage will have properties, complete the Properties page • This will be the name of the property as it appears in your stage
  • 237. Wrapper - Wrapped Page Interfaces – input and output columns - these should first be entered into the table definitions meta data (DS Manager); let’s do that now.
  • 238. Interface Schemas • Layout interfaces describe what columns the stage: – Needs for its inputs (if any) – Creates for its outputs (if any) • These should be created as table definitions with columns in Manager
  • 239. Column Definition for Wrapper Interface
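As a hedged illustration (the column name and length are ours, not taken from the course files), the output interface for the ls wrapper can be a single string column that receives one file name per row from the command's stdout; expressed in Orchestrate schema form it would look roughly like:

    record
    (
      file_name: string[max=255];
    )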
  • 240. How Does the Wrapping Work? – Define the schema for export and import • Schemas become interface schemas of the operator and allow for by-name column access (diagram: input schema -> export -> stdin or named pipe -> UNIX executable -> stdout or named pipe -> import -> output schema)
  • 241. Update the Wrapper Interfaces • This wrapper will have no input interface – i.e. no input link. The location will come as a job parameter that will be passed to the appropriate stage property. Therefore, only the Output tab entry is needed.
  • 243. Job Run • Show file from Designer palette
  • 244. Wrapper Story: Cobol Application • Hardware Environment: – IBM SP2, 2 nodes with 4 CPU’s per node. • Software: – DB2/EEE, COBOL, EE • Original COBOL Application: – Extracted source table, performed lookup against table in DB2, and Loaded results to target table. – 4 hours 20 minutes sequential execution • Enterprise Edition Solution: – Used EE to perform Parallel DB2 Extracts and Loads – Used EE to execute COBOL application in Parallel – EE Framework handled data transfer between DB2/EEE and COBOL application – 30 minutes 8-way parallel execution
  • 245. Buildops • Buildop provides a simple means of extending beyond the functionality provided by EE, but, unlike the wrapper, does not use an existing executable • Reasons to use Buildop include: • Speed / Performance • Complex business logic that cannot be easily represented using existing stages – Lookups across a range of values – Surrogate key generation – Rolling aggregates • Build once and reuse everywhere within the project; no shared container necessary • Can combine functionality from different stages into one
  • 246. BuildOps – The DataStage programmer encapsulates the business logic – The Enterprise Edition interface called "buildop" automatically performs the tedious, error-prone tasks: it includes the needed header files and builds the necessary "plumbing" for correct and efficient parallel execution – Exploits the extensibility of the EE Framework
  • 247. From Manager (or Designer): Repository pane: Right-Click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped} • "Build" stages from within Enterprise Edition • "Wrapping” existing “Unix” executables BuildOp Process Overview
  • 248. General Page Identical to Wrappers, except: Under the Build Tab, your program!
  • 249. Logic Tab for Business Logic • Enter business C/C++ logic and arithmetic in four pages under the Logic tab • The main code section goes in the Per-Record page – it will be applied to all rows • NOTE: Code needs to be ANSI C/C++ compliant. If code does not compile outside of EE, it won't compile within EE either!
  • 250. Code Sections under the Logic Tab • Definitions page: temporary variables are declared [and initialized] here • Pre-Loop page: logic here is executed once BEFORE processing the FIRST row • Per-Record page: the main code, applied to every row (see the previous slide) • Post-Loop page: logic here is executed once AFTER processing the LAST row
  • 251. I/O and Transfer • Under the Interface tab: Input, Output & Transfer pages • Input page: 'Auto Read' reads the next row; points to an in-repository Table Definition • Output page: the first line is output 0; optional renaming of the output port from the default "out0"; 'Write row' writes the row • A 'False' setting is used so as not to interfere with the Transfer page
  • 252. I/O and Transfer • Transfer page: transfers all columns from input to output (first line: transfer of index 0) • If the page is left blank or Auto Transfer = "False" (and RCP = "False"), only the columns in the output Table Definition are written
  • 253. BuildOp Simple Example • Example - sumNoTransfer – Add input columns "a" and "b"; ignore other columns that might be present in the input – Produce a new "sum" column – Do not transfer input columns (diagram: sumNoTransfer stage; input columns a:int32, b:int32; output column sum:int32)
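A minimal sketch of what the Logic tab pages for sumNoTransfer might contain; "a", "b", and "sum" are the interface columns shown above, while the row counter is an illustrative extra, not part of the course example:

    // Definitions page: temporary C++ variables
    int rowCount;

    // Pre-Loop page: executed once before the first row
    rowCount = 0;

    // Per-Record page: executed for every row; interface columns are used by name
    sum = a + b;
    rowCount++;

    // Post-Loop page: executed once after the last row (nothing needed here)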
  • 254. NO TRANSFER - RCP set to "False" in stage definition and - Transfer page left blank, or Auto Transfer = "False" • Effects: - input columns "a" and "b" are not transferred - only new column "sum" is transferred From Peek: No Transfer
  • 255. Transfer TRANSFER - RCP set to "True" in stage definition or - Auto Transfer set to "True" • Effects: - new column "sum" is transferred, as well as - input columns "a" and "b" and - input column "ignored" (present in input, but not mentioned in stage)
  • 256. Columns vs. Temporary C++ Variables Columns • DS-EE type • Defined in Table Definitions • Value refreshed from row to row Temp C++ variables • C/C++ type • Need declaration (in Definitions or Pre-Loop page) • Value persistent throughout "loop" over rows, unless modified in code
  • 257. Custom Stage • Reasons for a custom stage: – Add EE operator not already in DataStage EE – Build your own Operator and add to DataStage EE • Use EE API • Use Custom Stage to add new operator to EE canvas
  • 258. Custom Stage DataStage Manager > select Stage Types branch > right click
  • 259. Custom Stage Name of Orchestrate operator to be used Number of input and output links allowed
  • 260. Custom Stage – Properties Tab
  • 262. Establishing Meta Data • Data definitions – Recordization and columnization – Fields have properties that can be set at the individual field level • Data types in the GUI are translated to types used by EE – Described as properties on the format/columns tab (Outputs or Inputs pages) OR – Using a schema file (can be full or partial) • Schemas – Can be imported into Manager – Can be pointed to by some job stages (e.g., Sequential)
  • 263. Data Formatting – Record Level • Format tab • Meta data described on a record basis • Record level properties
  • 264. Data Formatting – Column Level • Defaults for all columns
  • 265. Column Overrides • Edit row from within the columns tab • Set individual column properties
  • 267. Extended Properties – String Type • Note the ability to convert ASCII to EBCDIC
  • 269. Schema • Alternative way to specify column definitions for data used in EE jobs • Written in a plain text file • Can be written as a partial record definition • Can be imported into the DataStage repository
  • 270. Creating a Schema • Using a text editor – Follow correct syntax for definitions – OR • Import from an existing data set or file set – On DataStage Manager import > Table Definitions > Orchestrate Schema Definitions – Select checkbox for a file with .fs or .ds
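A hedged example of what such a plain-text schema file might contain (the field names and types are illustrative only); the syntax follows the record(...) form used by Orchestrate schemas:

    record
    (
      CustID: int32;
      Name: string[max=30];
      Balance: decimal[10,2];
      OpenDate: date;
    )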
  • 271. Importing a Schema Schema location can be on the server or local work station
  • 272. Data Types • Date • Decimal • Floating point • Integer • String • Time • Timestamp • Vector • Subrecord • Raw • Tagged
  • 273. Runtime Column Propagation • DataStage EE is flexible about meta data. It can cope with the situation where meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP). • RCP is always on at runtime. • At design and compile time, column mapping enforcement applies: – RCP is off by default – Enable it first at the project level (Administrator project properties) – Enable it at the job level (job properties General tab) – Enable it at the stage level (link Output Column tab)
  • 274. Enabling RCP at Project Level
  • 275. Enabling RCP at Job Level
  • 276. Enabling RCP at Stage Level • Go to output link’s columns tab • For transformer you can find the output links columns tab by first going to stage properties
  • 277. Using RCP with Sequential Stages • To utilize runtime column propagation in the sequential stage you must use the “use schema” option • Stages with this restriction: – Sequential – File Set – External Source – External Target
  • 278. Runtime Column Propagation • When RCP is Disabled – DataStage Designer will enforce Stage Input Column to Output Column mappings. – At job compile time modify operators are inserted on output links in the generated osh.
  • 279. Runtime Column Propagation • When RCP is enabled – DataStage Designer will not enforce mapping rules – No Modify operator is inserted at compile time – Danger of a runtime error if incoming column names do not match outgoing column names – column names are case sensitive
  • 280. Job Control Options • Manually write job control – Code is generated in BASIC – Use the Job control tab on the job properties page – Generates BASIC code which you can modify • Job Sequencer – Build a controlling job much the same way you build other jobs – Comprised of stages and links – No BASIC coding
  • 281. Job Sequencer • Built like a regular job • Type “Job Sequence” • Has stages and links • The Job Activity stage represents a DataStage job • Links represent passing control between stages
  • 283. Job Activity Properties Job parameters to be passed Job to be executed – select from dropdown
  • 284. Job Activity Trigger • Trigger appears as a link in the diagram • Custom options let you define the code
  • 285. Options • Use the custom option for conditionals – Execute if the job ran successfully or finished with warnings only • Can add a “wait for file” activity before executing • Add an “execute command” stage to drop real tables and rename new tables to the current tables
  • 286. Job Activity With Multiple Links Different links having different triggers
  • 287. Sequencer Stage • Build job sequencer to control job for the collections application Can be set to all or any
  • 290. Sample DataStage log from Mail Notification
  • 297. The Director Typical Job Log Messages: • Environment variables • Configuration File information • Framework Info/Warning/Error messages • Output from the Peek Stage • Additional info with "Reporting" environments • Tracing/Debug output – Must compile job in trace mode – Adds overhead
  • 298. Job-Level Environment Variables • Set in Job Properties, from the menu bar of Designer • Director will prompt you for them before each run
  • 299. Troubleshooting • If you get an error during compile, check the following: • Compilation problems – If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH – If there are Buildop errors, try buildop from the command line – Some stages may not support RCP – this can cause a column mismatch – Use the Show Error and More buttons – Examine the generated OSH – Check environment variable settings • Very little integrity checking is done during compile, so run Validate from Director, which highlights the source of the error
  • 300. Generating Test Data • Row Generator stage can be used – Column definitions – Data type dependent • Row Generator plus lookup stages provides good way to create robust test data from pattern files
  • 301. Thank You!!! For more information click the link below: http://vibranttechnologies.co.in/datastage-classes-in-mumbai.html

Editor's Notes

  1. DataStage is a comprehensive tool for the fast, easy creation and maintenance of data marts and data warehouses. It provides the tools you need to build, manage, and expand them. With DataStage, you can build solutions faster and give users access to the data and reports they need. With DataStage you can: ·        Design the jobs that extract, integrate, aggregate, load, and transform the data for your data warehouse or data mart. ·        Create and reuse metadata and job components. ·        Run, monitor, and schedule these jobs. ·        Administer your development and execution environments.
  2. The DataStage client components are: Administrator Administers DataStage projects and conducts housekeeping on the server Designer Creates DataStage jobs that are compiled into executable programs Director Used to run and monitor the DataStage jobs Manager Allows you to view and edit the contents of the repository
  3. Use the Administrator to specify general server defaults, add and delete projects, and to set project properties. The Administrator also provides a command interface to the UniVerse repository. ·        Use the Administrator Project Properties window to: ·        Set job monitoring limits and other Director defaults on the General tab. ·        Set user group privileges on the Permissions tab. ·        Enable or disable server-side tracing on the Tracing tab. ·        Specify a user name and password for scheduling jobs on the Schedule tab. ·        Specify hashed file stage read and write cache sizes on the Tunables tab.
  4. Use the Manager to store and manage reusable metadata for the jobs you define in the Designer. This metadata includes table and file layouts and routines for transforming extracted data. Manager is also the primary interface to the DataStage repository. In addition to table and file layouts, it displays the routines, transforms, and jobs that are defined in the project. Custom routines and transforms can also be created in Manager.
  5. The DataStage Designer allows you to use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating and loading data into warehouse tables. The Designer provides a “visual data flow” method to easily interconnect and configure reusable components.
  6. Use the Director to validate, run, schedule, and monitor your DataStage jobs. You can also gather statistics as the job runs.
  7. ·        Define your project’s properties: Administrator ·        Open (attach to) your project ·        Import metadata that defines the format of data stores your jobs will read from or write to: Manager ·        Design the job: Designer -        Define data extractions (reads) -        Define data flows -        Define data integration -        Define data transformations -        Define data constraints -        Define data loads (writes) -        Define data aggregations ·        Compile and debug the job: Designer ·        Run and monitor the job: Director
  8. All your work is done in a DataStage project. Before you can do anything, other than some general administration, you must open (attach to) a project. Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator. A project is associated with a directory. The project directory is used by DataStage to store your jobs and other DataStage objects and metadata. You must open (attach to) a project before you can do any work in it. Projects are self-contained. Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them. Multiple users can be working in the same project at the same time. However, DataStage will prevent multiple users from accessing the same job at the same time.
  9. Recall from module 1: In DataStage all development work is done within a project. Projects are created during installation and after installation using Administrator. Each project is associated with a directory. The directory stores the objects (jobs, metadata, custom routines, etc.) created in the project. Before you can work in a project you must attach to it (open it). You can set the default properties of a project using DataStage Administrator.
  10. The logon screen for Administrator does not provide the option to select a specific project (unlike the other DataStage clients).
  11. The Licensing Tab is used to change DataStage license information.
  12. Click Properties on the DataStage Administration window to open the Project Properties window. There are nine tabs. (The Mainframe tab is only enabled if your license supports mainframe jobs.) The default is the General tab. If you select the Enable job administration in Director box, you can perform some administrative functions in Director without opening Administrator. When a job is run in Director, events are logged describing the progress of the job. For example, events are logged when a job starts, when it stops, and when it aborts. The number of logged events can grow very large. The Auto-purge of job log box tab allows you to specify conditions for purging these events. You can limit the logged events either by number of days or number of job runs.
  13. Use this page to set user group permissions for accessing and using DataStage. All DataStage users must belong to a recognized user role before they can log on to DataStage. This helps to prevent unauthorized access to DataStage projects. There are three roles of DataStage user: · DataStage Developer, who has full access to all areas of a DataStage project. · DataStage Operator, who can run and manage released DataStage jobs. · <None>, who does not have permission to log on to DataStage. UNIX note: In UNIX, the groups displayed are defined in /etc/group.
  14. This tab is used to enable and disable server-side tracing. The default is for server-side tracing to be disabled. When you enable it, information about server activity is recorded for any clients that subsequently attach to the project. This information is written to trace files. Users with in-depth knowledge of the system software can use it to help identify the cause of a client problem. If tracing is enabled, users receive a warning message whenever they invoke a DataStage client. Warning: Tracing causes a lot of server system overhead. This should only be used to diagnose serious problems.
  15. On the Tunables tab, you can specify the sizes of the memory caches used when reading rows in hashed files and when writing rows to hashed files. Hashed files are mainly used for lookups and are discussed in a later module.
  16. You should enable OSH for viewing – OSH is generated when you compile a job.
  17. Metadata is “data about data” that describes the formats of sources and targets. This includes general format information such as whether the record columns are delimited and, if so, the delimiting character. It also includes the specific column definitions.
  18. DataStage Manager is a graphical tool for managing the contents of your DataStage project repository, which contains metadata and other DataStage components such as jobs and routines. The left pane contains the project tree. There are seven main branches, but you can create subfolders under each. Select a folder in the project tree to display its contents.
  19. Any set of DataStage objects, including whole projects, which are stored in the Manager Repository, can be exported to a file. This export file can then be imported back into DataStage. Import and export can be used for many purposes, including: ·        Backing up jobs and projects. ·        Maintaining different versions of a job or project. ·        Moving DataStage objects from one project to another. Just export the objects, move to the other project, then re-import them into the new project. ·        Sharing jobs and projects between developers. The export files, when zipped, are small and can be easily emailed from one developer to another.
  20. Click Export > DataStage Components in Manager to begin the export process. Any object in Manager can be exported to a file. Use this procedure to backup your work or to move DataStage objects from one project to another. Select the types of components to export. You can select either the whole project or select a portion of the objects in the project. Specify the name and path of the file to export to. By default, objects are exported to a text file in a special format. By default, the extension is dsx. Alternatively, you can export the objects to an XML document. The directory you export to is on the DataStage client, not the server.
  21. To import DataStage components, click Import > DataStage Components. Select the file to import. Click Import all to begin the import process or Import selected to view a list of the objects in the import file. You can import selected objects from the list. Select the Overwrite without query button to overwrite objects with the same name without warning.
  22. Table definitions define the formats of a variety of data files and tables. These definitions can then be used and reused in your jobs to specify the formats of data stores. For example, you can import the format and column definitions of the Customers.txt file. You can then load this into the sequential source stage of a job that extracts data from the Customers.txt file. You can load this same metadata into other stages that access data with the same format. In this sense the metadata is reusable. It can be used with any file or data store with the same format. If the column definitions are similar to what you need you can modify the definitions and save the table definition under a new name. You can import and define several different kinds of table definitions including: Sequential files and ODBC data sources.
  23. To start the import, click Import > Table Definitions > Sequential File Definitions. The Import Meta Data (Sequential) window is displayed. Select the directory containing the sequential files. The Files box is then populated with the files you can import. Select the file to import. Select or specify a category (folder) to import into. · The format is: <Category>\<Sub-category> · <Category> is the first-level sub-folder under Table Definitions. · <Sub-category> is (or becomes) a sub-folder under the type.
  24. In Manager, select the category (folder) that contains the table definition. Double-click the table definition to open the Table Definition window. Click the Columns tab to view and modify any column definitions. Select the Format tab to edit the file format specification.
  25. A job is an executable DataStage program. In DataStage, you can design and run jobs that perform many useful data integration tasks, including data extraction, data conversion, data aggregation, data loading, etc. DataStage jobs are: ·        Designed and built in Designer. ·        Scheduled, invoked, and monitored in Director. ·        Executed under the control of DataStage.
  26. In this module, you will go through the whole process with a simple job, except for the first bullet. In this module you will manually define the metadata.
  27. The appearance of the designer work space is configurable; the graphic shown here is only one example of how you might arrange components. In the right center is the Designer canvas, where you create stages and links. On the left is the Repository window, which displays the branches in Manager. Items in Manager, such as jobs and table definitions, can be dragged to the canvas area. Click View > Repository to display the Repository window.
  28. The tool palette contains icons that represent the components you can add to your job design. You can also install additional stages called plug-ins for special purposes.
  29. Several types of DataStage jobs: Server – not covered in this course. However, you can create server jobs, convert them to a container, then use this container in a parallel job. However, this has negative performance implications. Shared container (parallel or server) – contains reusable components that can be used by other jobs. Mainframe – DataStage 390, which generates Cobol code Parallel – this course will concentrate on parallel jobs. Job Sequence – used to create jobs that control execution of other jobs.
  30. The tools palette may be shown as a floating dock or placed along a border. Alternatively, it may be hidden and the developer may choose to pull needed stages from the repository onto the design work area.
  31. Meta data may be dragged from the repository and dropped on a link.
  32. Any required properties that are not completed will appear in red. You are defining the format of the data flowing out of the stage, that is, to the output link. Define the output link listed in the Output name box. You are defining the file from which the job will read. If the file doesn’t exist, you will get an error at run time. On the Format tab, you specify a format for the source file. You will be able to view its data using the View data button. Think of a link as like a pipe. What flows in one end flows out the other end (at the transformer stage).
  33. Defining a sequential target stage is similar to defining a sequential source stage. You are defining the format of the data flowing into the stage, that is, from the input links. Define each input link listed in the Input name box. You are defining the file the job will write to. If the file doesn’t exist, it will be created. Specify whether to overwrite or append the data in the Update action set of buttons. On the Format tab, you can specify a different format for the target file than you specified for the source file. If the target file doesn’t exist, you will not (of course!) be able to view its data until after the job runs. If you click the View data button, DataStage will return a “Failed to open …” error. The column definitions you defined in the source stage for a given (output) link will appear already defined in the target stage for the corresponding (input) link. Think of a link as like a pipe. What flows in one end flows out the other end. The format going in is the same as the format going out.
  34. In the Transformer stage you can specify: ·        Column mappings ·        Derivations ·        Constraints A column mapping maps an input column to an output column. Values are passed directly from the input column to the output column. Derivations calculate the values to go into output columns based on values in zero or more input columns. Constraints specify the conditions under which incoming rows will be written to output links.
  35. There are two: transformer and basic transformer. Both look the same but access different routines and functions. Notice the following elements of the transformer: The top, left pane displays the columns of the input links. The top, right pane displays the contents of the stage variables. The lower, right pane displays the contents of the output link. Unresolved column mapping will show the output in red. For now, ignore the Stage Variables window in the top, right pane. This will be discussed in a later module. The bottom area shows the column definitions (metadata) for the input and output links.
  36. Stage variables are used for a variety of purposes: Counters Temporary registers for derivations Controls for constraints
  37. Two versions of the annotation stage are available: Annotation Annotation description The difference will be evident on the following slides.
  38. You can type in whatever you want; the default text comes from the short description of the jobs properties you entered, if any. Add one or more Annotation stages to the canvas to document your job. An Annotation stage works like a text box with various formatting options. You can optionally show or hide the Annotation stages by pressing a button on the toolbar. There are two Annotation stages. The Description Annotation stage is discussed in a later slide.
  39. Before you can run your job, you must compile it. To compile it, click File > Compile or click the Compile button on the toolbar. The Compile Job window displays the status of the compile. A compile will generate OSH.
  40. If an error occurs: Click Show Error to identify the stage where the error occurred. This will highlight the stage in error. Click More to retrieve more information about the error. This can be lengthy for parallel jobs.
  41. As you know, you run your jobs in Director. You can open Director from within Designer by clicking Tools > Run Director. In a similar way, you can move between Director, Manager, and Designer. There are two methods for running a job: · Run it immediately. · Schedule it to run at a later time or date. To run a job immediately: · Select the job in the Job Status view. The job must have been compiled. · Click Job > Run Now or click the Run Now button in the toolbar. The Job Run Options window is displayed.
  42. This shows the Director Status view. To run a job, select it and then click Job > Run Now. Better yet: shift to the log view from the main Director screen, then click the green arrow to execute the job.
  43. The Job Run Options window is displayed when you click Job > Run Now. This window allows you to stop the job after: · A certain number of rows. · A certain number of warning messages. You can validate your job before you run it. Validation performs some checks that are necessary in order for your job to run successfully. These include: · Verifying that connections to data sources can be made. · Verifying that files can be opened. · Verifying that SQL statements used to select data can be prepared. Click Run to run the job after it is validated. The Status column displays the status of the job run.
  44. Click the Log button in the toolbar to view the job log. The job log records events that occur during the execution of a job. These events include control events, such as the starting, finishing, and aborting of a job; informational messages; warning messages; error messages; and program-generated messages.
  45. A typical DataStage workflow consists of: Setting up the project in Administrator Including metadata via Manager Building and assembling the job in Designer Executing and testing the job in Director.
  46. Change licensing, if appropriate. Timeout period should be set to large number or choose “do not timeout” option.
  47. Available functions: Add or delete projects. Set project defaults (properties button). Cleanup – perform repository functions. Command – perform queries against the repository.
  48. Recommendations: Check enable job administration in Director Check enable runtime column propagation May check auto purge of jobs to manage messages in director log
  49. You will see different environment variables depending on which category is selected.
  50. Reading OSH will be covered in a later module. Since DataStage Enterprise Edition writes OSH, you will want to check this option.
  51. To attach to the DataStage Manager client, one first enters through the logon screen. Logons can be either by DNS name or IP address. Once logged onto Manager, users can import meta data; export all or portions of the project, or import components from another project’s export. Functions: Backup project Export Import Import meta data Table definitions Sequential file definitions Can be imported from metabrokers Register/create new stages
  52. DataStage objects can now be pushed from DataStage to MetaStage.
  53. Job design process: Determine data flow Import supporting meta data Use designer workspace to create visual representation of job Define properties for all stages Compile Execute
  54. The DataStage GUI now generates OSH when a job is compiled. This OSH is then executed by the Enterprise Edition engine.
  55. Messages from previous runs are kept in a different color from the current run.
  56. In Designer, View > Customize palette. This window will allow you to move icons into your Favorites folder plus many other customization features.
  57. The row generator and peek stages are especially useful during development to generate test data and display data in the message log.
  58. Depending on the type of data, you can set values for each column in the row generator.
  59. The peek stage will display column values in a job's output messages log.
  60. EE takes advantage of the machine's hardware architecture -- this can be changed at runtime.
  61. DataStage Enterprise Edition can take advantage of multiple processing nodes to instantiate multiple instances of a DataStage job.
  62. You can describe an MPP as a bunch of connected SMPs.
  63. A typical SMP machine has multiple CPUs that share both disks and memory.
  64. The traditional data processing paradigm involves dropping data to disk many times throughout a processing run.
  65. On the other hand, the parallel processing paradigm rarely drops data to disk unless necessary for business reasons -- such as backup and recovery.
  66. Data may actually be partitioned in several ways -- range partitioning is only one example. We will explore others later.
  67. Pipelining and partitioning can be combined together to provide a powerful parallel processing paradigm.
  68. In addition, data can change partitioning from stage to stage. This can either happen explicitly at the desire of a programmer or be performed implicitly by the engine.
  69. Enterprise Edition deals with several different types of data: file sets and data sets -- both in persistent and non-persistent forms.
  70. The Enterprise Edition engine was derived from DataStage and Orchestrate.
  71. Enterprise Edition is architecturally neutral -- it can run on SMPs, clusters, and MPPs. The configuration file determines how Enterprise Edition will treat the hardware.
  72. Much of the parallel processing paradigm is hidden from the programmer -- they simply designate process flow as shown in the upper portion of this diagram. Enterprise Edition, using the definitions in that configuration file, will actually execute UNIX processes that are partitioned and parallelized.
  73. Partitioners and collectors work in opposite directions -- however, they frequently appear together in job designs.
  74. Partitioners and collectors have no stage nor icons of their own. They live on input links of stages running in parallel (resp. sequentially). Link markings indicate their presence. S----------------->S (no marking) S----(fan out)--->P (partitioner) P----(fan in)---->S (collector) P----(box)------->P (no reshuffling: partitioner using "SAME" method) P----(bow tie)--->P (reshuffling: partitioner using another method) Collectors = inverse partitioners; they recollect rows from partitions into a single input stream to a sequential stage. They are responsible for some surprising behavior: the default (Auto) is "eager" to output rows and typically causes non-determinism: row order may vary from run to run with identical input.
  75. Several stages handle sequential data. Each stage has both advantages and differences from the other stages that handle sequential data. Sequential data can come in a variety of types -- including both fixed length and variable length.
  76. The DataStage sequential stage writes OSH – specifically the import and export Orchestrate operators. Q: Why import data into an Orchestrate data set? A: Partitioning works only with data sets. You must use data sets to distribute data to the multiple processing nodes of a parallel system. Every Orchestrate program has to perform some type of import operation, from: a flat file, COBOL data, an RDBMS, or a SAS data set. This section describes how to get your data into Orchestrate. Also talk about getting your data back out. Some people will be happy to leave data in Orchestrate data sets, while others require their results in a different format.
  77. Behind each parallel stage is one or more Orchestrate operators. Import and Export are both operators that deal with sequential data.
  78. When data is imported, the import operator translates that data into the Enterprise Edition internal format. The export operator performs the reverse action.
  79. Both export and import operators are generated by the sequential stage -- which one you get depends on whether the sequential stage is used as source or target.
  80. These two processes must work together to correctly interpret data -- that is, to break a data string down into records and columns.
  81. Fields (columns) are defined by delimiters. Similarly, records are defined by terminating characters.
  82. The DataStage GUI allows you to determine properties that will be used to read and write sequential files.
  83. Source stage Multiple output links - however, note that one of the links is represented by a broken line. This is a reject link, not to be confused with a stream link or a reference link. Target One input link
  84. If multiple links are present you'll need to down-click to see each.
  85. If specified individually, you can make a list of files that are unrelated in name. If you select “read method” and choose file pattern, you effectively select an undetermined number of files.
  86. To use multiple readers on a sequential file, the file must contain fixed-length records.
  87. DSEE needs to know: how a file is divided into rows, and how a row is divided into columns. Column properties set on this tab are defaults for each column; they can be overridden at the column level (from the columns tab).
  88. The sequential stage can have a single reject link. This is typically used when you are writing to a file and provides a location where records that have failed to be written to a file for some reason can be sent. When you are reading files, you can use a reject link as a destination for rows that do not match the expected column definitions.
  89. Number of raw data files depends on: the configuration file – more on configuration files later.
  90. The descriptor file shows both a record's metadata and the file's location. The location is determined by the configuration file.
  91. File sets, while yielding faster access than simple text files, are not in the Enterprise Edition internal format.
  92. The lookup file set is similar to the file set but also contains information about the key columns. These keys will be used later in lookups.
  93. Data sets represent persistent data maintained in the internal format.
  94. Accessed from/to disk with the DataSet Stage. Two parts: Descriptor file User-specified name Contains table definition ("unformatted core" schema) Here is the icon of the DataSet Stage, used to access persistent datasets Descriptor file, e.g., "input.ds" contains: Paths of data files Metadata: unformatted table definition, no formats (unformatted schema) Config file used to store the data Data file(s) Contain the data itself System-generated long file names, to avoid naming conflicts.
  95. In both cases the answer is Yes.
  96. Both dsrecords and orchadmin are Unix command-line utilities. The DataStage Designer GUI provides a mechanism to view and manage data sets.
  97. The data set management screen is available from Manager, Designer, and Director.
  98. RDBMS access is relatively easy because Orchestrate extracts the schema definition for the imported data set. Little or no work is required from the user.
  99. All columns from the input link will be placed on the rejects link. Therefore, no column tab is available for the rejects link.
  100. The mapping tab will show all columns from the input link and the reference link (less the column used for key lookup).
  101. If the lookup results in a non-match and the action was set to continue, the output column will be null.
  102. Each stage has input and output interface schemas, a partitioner and business logic. Interface schemas define the names and data types of the required fields of the component’s input and output Data Sets. The component’s input interface schema requires that an input Data Set have the named fields and data types exactly compatible with those specified by the interface schema for the input Data Set to be accepted by the component. A component ignores any extra fields in a Data Set, which allows the component to be used with any data set that has at least the input interface schema of the component. This property makes it possible to add and delete fields from a relational database table or from the Orchestrate Data Set without having to rewrite code inside the component. In the example shown here, Component has an interface schema that requires three fields with named fields and data types as shown in the example. In this example, the output schema for the component is the same as the input schema. This does not always have to be the case. The partitioner is key to Orchestrate’s ability to deliver parallelism and unlimited scalability. We’ll discuss exactly how the partitioners work in a few slides, but here it’s important to point out that partitioners are an integral part of Orchestrate components.
  103. There is one more point to be made about the DSEE execution model. DSEE achieves parallelism in two ways. We have already talked about partitioning the records and running multiple instances of each component to speed up program execution. In addition to this partition parallelism, Orchestrate is also executing pipeline parallelism. As shown in the picture on the left, as the Orchestrate program is executing, a producer component is feeding records to a consumer component without first writing the records to disk. Orchestrate is pipelining the records forward in the flow as they are being processed by each component. This means that the consumer component is processing records fed to it by the producer component before the producer has finished processing all of the records. Orchestrate provides block buffering between components so that producers cannot produce records faster than consumers can consume those records. This pipelining of records eliminates the need to store intermediate results to disk, which can provide significant performance advantages, particularly when operating against large volumes of data.
  104. Operators are the basic functional units of an Orchestrate application. Operators read records from input data sets, perform actions on the input records, and write results to output data sets. An operator may perform an action as simple as copying records from an input data set to an output data set without modification. Alternatively, an operator may modify a record by adding, removing, or modifying fields during execution.
  105. Partitioners and collectors have no stage nor icons of their own. They live on input links of stages running in parallel (resp. sequentially). Link markings indicate their presence. S----------------->S (no marking) S----(fan out)--->P (partitioner) P----(fan in)---->S (collector) P----(box)------->P (no reshuffling: partitioner using "SAME" method) P----(bow tie)--->P (reshuffling: partitioner using another method) Collectors = inverse partitioners; they recollect rows from partitions into a single input stream to a sequential stage
  106. Link naming conventions are important because they identify the appropriate links in the stage properties screen shown above. The screen has four quadrants: the incoming data link (one only); the outgoing links (there can be several); metadata for the incoming link; and metadata for all outgoing links, which may have multiple tabs if there are multiple outgoing links. Note the constraints bar: double-clicking anywhere on it opens the screen for defining constraints for all outgoing links.
  107. If you perform a lookup from a lookup stage and choose the continue option for a failed lookup, you have the possibility of nulls entering your data flow.
  108. There is no longer a need to use shared containers to get Universe functionality on the parallel palette. The BASIC Transformer is slow because records need to be exported by the framework to Universe functions and then imported back.
  109. One of the nation's largest direct marketing outfits has been using this simple program in DS-EE (and its previous instantiations) for years. Householding yields enormous savings by avoiding mailing the same material (in particular expensive catalogs) to the same household.
  110. A stable sort will not rearrange records that are already in a properly sorted data set. If Stable is set to False, no prior ordering of records is guaranteed to be preserved by the sorting operation.
  111. The Framework concept of a port number is represented in the GUI as Primary/Reference.
  112. Join follows the RDBMS-style relational model: the Join and Load operations commute, just as they do in an RDBMS. Duplicate key values produce cross-products, matching entries are reusable, and there is no fail/reject/drop option for missed matches.
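For example (hypothetical data, not from the slides): if the left input has two rows with key 10 and the right input has three rows with key 10, an inner join produces 2 x 3 = 6 output rows for that key, exactly as an RDBMS join would.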
  113. Contrary to Join, Lookup and Merge deal with missing rows. Obviously, a missing row cannot be captured, since it is missing; the closest thing one can capture is the corresponding unmatched row. Lookup can capture unmatched rows from the primary input (source) on a reject link. That is why it has only one reject link (there is only one primary). We'll see that the reject option is exactly the opposite with the Merge stage.
  114. Contrary to Lookup, Merge captures unmatched secondary (update) rows. Since there may be several update links, there may be several reject links.
  115. This table contains everything one needs to know to use the three stages.
  116. WARNING! Here, Hash has nothing to do with the Hash partitioner: it means that one hash table per group must be held in RAM. Likewise, Sort has nothing to do with the Sort stage: it simply means the stage expects sorted input.
  117. The hardware that makes up your system partially determines configuration. For example, applications with large memory requirements, such as sort operations, are best assigned to machines with a lot of memory. Applications that will access an RDBMS must run on its server nodes; operators using other proprietary software, such as SAS or SyncSort, must run on nodes with licenses for that software.
  118. Set of reserved node pool names: DB2, Oracle, Informix, SAS, Sort, SyncSort.
  119. For a single-node system, the node name is usually set to the value returned by the UNIX command uname -n. The fastname attribute is the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET. The fast name is the physical node name that operators use to open connections for high-volume data transfers; typically this is the principal node name as returned by uname -n.
  120. Recommendations: Each logical node defined in the configuration file that will run sorting operations should have its own sort disk. Each logical node's sorting disk should be a distinct disk drive or, if it is shared among nodes, a striped disk. In large sorting operations, each node that performs sorting should have multiple disks, where a sort disk is a scratch disk available for sorting that resides in either the sort or default disk pool.
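Pulling the node name, fastname, and scratch-disk attributes together, a minimal configuration file for a single-node system might look roughly like the sketch below; the host name, paths, and pool names are assumptions for illustration:
    {
      node "node1"
      {
        fastname "devhost"
        pools ""
        resource disk "/data/ds" {pools ""}
        resource scratchdisk "/scratch/ds" {pools "" "sort"}
      }
    }
A parallel configuration simply declares additional node entries in the same file; APT_CONFIG_FILE (discussed later) tells the job which file to use.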
  121. In this instance, since a sparse lookup is viewed as the bottleneck, the stage has been set to execute on multiple nodes.
  122. Example: wrapping the Unix ls command. Running ls /opt/AdvTrain/dnich would yield a list of files and subdirectories. The wrapper thus comprises the command and a parameter that contains a disk location.
  123. The Unix ls command can take several arguments, but we will use it in its simplest form: ls location, where location will be passed into the stage through a job parameter.
  124. The Interfaces > input and output describe the metadata for how you will communicate with the wrapped application.
  125. Answer: You must first EXIT the DS-EE environment to access the vanilla Unix environment. Then you must reenter the DS-EE environment.
  126. The four tabs are Definitions, Pre-Loop, Per-Record, and Post-Loop. The main action is in Per-Record. Definitions is used to declare and initialize variables; Pre-Loop holds code to be executed before the first row is processed; Post-Loop holds code to be executed after the last row is processed.
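As a minimal sketch of how the tabs work together (the variable name count and the output column index are assumptions used for illustration, chosen to match the quiz discussed a few slides later):
    Definitions:  int count;
    Pre-Loop:     count = 0;
    Per-Record:   index = count++;   // index is an output column; each row receives the next sequence number
    Post-Loop:    (nothing needed in this example)
Because the Per-Record code runs once for every input row, count increments across rows and index carries a running row number into the output.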
  127. This is the Output page. The Input page is the same, except it has an "Auto Read" column instead of the "Auto Write" column. The input/output interface TDs must be prepared in advance and put in the repository.
  128. The role of Transfer will be made clearer soon with examples.
  129. Only the column(s) explicitly listed in the output TD survive.
  130. All the columns in the input link are transferred, irrespective of what the Input/Output TDs say.
  131. ANSWER TO QUIZ: Replacing index = count++; with index++; would result in index = 1 throughout. See the bottom bullet in the left column.
  132. To view documentation on each of these properties, open a stage > Input or Output > Format. Now hover your cursor over the property in question and help text will appear.
  133. The format of each line describing a column is: column_name:[nullability]datatype
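For example, with hypothetical column names, schema lines following this format might read:
    custid:int32;
    name:nullable string[20];
where nullable is the optional nullability keyword and int32 and string[20] are the data types.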
  134. Raw: a collection of untyped bytes. Vector: an elemental array. Subrecord: a record within a record (elements of a group level). Tagged: a column linked to another column that defines its data type.
  135. What is runtime column propagation? Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can define only the columns you are interested in using in a job and ask DataStage to propagate the other columns through the various stages, so those columns can be extracted from the data source and end up in your data target without explicitly being operated on in between. Sequential files, unlike most other data sources, do not have inherent column definitions, so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema that describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. The stages that require a schema file are: Sequential File, File Set, External Source, External Target, Column Import, and Column Export.
  136. Modify operators can add or change columns in a data flow.
  137. Environment variables fall into broad categories, listed in the left pane. We'll see these categories one by one. All environment values listed in the ADMINISTRATOR are the project-wide defaults; they can be modified in DESIGNER per job, and again in DIRECTOR per run. The default values are reasonable ones, so there is no need for the beginning user to modify them, or even to know much about them, with one possible exception: APT_CONFIG_FILE (see next slide).
  138. Highlighted: APT_CONFIG_FILE, which contains the path (on the server) of the active configuration file. The main aspect of a given configuration file is the number of nodes it declares. In the labs we used two files: one with one node declared, for use in sequential execution, and one with two nodes declared, for use in parallel execution.
  139. The correct settings for these should be set at install. If you need to modify them, first check with your DBA.
  140. These are for the user to play with. They are easy: they take only TRUE/FALSE values and control the verbosity of the log file; the defaults are set for minimal verbosity. The top one, APT_DUMP_SCORE, is an old favorite: it tracks data sets, nodes, partitions, and combinations (all to be discussed soon). APT_RECORD_COUNTS helps you detect load imbalance. APT_PRINT_SCHEMAS shows the textual representation of the unformatted metadata at all stages. Online descriptions are available with the "Help" button.
  141. You need to have these right to use the Transformer and the Custom stages. Only these stages invoke the C++ compiler. The correct values are listed in the Release Notes.
  142. Project-wide environment values set in ADMINISTRATOR can be modified on a per-job basis in DESIGNER's Job Properties, and on a per-run basis in DIRECTOR. This provides great flexibility.