2. developerWorks®
ibm.com/developerWorks/
the repository is empty. Delta loads are regular (such as daily) data updates from source systems
into InfoSphere MDM Server.
There are two different approaches to loading data into InfoSphere MDM Server in batch.
The maintenance service batch approach loads data into InfoSphere MDM Server using the
maintenance services invoked by the Batch Processor. Alternatively, data can be loaded directly
into the database using DataStage jobs.
This article shares an IBM team's experience performing case studies focusing on the
Maintenance Transaction approach using InfoSphere MDM Server version 8.0.1.
The article starts with an introduction to MDM Server Maintenance Transactions. Then it goes
on to cover the basic installation and setup steps of the MDM Server environment, including
DB2® database server, WebSphere® Application Server, InfoSphere MDM Server, MDM Server
Maintenance Transactions, and batch processor. The article covers a high-level summary of key
performance results based on internal case studies. It concludes with a list of performance tuning
tips and best practices to get optimal performance while doing initial data load. Using this article,
you can leverage the IBM team's experience, and you can use recommendations as guidance in
your own InfoSphere MDM Server initial load solutions.
Introducing the MDM Server service batch approach
The MDM Server service batch approach loads data into MDM Server using maintenance
transactions that are invoked by the Batch Processor or by any other batch framework. Because MDM Server
services process the data during load, this approach provides the best level of business data
validation. You can use the same set of maintenance transactions for both initial and delta loads.
To create a setup that uses this option, you need an InfoSphere MDM Server installation that is
capable of running maintenance transactions. You also need to prepare the input data in a format
that the Batch Processor can consume.
What are maintenance transactions?
InfoSphere MDM Server creates a unique internal identifier for each record or business entity that
serves as its internal key. The regular InfoSphere MDM Server services expect the internal key to
be provided as part of the update service request, to ensure that services can identify the correct
business entity in the database. However, when data flows into InfoSphere MDM Server directly
from external applications such as legacy systems, the internal key is not known, and often the
nature of the data change is also not known.
Maintenance transactions address this problem. These transactions do not require the internal
key as part of the input. They also do not require the external system to specify whether this entity
needs to be added or updated in InfoSphere MDM Server. Instead of the internal key, maintenance
transactions expect the business key as part of the input, which is the unique identifier of the
business entity in external applications. Maintenance transactions use the business key provided
in the load operation to locate the correct instance of the business entity in the database. If an
existing entity is found, it is updated using the appropriate transaction, such as updateParty. If no
Loading a large volume of Master Data Management data
quickly: Using MDM Server maintenance services batch
Page 2 of 34
existing entity is found, a new entity is created in InfoSphere MDM Server using the appropriate
transaction, such as addParty.
There are many types of maintenance transactions, including maintainParty,
maintainPersonName, and maintainContractPlus. For a complete
list of the transactions and more details about them, refer to the
MDMRapidDeploymentPackage_CompositeMaintenanceServices.pdf document, available as part
of the EntryLevelMDM patch.
Maintenance transactions are not part of the default InfoSphere MDM Server 8.0.1 distribution and
installation. You need to obtain and install the EntryLevelMDM patch to use these transactions.
Note: Maintenance transactions are part of the default InfoSphere MDM Server 8.5 distribution. They
are provided with source code as part of the MDM Server Samples distribution archive. You need
to install them on top of an existing InfoSphere MDM Server 8.5 instance. See Resources for a link
to instructions. It's recommended that you get assets from the FTP site mentioned in the Get the
Installer section in this article to ensure you have the latest version.
Batch transaction processing
You can use maintenance transactions to load data using MDM Server Batch, or you can invoke
them like any other service exposed by MDM Server, using the RMI or JMS messaging
mechanisms. This article focuses on the batch invocation method. InfoSphere MDM Server
provides two ways to perform batch transaction processing. You can use either the J2SE
Batch processor framework or the WebSphere Application Server eXtended Deployment batch
framework. This article focuses on the first option: the J2SE Batch Processor framework.
The J2SE Batch processor framework is a J2SE client application, and it is part of a default
InfoSphere MDM Server installation. The batch processor is a multi-threaded application that can
process large volumes of batch data. It can process multiple records from the same batch input
simultaneously, increasing the throughput. Additionally, you can run multiple instances of the batch
processor simultaneously, each one processing a separate batch input and pointing to the same
server or to different servers.
Each batch record in the batch input flows through the batch processor in the following sequence:
1. The reader consumer reads the record from the batch input. The submitter consumer sends it
to the request/response framework for parsing and processing.
2. The parser transforms the input request into one or more business objects.
3. After passing through business proxy, business processing and persistence logic are applied
to the business objects.
4. The application responses are sent to the constructor in order to construct the desired batch
output response.
5. The constructed response is returned to the batch processor.
6. The writer records the transaction outcome in the writer log, if necessary. For example,
FailedWriter logs any failed messages.
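The flow above can be pictured with a small Python sketch. It is an illustration only and does not use any Batch Processor API: process_request is a hypothetical stand-in for the request/response framework (steps 2 to 5), and the thread pool plays the role of the submitters.

```python
from concurrent.futures import ThreadPoolExecutor


def process_request(line):
    """Hypothetical stand-in for parsing, business processing, and response construction."""
    if not line.strip():
        raise ValueError("empty request")
    return "<response>" + line.strip() + "</response>"


def submit(line):
    """Submitter: send one record for processing and capture the outcome."""
    try:
        return line, process_request(line), None
    except Exception as exc:  # a failed record must not stop the rest of the batch
        return line, None, exc


def run_batch(lines, submitters=4):
    """Reader -> submitters -> writers. Returns (responses, failed_records)."""
    responses, failed = [], []
    with ThreadPoolExecutor(max_workers=submitters) as pool:
        # pool.map returns results in input order while records run in parallel.
        for line, response, error in pool.map(submit, lines):
            if error is None:
                responses.append(response)
            else:
                failed.append(line)  # plays the role of the failed-message writer
    return responses, failed
```

For example, run_batch(open("input.xml"), submitters=8) would process each line of a hypothetical input.xml file as one request and collect the failed lines separately.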
The batch processor is shipped with pre-built readers and writers that can be used as is. The
default reader expects the batch input to be in an XML format in which each line contains one XML
request. The default writer writes the response in XML format. You can also use the InfoSphere
MDM Server batch processor to process batch files containing messages in SIF format.
If your input data is not in the format specified above, you need to convert it to the required
format, or use a customized reader and parser. It is possible to customize many of the components
of the Batch Processor, but customization is not within the scope of this article.
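Because the default reader expects one XML request per line, it can be useful to check an input file before submitting it. The following Python sketch is one possible pre-check and is not part of the product. It only tests that each line is a single well-formed XML document and does not validate the content against any MDM Server schema. Skipping blank lines is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def find_malformed_lines(lines):
    """Return the 1-based numbers of lines that are not one well-formed XML document."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # assumption: blank lines are skipped rather than reported
        try:
            ET.fromstring(text)
        except ET.ParseError:
            bad.append(number)
    return bad
```

A request that was accidentally split across two lines, or two requests joined on one line, is reported by line number so it can be corrected before the load starts.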
Understanding software and hardware requirements
The following is a typical system topology for InfoSphere MDM Server deployment using
QualityStage from Information Server for Standardization and Matching:
• Application Server and InfoSphere MDM Server are installed on one physical box or LPAR
with the correct CPU capacity (Server1). The number of CPUs depends on the overall
throughput requirements.
• The database server is installed on another physical box or LPAR (Server2) with well-equipped IO capacity.
• IIS Server should be installed either on the database server or on a third physical box or
LPAR (Server3) with adequate IO bandwidth.
• IIS Client is used to configure QS jobs, and it is installed on a Windows® computer.
To maximize performance for a given configuration, follow these general
guidelines:
• The ratio of the number of CPUs on the InfoSphere MDM Server box to the number on the DB
server can range from 2:1 to 3:1. For example, if you have a database server with 4 CPUs, the
recommended number of CPUs on the MDM Server box is at least 8, so that the CPU
capacity on the database server is well utilized.
• You should have 5 to 10 physical disk spindles available for each CPU on the database
server.
• The ratio of the number of CPUs on InfoSphere MDM Server and IIS server can range from
2:1 to 1:1. For example, if you have MDM Server with 8 CPUs, the recommended number of
CPUs on the IIS server box is between 4 and 8.
Note: You only need IIS server if you plan to use QualityStage for standardization and matching
(such as suspect processing). InfoSphere MDM Server default configuration does not use
QualityStage.
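The sizing guidelines above reduce to simple arithmetic. The following Python sketch applies the quoted ratios to a given number of database server CPUs. The function name and the returned dictionary keys are illustrative, and the IIS range is computed from the low end of the MDM Server range, which is a simplifying assumption.

```python
def recommend_cpus(db_cpus):
    """Apply the ratios quoted above to a database server with db_cpus CPUs.

    Each value is a (low, high) range.
    """
    mdm_low, mdm_high = 2 * db_cpus, 3 * db_cpus  # MDM:DB between 2:1 and 3:1
    return {
        "mdm_cpus": (mdm_low, mdm_high),
        "db_disk_spindles": (5 * db_cpus, 10 * db_cpus),  # 5 to 10 per DB CPU
        # MDM:IIS between 2:1 and 1:1, based on the low end of the MDM range
        "iis_cpus": (mdm_low // 2, mdm_low),
    }
```

For a 4-CPU database server this reproduces the examples in the guidelines: at least 8 CPUs for MDM Server, and between 4 and 8 CPUs for the IIS server.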
Exploring the example environment
This section briefly describes the example environment, including hardware and software
information, in each layer in the stack. It also describes the system topology used in the tests.
Software and hardware stack
• Server 1 (AppServer and InfoSphere MDM Server)
  • Hardware
    • Machine type: IBM 9116-561, PowerPC® POWER5™
    • CPUs: 8 core Power5 with 16 threads, 1.5GHz, 64 bit
    • Memory/IO: 32 GB RAM, 6 internal disks
  • Software
    • OS: AIX® Version 5300-06 (64 bit)
    • WebSphere® Application Server ND 6.1.0.11 (32 bit)
    • InfoSphere MDM Server 8.0.1 + EntryLevelMDM patch
• Server 2 (DB2® database Server)
  • Hardware
    • Machine type: IBM 9116-561, PowerPC POWER5
    • CPUs: 8 core Power5 with 16 threads, 1.5GHz, 64 bit
    • Memory/IO: 32 GB RAM, 6 internal disks + 40 external disks
  • Software
    • OS: AIX Version 5300-06 (64 bit)
    • DB2® database server v9.5 (64 bit)
• Server 3 (Information Server)
  • Hardware
    • Machine type: IBM 9116-561, PowerPC POWER5
    • CPUs: 8 core Power5 with 16 threads, 1.5GHz, 64 bit
    • Memory/IO: 32 GB RAM, 6 internal disks
  • Software
    • OS: AIX Version 5300-06 (64 bit)
    • IIS v8.0.1
• Server 4 (IIS Client - To configure QualityStage jobs, not needed while running the test)
  • Hardware
    • 32 bit x86 machine
  • Software
    • OS: Windows 2003 Server
    • IIS client version 8.0.1 for Windows
System topology
For InfoSphere MDM Server to use QualityStage jobs for standardization and matching, you need
Server3 and Server4, as shown in Figure 1. For default standardization and matching algorithms
from InfoSphere MDM Server, Server1 and Server2 are sufficient.
Figure 1. System topology
Installing the components
The purpose of this section is to show the high-level steps required to get the needed software
installed in the test environment. The steps focus on the Maintenance services-related steps, while
briefly mentioning the prerequisite software installation, including WebSphere® Application Server,
DB2 database server, InfoSphere MDM Server, and InfoSphere Information Server.
Installation prerequisites
The prerequisite installations include WebSphere Application Server, DB2 database server, and
InfoSphere Information Server. For installation instructions, see each product's Information Center
in Resources.
1. On Server1, install IBM WebSphere Application Server Network Deployment, Version 6.1, and
upgrade it with Fixpack 11.
2. On Server2, install DB2 Database Server, Version 9.5.
3. On Server3, install IIS Server, Version 8.0.1.
4. On Server4 (Windows machine), install IIS client.
InfoSphere MDM Server Installation
For InfoSphere MDM Server installation, see Resources for a link to the information center. You
can install it on a standalone WebSphere Application Server or on a WebSphere Application
Server cluster.
Installation of Entry Level MDM Server patch for maintenance services
Follow the steps in this section to apply the Entry Level MDM (ELMDM) Server patch, which
enables you to use maintenance transactions.
These instructions assume that you have already installed InfoSphere MDM Server and have
applied all the required fixpacks. These instructions are based on the software stack described in the
Software and hardware stack section.
Step 1. Get the installer.
Maintenance transactions are not part of the default installation of MDM Server, and they need
to be installed separately. If you have a service agreement with IBM, you can get the installer
for maintenance transactions by logging into the Secure File Transfer site and going to
https://testcase.boulder.ibm.com/www/prot/MDM_RDP/?T. At the time of writing, the latest installable
package is
https://testcase.boulder.ibm.com/www/prot/MDM_RDP/MDMServer801_RDP801/ELMDM-20090407.tar.gz.
Contact your IBM service representative if you need help getting this package.
For more instructions, see the chapter titled Installing Rapid Deployment Package for
MDM Server Maintenance transactions and MDM Customizations in the document
MDMRapidDeploymentPackage_InstallGuide.pdf. You can find this document under the directory
Docs when you uncompress the installer.
Step 2. Make required backups before installing.
The installer makes changes to the InfoSphere MDM Server Database. As a precaution, you
might want to make a backup of this database before running the installer. The installer creates
backup copies of files that it changes. These files are named *.beforeELMDM. However, they
get overwritten during subsequent installer runs. So before you invoke the installer again for any
reason, ensure you have moved the previous set of files to a safe place.
The files modified by the installer are:
• MDM Server home directory installable .ear file. For example,
/usr/IBM/MDM_801/installableApps/MDM.ear
• A set of files in the <MDM_Instance>.ear directory under WebSphere Application Server. For
example, /opt/IBM/WebSphere/AppServer/profiles/AppSrv1/installedApps/myHostCell01/MDM_801.ear/
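Because the *.beforeELMDM files are overwritten on the next installer run, one way to preserve them is to move them into a timestamped folder first. The following Python sketch is a possible helper, not part of the installer. It assumes that the backup files sit under a directory you pass in (for example, one of the locations listed above) and that the archive folder is outside that directory.

```python
import shutil
import time
from pathlib import Path


def archive_installer_backups(search_root, archive_root):
    """Move every *.beforeELMDM file under search_root into a new timestamped folder.

    Returns the list of destination paths.
    """
    search_root = Path(search_root)
    target = Path(archive_root) / time.strftime("beforeELMDM-%Y%m%d-%H%M%S")
    moved = []
    # sorted() collects all matches before any file is moved.
    for source in sorted(search_root.rglob("*.beforeELMDM")):
        # Keep the relative path so files with the same name do not collide.
        destination = target / source.relative_to(search_root)
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(destination))
        moved.append(destination)
    return moved
```

Run the helper once for each location before you invoke the installer again, and keep the returned paths in case you need to restore a file.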
Step 3. Prepare the installer.
Complete the following steps to prepare the installer.
a. Create a new base directory named setup.
b. Extract the installer (.tar.gz file) in this directory. It creates several directories, including one
named install.
c. Go to directory setup/install/DB2 database server.
d. Give execute permissions for all the scripts using the command chmod 755 *.sh
e. Connect to the InfoSphere MDM Server database and execute the SQL below. The schema
name is assumed to be mySchema.
Listing 1. SQL to execute
db2 "insert into mySchema.DataAssociation values
(25083715210700005,'a_name',current_timestamp,'a_description',null)"
Step 4. Customize a clustered environment.
This step is not required if your MDM Server is a standalone server. If you are installing ELMDM
on a Clustered MDM Server installation (MDM Server running on a cluster of WebSphere
Application Servers), make the following modifications in the scripts.
a. In setVariables.sh, add the line in Listing 2 at the beginning of the script. NAME_OF_SERVER
refers to the name of the WebSphere Application Server instance that is a member of the
cluster.
Listing 2. Added line
#add the line below
export SRV_NAME=NAME_OF_SERVER
b. In the scripts install_DisableHVL.sh, install_EnableHVL.sh, and install_ELPCustom.sh, make
the changes shown in Listing 3.
Listing 3. Changes to script files
#comment out the line below and replace with the new line as shown below
#$CURRENT/restartServer.sh $WAS_HOME $NODE_NAME $APP_NAME $ADMIN_USER $ADMIN_PASSWORD
#add the line below
$CURRENT/restartServer.sh $WAS_HOME $NODE_NAME $SRV_NAME $ADMIN_USER $ADMIN_PASSWORD
c. In the install_ELPTx.sh script, make the changes in Listing 4.
Listing 4. The install_ELPTx.sh script
#comment out the line below and replace with the new line as shown below
#$LOC/restartServer.sh $WAS_HOME $NODE_NAME $APP_NAME $ADMIN_USER $ADMIN_PASSWORD
#add the line below
$LOC/restartServer.sh $WAS_HOME $NODE_NAME $SRV_NAME $ADMIN_USER $ADMIN_PASSWORD
Step 5. Optionally modify the installer to help in debugging.
Complete the following steps to modify the installer to debug.
a. At the beginning of each script, add set -x
b. Add the verbose option to db2 calls by replacing all occurrences of db2 -tf with db2 -tvf in the
scripts below:
• runsql.sh
• install_ELPCustom.sh
• install_EnableHVL.sh
• install_DisableHVL.sh
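If you prefer not to edit the scripts by hand, both debugging changes can be applied as a text transformation. The following Python sketch is one way to do it and is not part of the installer. The function name is illustrative, and the function applies both changes to whatever script text it receives, so pass it only the scripts for which each change is intended.

```python
def add_debug_options(script_text):
    """Return script_text with 'set -x' added and 'db2 -tf' made verbose."""
    lines = script_text.splitlines()
    if "set -x" not in lines:
        # Keep a shebang line, if there is one, as the first line of the script.
        position = 1 if lines and lines[0].startswith("#!") else 0
        lines.insert(position, "set -x")
    text = "\n".join(lines) + "\n"
    return text.replace("db2 -tf", "db2 -tvf")
```

The transformation is safe to apply more than once: a script that already contains set -x and db2 -tvf is returned unchanged. Keep a copy of the original scripts before writing the modified text back.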
Step 6. Set your environment variables
Modify the setVariables.sh script according to your environment. The values given in Listing 5 are
examples. Read the comments and instructions embedded within the example.
Listing 5. Extract from the setVariables.sh script
export WAS_HOME=/opt/IBM/WebSphere/AppServer
export CELL_NAME=myhostCell01
#set the profile name used by WAS running MDM Server. such as AppSrv01 and Custom01
export NODE_NAME=Custom01
export APP_NAME=MDM_801
#The Name of the WebSphere Application Server running MDM Server,
#You will have this only if you followed Step 4 above
export SRV_NAME=Cluster_member1
export INSTALL_HOME=/usr/IBM/MDM_801
# IIS Server Version: Could be 801 or 81
export IIS_SRV_VERSION=801
export DB_NAME=MDMDB
export DB_USER=myDBuser
export DB_PASSWORD=myDBpassword
export TABLE_SPACE=TABLESPACE1
export INDEX_SPACE=INDEXSPACE1
export LONG_SPACE=LONGSPACE1
export TRIG=COMPOUND
export DEL_TRIG=TRUE
export APPLICATION_NAME='WebSphere Customer Center'
export APPLICATION_VERSION=8.0.1.0
export DEPLOY_NAME=MDM_801
#You need to set this only if you are integrating QualityStage with MDM Server.
#Please note the back slashes. The number 2809 here refers to the
#bootstrap port of WebSphere Application Server instance running IIS server.
export ISP_URL='iiop://myIISserver.mylab.ibm.com:2809'
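A missing or empty value in setVariables.sh is easy to overlook. The following Python sketch, which is not part of the installer, reads the export lines of the script text and reports which of the variables you name have no value. The list of required names is something you supply. The sketch does not interpret shell syntax beyond simple export NAME=value lines.

```python
import re

# Matches lines of the form: export NAME=value
EXPORT_LINE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def read_exports(script_text):
    """Return a dict of the variables set by simple 'export NAME=value' lines."""
    values = {}
    for line in script_text.splitlines():
        match = EXPORT_LINE.match(line)
        if match:
            # Drop surrounding whitespace and quotes from the value.
            values[match.group(1)] = match.group(2).strip().strip("'\"")
    return values


def missing_variables(script_text, required):
    """Return the names in required that are absent or set to an empty value."""
    values = read_exports(script_text)
    return [name for name in required if not values.get(name)]
```

For example, calling missing_variables with the contents of setVariables.sh and the list ["WAS_HOME", "DB_NAME", "DB_USER"] returns an empty list when all three are set. Commented-out lines are ignored.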
Step 7. Execute the scripts.
a. Execute install_ELPTx.sh.
b. If you are integrating InfoSphere MDM Server with QualityStage, run the
install_ELPCustom.sh script as well.
Step 8. Check for errors.
Go through all the log files to ensure there are no errors.
Step 9. Repeat steps for a clustered environment.
If you are installing in a clustered environment, complete the steps below for each cluster member.
a. Reconfigure setVariables.sh to point to another cluster member.
b. Run the additionalClusterInstall.sh script.
c. If you are integrating InfoSphere MDM Server with QualityStage, run the
install_ELPCustom.sh script.
Note: As part of the install_ELPCustom.sh script, there are changes made to InfoSphere MDM
Server database. Some of these changes cannot be executed more than once (such as a DB
insert). Either ignore these errors during repeated execution of this script, or alter the script so that
it does not attempt to repeat the database operations.
Step 10. Configure the SIF parser.
Complete this step only if you want to use a SIF parser. Otherwise, skip to Step 11. The example
uses the default XML parser. To configure the batch processor to use the SIF parser, modify the
following:
a. In the DWLCommon_extention.property file, which is in properties.jar on server runtime
environment, set sif_compatibility_mode = on.
b. In the batch extension property file, set ParserAndExecConfiguration.Parser = SIF.
For more details, see the section SIF Parser in
MDMRapidDeploymentPackage_CompositeMaintenanceServices.pdf.
Step 11. Restart the InfoSphere MDM Server.
Restart the InfoSphere MDM Server, including all the servers in a cluster.
Integration of InfoSphere MDM Server with QualityStage
If you want to use default standardization and matching algorithms from InfoSphere MDM
Server, these steps are not needed, and you can continue to Optimizing performance with key
configuration parameters. However, if you want InfoSphere MDM Server to use QualityStage for
standardization and matching, this section describes how to configure them.
These instructions assume the following:
• InfoSphere MDM Server is installed and all the required fixpacks are applied.
• EntryLevelMDM is installed.
• The IIS server and IIS client are installed. The version of the IIS client must be the same as
that of the IIS server.
• The software stack is similar to that described in the Software and hardware stack section of
the example environment.
See Resources to access the documentation for InfoSphere MDM Server and QS integration
(MDM Server Developers Guide, chapter titled Integrating IBM Information Server QualityStage
with IBM InfoSphere Master Data Management Server). The instructions in this article complement
those mentioned in the developer's guide. However, there are a few configuration changes
mentioned in this article that are helpful during the installation.
Step 1. Change security settings.
If global security is enabled on the WebSphere Application Server running IIS, the transaction
protocol security on that server must be disabled. To disable protocol security on a server,
complete the following steps in the administrative console:
a. In the administrative console, click Servers > Application Servers > server_name. The
properties of the application server are displayed in the content pane.
b. Under Container Settings, expand Container Services and click Transaction Service to
display the properties page for the transaction service.
c. Under Additional Properties, click Custom Properties.
d. On the Custom Properties page, click New.
e. Type DISABLE_PROTOCOL_SECURITY in the Name field, and type TRUE in the Value
field.
f. Click Apply or OK.
g. Click Save to save your changes to the master configuration.
h. Restart the server.
Optionally, if WebSphere Application Server application security is turned on for InfoSphere MDM
Server, the LTPA keys need to be shared between the MDM WebSphere Application Server cell
and the IIS WebSphere Application Server cell. For detailed instructions, refer to the WebSphere
Application Server Information Center (see Resources).
Step 2. Get the installer.
The installable components are part of the same bundle that you used while installing maintenance
services. You will find them in the QualityStage folder.
Step 3. Create the IIS project.
Use the IIS Administrator Client to connect to the IIS server. Create a new project called
ELMDMQS.
Step 4. Import the IIS project.
1. Log into the ELMDMQS project through the DataStage and QualityStage Designer.
2. Click Import > DataStage Components.
3. Browse to the ELMDMQS.dsx file under the EntryLevelMDMQualityStage folder you
extracted above.
4. Import the file.
Step 5. Provision imported rule sets.
You need to provision imported rule sets to the designer client before a job that uses them can be
compiled. Complete the following steps to provision imported rule sets.
a. In the Designer client, find the rule set within the repository tree ELMDMQS > ELMDMRT >
Standardization Rules > MDMQS.
b. Select the rule set by right-clicking and selecting Provision All from the menu, as shown in
Figure 2.
Figure 2. Provisioning rule sets
c. Repeat the steps for all the rulesets listed below.
• MDMQSStandardization RulesMDMCanadaCAADDRMDMCAADDR
• MDMQSStandardization RulesMDMCanadaCAAREAMDMCAAREA
• MDMQSStandardization RulesMDMUSAUSADDRMDMUSADDR
• MDMQSStandardization RulesMDMUSAUSAREAMDMUSAREA
• MDMQSStandardization RulesMNADKEYSMNADKEYS
• MDMQSStandardization RulesMNNAMEMNNAME
• MDMQSStandardization RulesMNNMKEYS
• MDMQSStandardization RulesMNPHONEMNPHONE
• MDMQSStandardization RulesMNSPOSTMNSPOST
Step 6. Prepare test data and configure parameters
a. Copy the provided test data (*.csv files and *.txt) into a directory on your IIS server (not the IIS
client) called /data01/ELMDMQS.
b. Open the parameter set ELMDMQS_Data_Directory under ELMDMQS\ELMDMRT\Parameter
Sets (in the Repository view of the designer).
c. Double-click on the Parameter set.
d. Go to the Values tab and set the value of the parameter DATADIR to the directory path into
which you just copied the test data (/data01/ELMDMQS/ in this example), as shown in Figure
3. Note the slash (/) at the end of the parameter value.
Figure 3. Parameter set
e. Under the ELMDMQS\ELMDMRT\Shared Containers folder, double-click to open the shared
container MDMQSPartySuspectReferenceMatchOrganization.
f. Set the file paths of data set stages Data_Frequency and Reference_Frequency to the same
path that you provided for ELMDMQS_Data_Directory.DATADIR in the previous step, as
shown in Figure 4.
Figure 4. Edit input file path
g. Click OK to save the changes.
h. Close the stage, clicking Yes when it prompts you to save the changes in the stage.
i. Repeat the above steps for MDMQSPartySuspectReferenceMatchPerson.
Step 7. Compile the jobs.
a. Compile all the jobs inside the ELMDMQS\ELMDMRT\Jobs folder and its subfolders using
Tool > Multiple Job compile from the designer client's menu.
b. Follow the instructions in the wizard, and start compiling.
Note: Batch versions of jobs can be found in the ELMDMQS\ELMDMRT\Jobs folder. Information
Services Director (ISD) versions of these jobs can be found in the ELMDMQS\ELMDMRT\Jobs\ISD
folder.
Step 8. Generate match frequency data
a. Use the director client to run the job
ELMDMQS\ELMDMRT\Jobs\MDMQS_Person_Match_Frequency_Generation to generate the match
frequency data. When completed, it generates the files PersonRefMatchTransFreq.txt and
PersonRefMatchCandFreq.txt, as shown in Figure 5.
Figure 5. Generating match frequency data
b. Similarly, run ELMDMQS\ELMDMRT\Jobs\MDMQS_Org_Match_Frequency_Generation to
generate the files OrgRefMatchTransFreq.txt and OrgRefMatchCandFreq.txt.
Step 9. Run the test jobs.
a. Use the director client to run the following batch jobs to test that they execute successfully on
your system before you use the ISD jobs:
• All jobs in ELMDMQS\ELMDMRT\Standardization Testing
• All jobs in ELMDMQS\ELMDMRT\Match Testing
b. After running the jobs, view the output in the Sequential file to check the result.
Step 10. Deploy services using ISD
a. Log on to the IBM Information Server (IIS) console.
b. Click File > Import Information Services Project > Browse for the file
ELMDMQS_ISDProject.xml in the EntryLevelMDMQualityStage directory.
c. Keep all the default settings, and click Import.
d. Open the Information Service Application (ELMDMQS) contained in the imported project.
e. Click Develop, as shown in Figure 6.
Figure 6. Selecting the Develop icon
f. Click Information Services Application.
g. On the resulting screen, double-click the ELMDMQS application to open it.
h. Go into Edit mode.
i. In the Select a View window, click Services > ELMDMQSService, as shown in Figure 7.
Figure 7. Configuring jobs using ISD
j. In the expanded tree, select Operations, and double-click the operations one at a time to edit
each of them.
Figure 8. Checking the project name
k. Edit each of the operations as follows:
i. Ensure that the project name is correct, as shown in Box 1 in Figure 8. When you
created the new project using the administration client, if you chose ELMDMQS as the
name of the project, you can keep the defaults. If you specified another name, ensure
that the project name and the job names are correct. To check the project and job
names, click the Edit button, and browse to the project and job in the ISD folder.
ii. Ensure that the Group Arguments into Structure option is enabled for inputs, as shown in
Box 2 in Figure 8.
iii. Change the input data type according to Table 1 below, as shown in Box 3 in Figure 8.
iv. Check or uncheck the Accept array checkboxes according to Table 1, as shown in Box 4
in Figure 8 (the checkbox should show a checkmark if the table entry indicates Yes).
v. Check or uncheck the output data type and Accept array checkboxes on the output tab
according to Table 1.
Table 1. ISD job configuration
Operation name | Operation job name | Inputs accept array | Input data type | Outputs return array | Output data type
standardizeAddress | ISD_MDMQS_Address_Standardization | No | AddressInput | No | AddressOutput
Figure 10. Deploying the application
o. Click Deploy, as shown in Figure 10.
p. Leave the defaults, and click Deploy to start the deployment.
Step 11. Set configuration values for QualityStage.
Note: This example integration is being done for an InfoSphere MDM Server installation on
which maintenance services are installed. During the installation of maintenance services, if you
ran install_ELPCustom.sh then you can skip to Optimizing performance with key configuration
parameters.
Set the configuration values according to Table 2 in order to properly communicate with the IIS-QS
server.
Table 2. Configuration modifications
Configuration name | Default value
/IBM/ThirdPartyAdapters/IIS/defaultCountry | 185
/IBM/ThirdPartyAdapters/IIS/initialContextFactory | This configuration element is used in conjunction with the provider URL to use JNDI registry initial context. A typical value for this element is com.ibm.websphere.naming.WsnInitialContextFactory.
/IBM/ThirdPartyAdapters/IIS/providerURL | iiop://<yourQSServer>:<QSServerBootstrapPort>. For example: iiop://myIIS.torolab.ibm.com:2809.
/IBM/Party/Standardizer/Name/className | com.ibm.mdm.thirdparty.integration.iis8.adapter.InfoServerStandardizerAdapter
/IBM/Party/Standardizer/Address/className | com.ibm.mdm.thirdparty.integration.iis8.adapter.InfoServerStandardizerAdapter
Step 12. Use QualityStage (QS) name and address standardization.
Use QS to standardize names and addresses that are entered into InfoSphere MDM Server. See
Standardizing name, address and phone number information in the MDM developer's guide (see
Resources) for more information.
Step 13. Use QualityStage in suspect duplicate processing.
QualityStage can be used with the InfoSphere MDM Server Suspect Duplicate Processing (SDP)
feature. See Configuring IBM Information Server QualityStage integration for SDP in the MDM
developer's guide (see Resources) for more information on using QualityStage with SDP.
Optimizing performance with key configuration parameters
After you install the InfoSphere MDM Server, tune the key configuration parameters for optimal
performance.
InfoSphere MDM Server and batch processor configuration
1. Increase the number of submitters to increase parallelism. Do this by editing the file
<MDM_installation_Folder>/BatchProcessor/properties/Batch.properties. On an 8-way MDM
Server box, 24 submitters are optimal.
2. Increase JVM heap settings for the batch processor. Do this by editing the file
<MDM_installation_Folder>/BatchProcessor/bin/runbatch.sh. For example: for 24 submitters,
512MB of heap is sufficient.
3. Reduce BatchProcessor logging by setting the threshold to ERROR. Do this by editing
<MDM_installation_Folder>/BatchProcessor/Log4J.properties and setting the logging
threshold to ERROR, if it is not already. For example: log4j.appender.file.Threshold=ERROR.
4. Reduce MDM Server logging by setting the threshold to ERROR. Do this by editing
Log4J.properties inside the properties.jar file at <WebSphere_Location>/profiles/
<ServerName>/installedApps/<CellName>/<InstanceName>/properties.jar.
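The logging changes in steps 3 and 4 amount to setting one key in a properties file. The following is a rough sketch of how such an edit could be scripted. The set_property helper is hypothetical, and it only handles simple key=value lines, not every syntax that Java properties files allow.

```python
def set_property(text, key, value):
    """Return properties-file text with key set to value, appending it if absent.

    Assumption: entries use the plain key=value form; comments start with # or !.
    """
    lines = text.splitlines()
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "!")):
            continue  # leave blank lines and comments untouched
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "# batch processor logging\nlog4j.appender.file.Threshold=DEBUG\n"
    print(set_property(sample, "log4j.appender.file.Threshold", "ERROR"))
```

For the MDM Server copy of Log4J.properties, the file would first have to be extracted from properties.jar and repackaged afterward; that step is not shown here.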
WebSphere Application Server configuration
1. Increase the JDBC connection pool size to support the parallelism.
a. From the WebSphere Administration Console, go to Resources > JDBC > Data sources
> DWLCustomer > Connection pool properties.
b. Increase the value for Maximum connections. The example setup uses 50.
2. Increase the prepared statement cache size.
a. The size of the prepared statement cache depends on the number of unique SQL
statements used in your application. For InfoSphere MDM Server, set it to 300 and
monitor the application to determine if the cache size needs to be increased.
b. It can be changed from the WebSphere Administration Console. Go to Resources
> JDBC > Data sources > DWLCustomer > Connection pools > WebSphere
Application Server data source properties.
3. Increase the EJB cache size. Do this by using the WebSphere Administration Console to go
to Servers > Application servers > [ServerName] > EJB Container Settings > EJB cache
settings. The example uses 4000.
4. Change the JVM heap size and GC policy.
a. From the WebSphere Administration Console, go to Servers > Application servers >
[ServerName] > Java and Process Management > Process Definition > Java Virtual
Machine.
b. Indicate the initial heap size as 512 MB and the maximum heap size as 1024 MB.
c. Use the gencon GC policy for better performance. To use this GC policy, specify
-Xgcpolicy:gencon under Generic JVM arguments. During testing with the gencon GC
policy, WebSphere Application Server sometimes generated unnecessary heap dumps.
To disable this behavior, do the following after the server is started:
i. From the WebSphere Administration Console, go to Servers > Application
servers > [ServerName] > Performance > Performance and Diagnostic Advisor
Configuration > Runtime (tab).
ii. Clear the Enable automatic heap dump collection check box.
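The example values from this section can be recorded together and checked for obvious inconsistencies before a load run. This is only a sketch: the LoadTuning class and check_tuning function are hypothetical, and the rule that the JDBC pool should hold at least one connection per submitter is an assumption made for illustration, not a documented product requirement.

```python
from dataclasses import dataclass


@dataclass
class LoadTuning:
    """Example values used in this article's setup."""
    submitters: int = 24
    jdbc_max_connections: int = 50
    statement_cache_size: int = 300
    ejb_cache_size: int = 4000
    initial_heap_mb: int = 512
    max_heap_mb: int = 1024


def check_tuning(tuning):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Assumption: every submitter may hold at least one database connection.
    if tuning.jdbc_max_connections < tuning.submitters:
        problems.append("JDBC maximum connections is lower than the number of submitters")
    if tuning.initial_heap_mb > tuning.max_heap_mb:
        problems.append("initial heap size exceeds maximum heap size")
    return problems


if __name__ == "__main__":
    print(check_tuning(LoadTuning()))
    print(check_tuning(LoadTuning(submitters=64)))
```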
Database tuning (DB2)
Follow established best practices when you set up the database server, monitor database
performance closely, and tune the database as needed for optimal performance and efficient
resource usage. This section briefly describes several recommendations for configuring and
tuning a DB2 database. The basic concepts also apply to other types of databases.
• Use one set of dedicated disks for DB2 transaction logs and another set of dedicated disks
for DB2 table spaces. If possible, also use different disk controllers for the transaction logs
and the table spaces. This lets you configure each controller independently for its I/O
pattern: the controller for the transaction logs can favor writes, while the controller for the
table spaces handles a mix of writes and reads.
• Ensure read and write cache is enabled on the storage system. Monitor the cache
effectiveness, and configure the cache size properly.
• Properly plan the table spaces to ensure balanced I/O operations across all of the available
disks. This avoids hot spots in your database and avoids limiting your overall database
performance to the bandwidth of a few of the busiest disks. This maximizes the utilization of
all the I/O bandwidth available from all the physical disks.
• In addition to a well-planned table space layout over the I/O system, one of the
configuration parameters with the largest effect on performance is the database buffer pool
size. Pay close attention to the overall buffer pool hit ratio, which tells you how often the
needed data is found in the database buffer pools instead of being read from the physical
disks (which is very expensive).
• Strive for a buffer pool hit ratio of 80% or higher for data, and 90% or higher for indexes.
Typically in MDM Server implementations, start with one big buffer pool for both data and
indexes. If necessary, separate data and indexes into two different buffer pools to help ensure
a good index buffer pool hit ratio.
• Because MDM Server allows a good amount of customization and extension, analyze the
most expensive SQL statements using the database snapshot or other tools. Ensure that
those statements have optimal access plans with the best indexes in place.
Consider these recommendations together when you tune for performance, because the
behavior of one area might be just a symptom of another incorrectly configured or
misbehaving area.
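The buffer pool hit ratio mentioned above is commonly computed from logical and physical read counters, for example the pool_data_l_reads, pool_data_p_reads, pool_index_l_reads, and pool_index_p_reads elements in a DB2 snapshot. The sketch below assumes those four counters have already been read from a snapshot; the function names are hypothetical, and the thresholds are the 80% and 90% targets given in this section.

```python
def hit_ratio(logical_reads, physical_reads):
    """Percentage of page requests satisfied from the buffer pool."""
    if logical_reads <= 0:
        raise ValueError("logical_reads must be positive")
    return 100.0 * (logical_reads - physical_reads) / logical_reads


def meets_targets(data_l, data_p, index_l, index_p,
                  data_target=80.0, index_target=90.0):
    """Check data and index hit ratios against the targets from this article."""
    return (hit_ratio(data_l, data_p) >= data_target
            and hit_ratio(index_l, index_p) >= index_target)


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    print(hit_ratio(1000, 100))
    print(meets_targets(1000, 100, 2000, 300))
```

If the index ratio falls short while the data ratio is acceptable, that is the situation in which separating data and indexes into two buffer pools, as suggested above, may help.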
Understanding performance test methodology used in the example
Input data preparation
The maintainContractPlus transaction was used for testing the example. Because the default
parser from the BatchProcessor was used, the input data format had to be LineFeed delimited
XML transactions.
The first step toward getting the input data set was to create seed-data. The seed-data was
generated using a home-grown, Java-based tool with key distributions based on U.S. Census
data (2000). Some realistic data was added to make the overall parties closely match a typical
MDM business scenario. The seed-data contained details such as name, gender, date of birth,
and addresses.
As a second step, a template for the maintainContractPlus transaction was created. This template
had variables for key party details that needed to be filled in with generated seed-data. Another
home-grown, Java-based tool was used to generate the XML transactions. One such transaction
yielded one person with one name, one address, one contract, and one contact method. Table 4
shows the detailed profile of the database tables populated by these transactions. The example
run used a total of one million such records as one input data set, each representing one party
and its associated attributes.
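The template-filling step can be sketched as follows. This is not the Java tool used in the tests. The XML element names in the template are hypothetical placeholders and do not reflect the real maintainContractPlus request schema; the sketch only illustrates substituting seed-data into a template and writing one transaction per line, as the default parser expects.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical, heavily simplified template; a real request is far larger.
TEMPLATE = Template(
    '<Transaction type="maintainContractPlus">'
    "<GivenName>$first</GivenName>"
    "<LastName>$last</LastName>"
    "<Gender>$gender</Gender>"
    "<BirthDate>$dob</BirthDate>"
    "</Transaction>"
)


def render_transactions(seed_rows):
    """Yield one XML transaction per seed row, each on a single line."""
    for row in seed_rows:
        # Escape XML special characters in the seed values.
        values = {key: escape(str(value)) for key, value in row.items()}
        line = TEMPLATE.substitute(values)
        # Line feeds separate transactions, so none may appear inside one.
        yield line.replace("\r", " ").replace("\n", " ")


if __name__ == "__main__":
    seed = [{"first": "Ann", "last": "Smith & Jones", "gender": "F", "dob": "1970-01-01"}]
    for transaction in render_transactions(seed):
        print(transaction)
```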
Suspect duplicate data preparation
The data generated in the example so far was primarily clean. A similar approach was used
to generate dirty data, which included 40% duplicates. This data set was used when Suspect
Duplicate Processing was turned on.
During the initial load, the input data might have duplicate entries, where details from one record
closely resemble those from another one. Such records are termed suspect duplicates.
Depending on how closely two records match, suspect duplicates are assigned a match category.
To determine the match category, some critical data fields are used while comparing two records.
The critical data fields include first name, last name, address, date of birth, gender, and social
security number. Based on comparison results, the suspect duplicates are assigned a match-score
and a non-match-score, and then the match category is derived. Depending on the match
category, InfoSphere MDM Server takes appropriate actions for the suspect duplicates.
When testing the example, two sets of data were used:
• 100% clean data with no suspect duplicates in the input data set
• 60% clean data with 40% of the records as suspect duplicates.
The example test included four types of suspect duplicates in the 60% clean data set. The
population of each type was kept equal, and the suspect duplicates were randomly distributed in
the data using home-grown, Java-based tools.
The details of this data set are shown in Table 3.
Table 3. Details of input data with suspects
| No. | Matching critical data details | Non-matching critical data details | Population | Weight (match/non-match score) | Match category |
| --- | --- | --- | --- | --- | --- |
| 1 | Gender, FirstName, LastName, Address, DOB, SSN | None | 10% | 63/0 | A1 |
| 2 | Gender, FirstName, LastName, DOB, SSN | Address | 10% | 60/3 | A2 |
| 3 | Gender, Address, DOB, SSN | FirstName, LastName | 10% | 55/4 | A2 |
| 4 | Gender, Address, LastName, DOB | FirstName (and SSN field is empty) | 10% | 46/1 | B |
The scores and categories in Table 3 are calculated by InfoSphere MDM Server's deterministic
matching approach, which is the default implementation for party-matching. In contrast,
QualityStage matching offers a probabilistic matching approach, and it calculates only one
composite weight.
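The composition of the 60% clean data set can be sketched as a plan that assigns each record either to the clean population or to one of the four suspect types, with equal shares and a random distribution. This is an illustrative sketch, not the tool used in the tests. The function name is hypothetical, and type 0 is a label introduced here for clean records.

```python
import random

# Suspect type number -> match category, as listed in Table 3.
SUSPECT_CATEGORIES = {1: "A1", 2: "A2", 3: "A2", 4: "B"}


def plan_record_types(total, share_per_type=0.10, seed=0):
    """Return a shuffled list of type numbers; 0 marks a clean record."""
    per_type = int(round(total * share_per_type))
    plan = []
    for suspect_type in SUSPECT_CATEGORIES:
        plan.extend([suspect_type] * per_type)
    # Everything that is not a suspect duplicate is a clean record.
    plan.extend([0] * (total - len(plan)))
    # A fixed seed keeps the generated data set reproducible between runs.
    random.Random(seed).shuffle(plan)
    return plan


if __name__ == "__main__":
    plan = plan_record_types(1000)
    for record_type in sorted(set(plan)):
        print(record_type, plan.count(record_type))
```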
Data profile
Table 4 shows the population of InfoSphere MDM Server database tables when the two sets of
input data are loaded.
Table 4. Database population
| Table name | 100% clean data | 60% clean data |
| --- | --- | --- |
| ADDRESS | 1,000,000 | 700,000 |
| ADDRESSGROUP | 1,000,000 | 900,000 |
| CONTACT | 1,000,000 | 900,000 |
| CONTACTMETHOD | 1,000,000 | 900,000 |
| CONTACTMETHODGROUP | 1,000,000 | 900,000 |
| CONTEQUIV | 1,000,000 | 1,000,000 |
| CONTRACT | 1,000,000 | 1,000,000 |
| CONTRACTCOMPONENT | 1,000,000 | 1,000,000 |
| CONTRACTROLE | 1,000,000 | 1,000,000 |
| IDENTIFIER | 1,000,000 | 900,000 |
| LOBREL | 1,000,000 | 900,000 |
| LOCATIONGROUP | 2,000,000 | 1,800,000 |
| MISCVALUE | 1,000,000 | 1,000,000 |
| PERSON | 1,000,000 | 900,000 |
| PERSONNAME | 1,000,000 | 900,000 |
| PERSONSEARCH | 1,000,000 | 900,000 |
| SUSPECT | 0 | 300,000 |
Test methodology
Different tests were performed to check stability and scalability and to measure the overhead
associated with several commonly used features. All the tests were conducted in two solution
configurations:
• The MDM Server only solution, where InfoSphere MDM Server uses its own algorithm for
standardization and matching. In this case, IBM Information Server is not required.
• MDM Server + QS solution, where InfoSphere MDM Server uses QualityStage to do the
standardization and matching.
The methodology for all these tests was similar:
1. Set up the systems. Do the configuration and tuning of various components as mentioned in
previous sections.
2. Prepare a set of input data with 10,000 records using the approach mentioned.
3. Load the input data with 10,000 records using 1 submitter in the batch processor. This is done
to avoid deadlocks while working with an empty database.
4. Perform DB2 reorgchk on all the tables to update statistics.
5. Create a backup of the MDM Server database at this stage, and use it as the starting point
for all the tests.
The following steps were used to run the example test:
1. Restore the database using the backup copy.
2. Change the database configuration if required for the test. For example, you may want to
switch OFF Suspect Duplicate Processing.
3. Restart WebSphere Application Server running InfoSphere MDM Server.
4. Run data collection scripts in the background, which collect CPU statistics, IO statistics, and
database snapshots.
5. Start the test to load the selected input dataset.
6. Collect the logs from InfoSphere MDM Server, WebSphere Application Server, and DB2
database server.
7. Derive response time and throughput from transactiondata.log as generated by InfoSphere
MDM Server.
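The last step can be sketched as follows. The layout of transactiondata.log is not described in this article, so the sketch assumes the log has already been parsed into pairs of completion time (in seconds) and response time (in milliseconds). The summarize function is a hypothetical helper, not part of InfoSphere MDM Server.

```python
from statistics import mean


def summarize(records):
    """Compute overall throughput and average response time.

    records: iterable of (completion_time_seconds, response_time_ms) pairs,
    assumed to have been extracted from the log beforehand.
    """
    records = list(records)
    if len(records) < 2:
        raise ValueError("at least two records are needed")
    times = [completed for completed, _ in records]
    elapsed = max(times) - min(times)
    if elapsed <= 0:
        raise ValueError("records must span a positive time interval")
    return {
        # Total volume loaded divided by total time taken.
        "tps": len(records) / elapsed,
        "avg_response_ms": mean(response for _, response in records),
    }


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(summarize([(0.0, 100.0), (5.0, 200.0), (10.0, 300.0)]))
```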
Measuring performance results
This section describes the performance measurements including the following:
• Results showing very stable performance throughput and response time
• Performance overhead of some commonly used features in the context of initial data loading
• Scalability of throughput
Test 1: Stability of throughput and response time
The purpose of this test is to show whether the throughput and response times remain stable as
the loading progresses and as the database size increases. This test also measures the system
resource usage pattern throughout the test. The data for throughput and response time is derived
from transactiondata.log, as generated by InfoSphere MDM Server.
Various tests were conducted for both MDM Server only and MDM Server + QS scenarios, and all
of them showed good stability. Table 5 shows the configuration settings for the first test.
Table 5. Test 1 configuration
| Parameter | Value |
| --- | --- |
| Hardware/Software stack | As described in example test environment |
| InfoSphere MDM Server heap size | Initial: 512MB; Max: 1024MB |
| InfoSphere MDM Server JVM GC policy | gencon |
| Number of submitters in batch processor | 24 |
| Batch processor JVM memory | 512MB |
| ISD job configurations (applicable to MDM Server + QS scenario only) | Default |
| Type of transaction used | MaintainContractPlus |
| Total volume | 1 million parties and their associated records |
| Input data quality | 60% clean; 40% suspected duplicates of various types |
| Name standardization | ON (default) |
| Address standardization | ON (StandardFormatingIndicator set to N in the request XML) |
| Suspect duplicate processing | ON |
| History triggers | Enabled |
Test 1 results: Stability results
Figure 11 shows the throughput and response times captured for the MDM Server only scenario.
The chart shows that throughput and response time are stable during the whole run duration. The
results for the MDM Server + QS scenario are similar.
Figure 11. Throughput and response time
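One way to quantify the stability that a chart of this kind shows is to compute throughput per fixed time window and look at how much it varies between windows. The sketch below assumes a list of transaction completion times in seconds has already been extracted from transactiondata.log. The function names and the 60-second window are choices made for illustration, not part of the test methodology described here.

```python
from collections import Counter
from statistics import mean, pstdev


def throughput_per_window(completion_times, window_seconds=60.0):
    """Return transactions per second for each consecutive time window."""
    start = min(completion_times)
    counts = Counter(int((t - start) // window_seconds) for t in completion_times)
    last_window = max(counts)
    # Windows without any completed transaction count as zero throughput.
    return [counts.get(i, 0) / window_seconds for i in range(last_window + 1)]


def relative_spread(values):
    """Population standard deviation divided by the mean (0 means perfectly flat)."""
    average = mean(values)
    return pstdev(values) / average if average else float("inf")


if __name__ == "__main__":
    # Made-up data: one transaction completing every half second for two minutes.
    times = [i * 0.5 for i in range(240)]
    per_window = throughput_per_window(times)
    print(per_window, relative_spread(per_window))
```

The final window of a real run is usually only partly filled, so it may be reasonable to drop it before computing the spread.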
Figure 12 shows that with a sufficient number of submitters configured, almost all CPU
resources on the WebSphere Application Server system running InfoSphere MDM Server can
be used, and the system does not have any other bottlenecks. Figure 12 also shows the resource
usage on other systems.
Figure 12. Resource usage
Test 2: Feature overheads
The purpose of these tests is to measure the overhead of four commonly used features of
InfoSphere MDM Server. The overhead of the following features was measured:
• Name standardization
• Address standardization
• Suspect duplicate processing
• History triggers
Overhead is expressed as a percentage reduction in throughput per unit of time when the
feature is enabled. For example, 5% overhead associated with a particular feature means that if
throughput was 100 transactions per second (TPS), it becomes 95 TPS due to overhead when the
feature is enabled. Throughput is measured as total data volume loaded / total time taken.
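The two definitions above translate directly into a short calculation. The function names in this sketch are hypothetical; the formulas are the ones stated in the preceding paragraph.

```python
def throughput(total_records, total_seconds):
    """Throughput as total data volume loaded divided by total time taken."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return total_records / total_seconds


def overhead_percent(baseline_tps, feature_tps):
    """Percentage reduction in throughput when the feature is enabled."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return 100.0 * (baseline_tps - feature_tps) / baseline_tps


if __name__ == "__main__":
    # The example from the text: 100 TPS dropping to 95 TPS is a 5% overhead.
    print(overhead_percent(100.0, 95.0))
```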
Various tests were conducted for both MDM Server only and MDM Server + QS scenarios,
enabling one or more features at a time. In the MDM Server + QS scenario, the overheads of
standardization and suspect duplicate processing should be higher because they involve extra
processing by QualityStage.
Table 6 shows the configuration settings for the second test.
Table 6. Test 2 configuration
| Parameter | Value |
| --- | --- |
| Hardware/Software stack | As described in example test environment |
| InfoSphere MDM Server heap size | Initial: 512MB; Max: 1024MB |
| InfoSphere MDM Server JVM GC policy | Default |
| Number of submitters in batch processor | 24 |
| Batch processor JVM memory | 512MB |
| ISD job configurations (applicable to MDM Server + QS scenario only) | Default |
| Type of transaction used | MaintainContractPlus |
| Total volume | 1 million parties and their associated records |
| Input data quality | a) 100% clean; b) 60% clean |
Following are some notes about the configuration:
• Name standardization was turned ON or OFF by setting /IBM/Party/
ExcludePartyNameStandardization/enabled to FALSE or TRUE, respectively.
• Address standardization was effectively switched ON or OFF by setting
StandardFormatingIndicator to N or Y, respectively, in the transaction request XMLs.
• Suspect duplicate processing was switched ON or OFF by setting the following to TRUE or
FALSE respectively in the configuration table:
• /IBM/Party/SuspectProcessing/enabled
• /IBM/Party/SuspectProcessing/AddParty/returnSuspect
Test 2 results: Feature overheads
Standardization
Table 7 shows the overhead of standardization for the MDM Server only scenario. Tests were
conducted with suspect duplicate processing (SDP) switched OFF and, with SDP switched ON,
with both datasets (100% clean and 60% clean). History triggers were enabled during these tests.
Table 7. Overhead of standardization
| Overhead | SDP OFF | SDP ON (100% clean) | SDP ON (60% clean) |
| --- | --- | --- | --- |
| Overhead of name standardization | 2% | 3% | 3% |
| Overhead of address standardization | 2% | 2% | 0% |
| Overhead of name and address standardization | 4% | 3% | 2% |
Note: With 60% clean data, there are fewer unique addresses. This can result in less overhead.
Suspect duplicate processing
Table 8 shows the overhead of suspect duplicate processing with and without standardization in
the MDM Server only scenario. Tests were conducted with both datasets (100% clean and 60%
clean). History triggers were enabled during these tests.
Table 8. Overhead of suspect duplicate processing
| Overhead | 100% clean data | 60% clean data |
| --- | --- | --- |
| Overhead of suspect duplicate processing | 3% | 20% |
| Overhead of suspect duplicate processing along with name and address standardization | 6% | 21% |
History triggers
If history triggers are enabled, the IO requirement on the database server increases significantly
(nearly doubles). With enough IO bandwidth provided, the overhead is small (approximately 5%).
Test 3: Scalability tests
Scalability is a measure of how well the throughput increases when more load is put on the
system. In the example test, the number of processors was not varied. Instead, the number of
parallel requests to InfoSphere MDM Server was changed by varying the number of submitters
in the batch processor. Data points were collected between 1 submitter and 24 submitters, at
which point the system was clearly saturated.
The test was conducted for both the MDM Server only and the MDM Server + QS scenarios. Tests
were conducted in different configurations, and all of them showed near linear scalability.
Table 9 shows the configuration settings for the third test.
Table 9. Test 3 configuration
| Parameter | Value |
| --- | --- |
| Hardware/Software stack | As described in example test environment |
| InfoSphere MDM Server heap size | Initial: 512MB; Max: 1024MB |
| InfoSphere MDM Server JVM GC policy | Default |
| Number of submitters in batch processor | Varied from 1 to 24 |
| Batch processor JVM memory | 512MB |
| ISD job configurations (applicable to the MDM Server + QS scenario only) | Default |
| Type of transaction used | MaintainContractPlus |
| Total volume | 15,000 to 100,000 records |
| Input data quality | 60% clean |
| Name standardization | ON (default) |
| Address standardization | ON (StandardFormatingIndicator set to N in the request XML) |
| Suspect duplicate processing | ON |
| History triggers | Enabled |
Test 3 results: Scalability results
Figure 13 shows the scalability for the MDM Server only scenario. As shown by the green line,
the throughput increases almost linearly with an increase in the number of submitters. The
example configuration utilized more than 90% of CPU capacity on the server running InfoSphere
MDM Server. The results for MDM Server + QS are similar.
Figure 13. Scalability of InfoSphere MDM Server with SDP ON
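Near-linear scalability can be expressed as a scaling efficiency: the measured throughput at a given number of submitters divided by the throughput that perfect linear scaling from the smallest measured configuration would give. The sketch below uses made-up throughput values purely for illustration; they are not results from these tests, and the function name is hypothetical.

```python
def scaling_efficiency(tps_by_submitters):
    """Map each submitter count to measured TPS / ideal linear TPS.

    The ideal is extrapolated from the smallest submitter count measured.
    A value of 1.0 means perfectly linear scaling.
    """
    base_count = min(tps_by_submitters)
    base_per_submitter = tps_by_submitters[base_count] / base_count
    return {
        count: tps / (count * base_per_submitter)
        for count, tps in sorted(tps_by_submitters.items())
    }


if __name__ == "__main__":
    # Hypothetical measurements, not data from the article.
    sample = {1: 10.0, 12: 114.0, 24: 216.0}
    for count, efficiency in scaling_efficiency(sample).items():
        print(count, round(efficiency, 2))
```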
Performance is based on measurements and projections using standard IBM benchmarks in
a controlled environment. The actual throughput or performance that any user will experience
will vary depending upon many factors, including considerations such as the amount of
multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and
the workload processed. Therefore, no assurance can be given that an individual user will achieve
results similar to those stated here.
The information in this document concerning non-IBM products was obtained from the supplier(s)
of those products. IBM has not tested such products and cannot confirm the accuracy of the
performance, compatibility or any other claims related to non-IBM products. Questions about the
capabilities of non-IBM products should be addressed to the supplier(s) of those products.
The information contained in this publication is provided for informational purposes only. While
efforts were made to verify the completeness and accuracy of the information contained in this
publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this
information is based on IBM’s current product plans and strategy, which are subject to change
by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or
otherwise related to, this publication or any other materials. Nothing contained in this publication
is intended to, nor shall have the effect of, creating any warranties or representations from IBM or
its suppliers or licensors, or altering the terms and conditions of the applicable license agreement
governing the use of IBM software.
References in this publication to IBM products, programs, or services do not imply that they will
be available in all countries in which IBM operates. Product release dates and/or capabilities
referenced in this presentation may change at any time at IBM’s sole discretion based on market
opportunities or other factors, and are not intended to be a commitment to future product or feature
availability in any way. Nothing contained in these materials is intended to, nor shall have the effect
of, stating or implying that any activities undertaken by you will result in any specific sales, revenue
growth, savings or other results.
Resources
Learn
• See IBM Redbook™ Master Data Management: Rapid Deployment Package for MDM for
more instructions.
• Refer to the IBM InfoSphere MDM Server Information Center for more instructions.
• Refer to the WebSphere Application Server, Version 6.1 Information Center to install IBM
WebSphere Application Server Network Deployment, Version 6.1, and upgrade it with
Fixpack 11.
• Refer to the IBM DB2 Database for Linux®, UNIX®, and Windows Information Center to
install DB2 Database Server, Version 9.5.
• Refer to the IBM Information Server Information Center to install IIS Server, Version 8.0.1.
• Learn more from IBM Redpaper WebSphere Customer Center: Understanding Performance.
• Discover DB2 Tuning Tips for OLTP Applications from this classic developerWorks article.
• Explore the Information Management Software for z/OS Solutions Information Center.
• Learn more about Information Management at the developerWorks Information Management
zone. Find technical documentation, how-to articles, education, downloads, product
information, and more.
• Stay current with developerWorks technical events and webcasts.
Get products and technologies
• Build your next development project with IBM trial software, available for download directly
from developerWorks.
Discuss
• Participate in the discussion forum for this content.
• Check out the developerWorks blogs and get involved in the developerWorks community.