Thursday, July 1, 2010

DataStage Job Compile - "Failed to invoke GenRuntime using phantom process helper." error

Problem (Abstract)

When attempting to compile a job, the user receives: Failed to invoke GenRuntime using phantom process helper.

Cause

Possible causes for this error include:
  • The server's /tmp space is full
  • The job's status is incorrect
  • A format problem with the project's uvodbc.config file
  • A corrupted DS_STAGETYPES file
  • Internal locks


Diagnosing the problem

If the steps under "Resolving the problem" below do not resolve the issue, proceed with the following steps.


Before opening a PMR with support, turn on server side tracing, attempt to compile the problem job, turn off server side tracing, and gather the tracing information.

  1. Turn on server side tracing by connecting to the server with the DataStage Administrator client.
  2. Highlight the project which has the problem job.
  3. Click on the Properties button.
  4. In the Properties window, click the Tracing tab
  5. Click on the Enabled check box
  6. Click the OK button
  7. With a new DataStage Designer connection, attempt to compile the job.
  8. With the DataStage Administrator client, go back into the projects properties
  9. Select the Tracing tab
  10. Uncheck the Enabled check box
  11. For each DSRTRACE entry do the following:
    a) Highlight the entry and click View
    b) Highlight the contents of the display and click on Copy
    c) Paste the copied information into Notepad
  12. Open a PMR with support and supply the Notepad information.

Resolving the problem


Server's /tmp space full:

Clean up space in /tmp
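
To confirm that /tmp is the problem and to spot the largest offenders, a quick check along these lines can help (standard UNIX commands; adjust the path if your engine uses a different temporary directory):

df -k /tmp
du -sk /tmp/* | sort -nr | head -20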


Job's status is incorrect:

DataStage Director->Job->Clear Status File


Format problem with the project's uvodbc.config file:

Confirm uvodbc.config has the following entry/format:
[ODBC DATA SOURCES]
<localuv>
DBMSTYPE = UNIVERSE
network = TCP/IP
service = uvserver
host = 127.0.0.1

Corrupted DS_STAGETYPES:

  1. Connect to the DataStage server, change directory to DSEngine, and source dsenv ( . ./dsenv)
    $ bin/uvsh
    >LOGTO projectname (case sensitive)
  2. Set a file pointer RTEMP to the template DS_STAGETYPES file
    >SETFILE /Template/DS_STAGETYPES RTEMP
  3. Check that all of the entries in the template DS_STAGETYPES file are present in the project's DS_STAGETYPES file
    >SELECT RTEMP
    * this will return a count of records found in the template DS_STAGETYPES file
    >COUNT DS_STAGETYPES
    * this will return a count of records found in the project's DS_STAGETYPES file
    * These numbers should be the same
  4. If the numbers differ and some records are missing from the project's DS_STAGETYPES file
    >COPY FROM RTEMP TO DS_STAGETYPES ALL OVERWRITING
  5. exit Universe shell
    >Q

Internal locks:
  1. Connect to the DataStage server, change directory to DSEngine, and source dsenv ( . ./dsenv)
  2. Change directory to the project's directory that contains the job generating the error.
  3. Execute the following, replacing <job name> with the actual job name.
    $ $DSHOME/bin/uvsh "DS.PLADMIN.CMD NOPROMPT CLEAR LOCKS <job name>"

Monday, April 26, 2010

TeraData Sync Tables From DataStage

Question>
I need to find out how the terasync tables work and what they are in DataStage when using Teradata.

The Teradata DBA says these are a Datastage thing.

What is the purpose of these Tables? Can these be shut off?
Are these tables used with the TD Connector as well ?

Answer>
The Teradata sync tables are indeed created by certain DataStage stages. Teradata Enterprise creates one named terasync that is shared by all Teradata Enterprise jobs that are loading into the same database. The name of the sync table created by the Teradata Connector is supplied by the user, and that table can either be shared by other Teradata Connector jobs (with each job using a unique Sync ID key into that table) or each Teradata Connector job can have its own sync table.

These sync tables are a necessity due to requirements imposed by Teradata's parallel bulk load and export interfaces. These interfaces require a certain amount of synchronization at the start and end of a load or export and at every checkpoint in between. The interface requires a sequence of method calls to be done in lock step. After each player process has called the first method in the sequence, they cannot proceed to call the next method until all player processes have finished calling the first method. So the sync table is used as a means of communication between the player processes. After calling a method, a player updates the sync table and repeatedly checks the sync table until all the other players have also updated the sync table.

In Teradata Enterprise, you cannot avoid using the terasync table. In the Teradata Connector, you can avoid using the sync table by setting the Parallel synchronization property to No, however that stage will be forced to run sequentially in that case.

DataStage - setting ulimit in Clustered or GRID (MPP) Environment

Problem (Abstract)
Ulimits are determined by the dsrpc daemon process for all processes spawned by the Conductor in an SMP environment; however, when you introduce additional processing nodes (MPP), ulimits need to be addressed on those nodes separately.

Symptom
Changing the ulimit on the conductor node does not resolve the error. One symptom of this error: Fatal Error: Need to be able to open at least 16 files; please check your ulimit setting for number of file descriptors [sort/merger.C:1087]

Cause
In a UNIX/Linux MPP configuration, the ulimits are set for the user running the processes on the processing nodes; they are not inherited from the Conductor like the rest of the environment.

Environment
MPP on Unix/Linux

Resolving the problem
Each user that "owns" processes on the remote processing nodes (where the Conductor spawns processes on remote processing nodes as that user) needs to have their ulimit set correctly on EACH processing node. This can be done either by setting the default ulimit for everyone, or by setting each user's ulimit specifically.

An example for AIX:

Modify the /etc/security/limits file's default settings as root user:


default:
fsize = -1
core = 2097151
cpu = -1
data = -1
rss = -1
stack = 65536
nofiles = 102400
nofiles_hard = 102400


The default section is used to set the defaults for all users; you can override individual users in the same manner.


A note about the ulimit for nofiles (number of open files) on 64-bit AIX: you should not set nofiles to -1 (unlimited) or nofiles_hard to unlimited on AIX running a 64-bit kernel. This will cause an integer overflow for Information Server and you will get a variation of the error listed above.
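
To verify the limits that are actually in effect on each processing node, one option is to run ulimit remotely as the user that owns the processes; the node names below are placeholders for your own node list:

for node in node1 node2 node3
do
  echo "==== $node ===="
  ssh $node 'ulimit -n; ulimit -d; ulimit -s; ulimit -c'
done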

A bootstrap address with no port specification defaults to port 2809

Problem (Abstract)
The IBM InfoSphere DataStage client applications cannot connect to the InfoSphere DataStage server when the TCP/IP ports are not available through firewalls for the application server.

Symptom
An error message similar to the following message is displayed:
  • Failed to authenticate the current user against the selected Domain
  • A communication failure occurred while attempting to obtain an initial context with the provider URL: "iiop://s6lpar50:2825". Make sure that any bootstrap address information in the URL is correct and that the target name server is running. A bootstrap address with no port specification defaults to port 2809. Possible causes other than an incorrect bootstrap address or unavailable name server include the network environment and workstation network configuration. javax.security.auth.login.LoginException: A communication failure occurred while attempting to obtain an initial context with the provider URL: "iiop://s6lpar50:2825".

Resolving the problem
Before you install IBM InfoSphere Information Server, assess and configure your network to ensure that the IBM InfoSphere DataStage client applications can connect to the InfoSphere DataStage server when the WebSphere Application Server is behind a firewall.
To configure the application server ports:
  • In the WebSphere Application Server administrative console, expand the application server list under Servers > Application servers, and then select server1.
  • In the Communications section, select Ports.
  • Open the ports for the firewall configuration.
  • Use the following list to identify the ports that are required for your network configuration.

To access the application server behind a firewall, open the following ports:
  • WC_defaulthost
  • BOOTSTRAP_ADDRESS
  • ORB_LISTENER_ADDRESS
  • SAS_SSL_SERVERAUTH_LISTENER_ADDRESS
  • CSIV2_SSL_MUTUALAUTH_LISTENER_ADDRESS
  • CSIV2_SSL_SERVERAUTH_LISTENER_ADDRESS

To use HTTPS to access Web clients, open the following port:
  • WC_defaulthost_secure

To access the WebSphere Application Server administrative console, open the following port:
  • WC_adminhost

To use HTTPS to access the WebSphere Application Server administrative console, open the following ports:
  • WC_adminhost_secure

To publish services using a JMS binding, open the following ports:
  • SIB_ENDPOINT_ADDRESS
  • SIB_ENDPOINT_SECURE_ADDRESS
  • SIB_MQ_ENDPOINT_ADDRESS
  • SIB_MQ_ENDPOINT_SECURE_ADDRESS
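
To confirm from a client workstation that the bootstrap port is actually reachable through the firewall, a simple connectivity test can be used; the host name and ports below are taken from the example error message above, so substitute your own values:

telnet s6lpar50 2809
telnet s6lpar50 2825

If the connection is refused or times out, the port is still blocked between the client and the application server.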

Unable to trigger DataStage jobs from SAP

Question>
We are unable to trigger any DataStage jobs from SAP. When I try to invoke a DataStage job from SAP infopackage in a BW Load - PULL operation, I see this message in the log:

> cat RFCServer.2010_02_26.log
2 I: BW Open Hub Extract: Data ready for BW initiated job. InfoSpoke Request ID 306696. Fri Feb 26 09:45:00 2010
4 I: BW Load: Notifying push job request REQU_4GZ0F9G5SIAEE283M04XSUM9G is ready at Fri Feb 26 11:08:55 2010

But the DataStage job is not triggered. The info package keeps waiting forever to receive data but the DataStage job that provides data never gets triggered.

Answer>
Run jobReset to clean up the temporary files:
Open a command prompt window to the DS Server and:
1. source dsenv:
cd /Server/DSEngine
. ./dsenv
2. run the job reset utility
cd $DSSAPHOME/DSBWbin
./jobReset
where the fourth argument is one of MASTER | MASTERTEXT | TRANSACTION | MASTERHIERARCHY
for example:
cd /hpithome/IBM/InformationServer/Server/DSBWbin
./jobReset SAPBW8 PACK09 IS1 MASTER bwload
You should get a confirmation message: "jobReset - Successful"

Tips - IS Project Implementation problem

If there are any issues with connectivity, the logs are created under the directory listed below.

Check out the Designer logs for more details on the error you're seeing.

Client machine logs
C:\Documents and Settings\_Windows_Username_\ds_logs\dstage_wrapper_trace_N.log (where N cycles from 1 to 20, to give you the log files for the last 20 DataStage sessions)

tsortxxxxxx file size in Scratch directory

Question>
Is there any way to influence the size of the tsortxxxxx files that are created in the Scratch directory? We are running a PoC at a customer site and we have more than 70 million records that have to be sorted. This leads to more than 5000 tsortxxxx files being created in the Scratch directory. When these files are read in the next Join stage, the performance is extremely slow because of the need to access this very large directory. Each file has a size of approx. 10 MB. Is there any way to increase the file size, and what are the advantages and disadvantages of this approach?

Answer>

The size of the files is controlled through the "Limit Memory" option in the Sort stage. The default size is 20 MB. What you are actually controlling is the amount of memory allocated to that instance of the Sort stage on the node. The memory is used to buffer incoming rows, and once it is full the buffer is written to the sort work disk storage (in your case the Scratch directory). Increasing the size of the memory buffer can improve sort performance. 100-200 MB is generally a decent compromise in many situations.

Advantages:
-- A larger memory buffer allows more records to be read into memory before being written to disk temporarily. For smaller files, this can help keep sorts completely in memory.
-- The larger the memory buffer, the fewer I/O operations to the sort work disk, and therefore your sort performance increases. Also, a smaller number of files leads to fewer directory entries for the OS to deal with.

Disadvantages:
-- Larger memory buffer means more memory usage on your processing node(s). With multiple jobs and/or multiple sorts within jobs, this can lead to low memory and paging if not carefully watched.
-- This option is only available within the Sort stage...it is not available when using "link sorts" (sorting on the input link of a stage). If your job uses link sorts, you will have to replace them with Sort stages in order to use this option.

Be careful when tuning with this option so that you don't inadvertently overtax system memory. Adjust, test, and work towards a compromise that gives you acceptable performance without overloading the system.

Tuesday, April 13, 2010

Trimming to all Numeric

Question>
I have a phone number that has all kinds of special characters or alpha characters that can be mixed up in the field

I want to trim all of the non-numeric characters out of the field.

Anyone have any ideas as to how I can do this without creating many stage variables to do the trim?

Answer>
StripWhiteSpace(Convert("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-.", Space(56), Input.String))

If you have extra characters, just increment the number of spaces as needed.

Truncating logging tables in XMETA

Question>
Our customer forgot to purge logs in the DB, so in several days the LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F table grew to an enormous size. They switched to FS logging and set up log purging. Now they want to get rid of the data in that table.

I think it will be safe enough to delete rows from there (using db load of course) and then reorganize the table to free up the space. The table reorg will drop the table and recreate it, and this part worries me a bit. AFAIK there are no foreign keys, and if WAS is down I don't think there will be a problem dropping and recreating the table, but still... What is the official procedure for this task?

I saw an email about using LoggingAdmin.sh and then a reorg, but I wanted to be sure that the reorg (which will DROP the table) doesn't cause any problems (it's a production environment).

Answer>
You can use the purge function available in the Web Console, but this will take a long time to finish for this data size.

This is what I use to delete all logging events from the xmeta table:

1. Create an empty file called "empty.id".

2. Execute this command from the directory containing empty.id:

db2 import from empty.id of del replace into xmeta.logging_xmetagen_loggingEvent1466cb5f

Trying to improve performance generating a sorted seq file

Question>
I need to create a sorted seq file of about 31 million rows.
Running the Sort stage sequentially (with 167 columns, 3 of which are the sort keys) takes about 3 hours and 10 minutes to land all the data.

Is there any technique I can use to boost the performance?

Answer>
Just to let you know, I was able to drop the elapsed time of my test job from 3 hours to 1 hour for about 32 million rows.
What I've done is:
1. Remove the length of the varchar fields and decimal fields.
2. Convert the decimal fields to varchar (I got the same throughput converting decimal fields to double; the problem was the output format).
3. Create one big varchar field (without length) containing all the fields that are not part of the sort key.

RTLogging and ORLogging

With RTLogging = 1 you enable the old-style logging (in the UniVerse engine), and you have to disable repository logging (ORLogging = 0).

Question>
If I understand correctly, logs must be put in the UniVerse repository for job controls to operate correctly. Right?
What if the customer argues that the migration to xmeta seems to be incomplete?
What if the customer wants logging in the new repository and job controls to still operate?
BTW, it seems to me that it should work that way, no?

Answer>
No, the logs can be in either repository, and all APIs should work (give the same results) in either case. The RTLogging version is much faster and scales better - that should be the only difference

Question>
What are the pros & cons of ORLogging versus RTLogging?

Is it really that the repository cannot scale to support ORLogging, or do we just not know the right settings? I believe DB2/Oracle can handle hundreds of thousands of transactions per second with the right configuration.

Answer>
You are quite right about the database performance.

However, the implementation of ORLogging is not designed for performance. The actual table has many DBCLOB (double-byte character large object) columns - and this in reality prevents high performance.

As far as I know, DB2 does not hold large-object tables in the buffer pool - so every update to such a table will trigger I/O - and waits.

The DDL for the log table is:

CREATE TABLE "XMETA "."LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F" (
"XMETA_CREATED_BY_USER_XMETA" VARGRAPHIC(64) ,
"XMETA_CREATION_TIMESTAMP_XMETA" BIGINT ,
"XMETAMODIFICATIONTIMESTAMPXMET" BIGINT ,
"XMETA_MODIFIED_BY_USER_XMETA" VARGRAPHIC(64) ,
"XMETA_OPTIMISTIC_LOCK_ID_XMETA" INTEGER ,
"XMETA_REPOS_OBJECT_ID_XMETA" VARCHAR(64) NOT NULL ,
"CATEGORYNAME_XMETA" VARGRAPHIC(255) ,
"CIS10_XMETA" VARGRAPHIC(255) ,
"CIS11_XMETA" VARGRAPHIC(255) ,
"CIS12_XMETA" VARGRAPHIC(255) ,
"CIS13_XMETA" VARGRAPHIC(255) ,
"CIS14_XMETA" VARGRAPHIC(255) ,
"CIS15_XMETA" VARGRAPHIC(255) ,
"CIS1_XMETA" VARGRAPHIC(255) ,
"CIS2_XMETA" VARGRAPHIC(255) ,
"CIS3_XMETA" VARGRAPHIC(255) ,
"CIS4_XMETA" VARGRAPHIC(255) ,
"CIS5_XMETA" VARGRAPHIC(255) ,
"CIS6_XMETA" VARGRAPHIC(255) ,
"CIS7_XMETA" VARGRAPHIC(255) ,
"CIS8_XMETA" VARGRAPHIC(255) ,
"CIS9_XMETA" VARGRAPHIC(255) ,
"CIT1_XMETA" DBCLOB(536870912) LOGGED NOT COMPACT ,
"CIT2_XMETA" DBCLOB(536870912) LOGGED NOT COMPACT ,
"CIT3_XMETA" DBCLOB(536870912) LOGGED NOT COMPACT ,
"CIT4_XMETA" DBCLOB(536870912) LOGGED NOT COMPACT ,
"CIT5_XMETA" DBCLOB(536870912) LOGGED NOT COMPACT ,
"CIT6_XMETA" DBCLOB(536870912) LOGGED NOT COMPACT ,
"CONTAINER_RID" VARCHAR(64) ,
"CONTEXT_XMETA" DBCLOB(536870912) LOGGED NOT COMPACT ,
"DELETED_XMETA" SMALLINT ,
"LEVEL_XMETA" INTEGER ,
"MESSAGE_XMETA" DBCLOB(536870912) LOGGED NOT COMPACT ,
"MSGID_XMETA" VARGRAPHIC(255) ,
"SEQ_XMETA" BIGINT ,
"THROWABLEINFO_XMETA" DBCLOB(536870912) LOGGED NOT COMPACT ,
"TIMESTAMP_XMETA" BIGINT )
IN "XMETA" ;


Question>

Does anyone have exact customer scenarios where ORLogging failed because of performance? How many logging events were being logged at the moment of poor performance? What was the workload? Was there a purging mechanism in place? I would like to include the scenarios to test for IS 8.5.

Answer>
I had better answer this following my previous comments......... ;-)

In November 2008 I installed v8.1 on a Windows box (Windows Server 2003 Enterprise Edition). I don't remember the specs of the server - except that I am sure it had 16 GB of memory.

Of the Information Server products only DataStage was installed (no Information Analyzer, no Business Glossary etc.).

The installation was similar to the client's other systems running v8.0.1 - and was tested with existing jobs.

A couple of server jobs (imported from a v7.5 export) threw a warning for every row during testing. This made Director quite unresponsive, showing no log records for jobs in finished status, if it responded at all. It turned out LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F had around 800,000 rows. I deleted something like 780,000 records with:

delete FROM XMETA.LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F where cis13_xmeta = 'MultiBomFinal'

MultiBomFinal was the job responsible for around half a million log records - logs for a couple of other jobs were deleted as well.

Following this I switched the installation to RTLogging.

While the technical solution was straightforward, this did nothing to improve the client's impression of Information Server - and they were not too happy as it was (long story - mostly not our fault).

Question>
We have installed 8.1 at this client site and I have been here for a year. The response times have been pretty bad -- mainly due to the poor network at the client site.

I understand the difference between RTLogging and ORLogging -- thanks to techlist -- but we don't want to change it midway through this project and get into unexpected problems.

Over time the DEV system has grown to over 1000 jobs and logging is set to ORLogging -- hence, or otherwise, we are noticing poor performance overall.

Since it is the dev system, we want to improve the performance and we don't care about saving old/previous logs.

Are there any steps to clear out the logs completely while maintaining it as ORLogging? I am hoping that doing that will improve the performance a little bit. Can somebody point me to the documentation relating to this?

Answer>
If you do not care about saving logs, you can use either the Web Console or DB2 SQL commands to purge logs for the ORLogging setup.

Using the Web Console
=================
1. Log in to the Web Console and navigate to Log Management > Log Views. You should see the log view for your job; otherwise create one by clicking on New Log View.
2. Select the log view that you use to view your ORLogging events in xmeta and click "Purge Log" in the left navigation.

Using SQL
========

Here are the steps to remove all rows in the logging table.

As db2inst1 (the instance owner):

1) db2 "select count(*) from xmeta.LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F"

record the number of rows in this table

2) create an empty file called empty.txt

3) db2 "import from empty.txt of del replace into xmeta.LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F"

If you hit -964 (transaction log full), then you need to increase LOGFILSIZ, increase LOGPRIMARY, or a combination of both.
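
If you do hit -964, the transaction log configuration can be checked and raised with commands along these lines; the database name xmeta and the values shown are only examples, so size them for your environment (LOGFILSIZ changes take effect only after all applications disconnect and the database is deactivated and reactivated):

db2 get db cfg for xmeta | grep -iE "LOGFILSIZ|LOGPRIMARY|LOGSECOND"
db2 update db cfg for xmeta using LOGFILSIZ 10240 LOGPRIMARY 20
db2 terminate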

Wednesday, April 7, 2010

Searching Info Center - limiting the search scope doesn't work

Link
http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r1/topic/com.ibm.swg.im.iis.common.ic.doc/common/iiypcmic_personalizing_table_of_contents.html

Personalizing the information center by narrowing the search scope

You can narrow the search scope by product or focus area. Narrowing the search scope is helpful if you are working with only a subset of the products or features. Follow these steps to search only specific sections, based on the organization in the Contents view.

To narrow the search scope:

  1. Click the Search scope link next to the Search field. The Select Search Scope window opens.
  2. Select Search only the following topics and click the New button. The New Search List window opens.
  3. In the List Name field, type a name for your search filter. The OK button remains unavailable until you provide a list name.
  4. In the Topics to search list, select the categories that you want to include in your search. You can expand categories to select only certain subcategories.
  5. Click OK.
  6. Click Search only the following topics.
  7. Select the search list that you created. Click OK.
  8. Enter your search terms in the Search field. The results are limited to your search list.
  9. To modify the search scope or to remove the search filter, click Search scope again and modify the filter.
The search list that is currently active is shown at the top of the information center, next to the Search scope link. The search list that you used most recently persists across sessions.


Select Max for surrogate key in EE

The surrogate key stage in 8.0 can be used with DB sequences or with a flat file to manage the keys (set the 'Source Type' option to Flat File). Fortunately, seeding the flat file with the max value of the table is easy; unfortunately, it can't be done in the same job as the one that generates the keys.

There are 2 ways to do it --

1. Create a job that does a 'select max(..)' from the target table, connected to a surrogate key stage (set the 'Update Action' to 'Update'). This will seed the state file with the max value. In the job that needs the keys, use the surrogate key stage and this state file to generate keys (the SK stage takes 1 input link and produces 1 output link with the SK column appended). If this is the only thing that inserts rows into the table, you don't need to run the 'Update' job again; the state file will remember where it left off for next time.

2. When using a Flat File to manage keys, you can supply an initial value in the surrogate key stage. This value can be a job parameter, so you can hook this together with something that does the select max() and sets the job param in a job sequencer.

BTW -- the new surrogate key operator was designed for the parallel execution environment, so key generation is handled without the use of @PartitionNum or any of that. It also supports multiple jobs (running in parallel) that are getting keys from the same state file.

And finally... you can also use the surrogate key generation functionality directly in a transformer (rather than using the SKG stage). It requires a little set up in the transformer stage properties, then you can use the utility function 'NextSurrogateKey()' as the derivation for the SK column.

If any of this sounds like something that you want to try, let me know and I can set you up with some simple examples.

Send File as email attachment from Datastage?

Question
Does anyone have an example of sending a file as an email attachment from Datastage???

Answer

Check the links below to get some ideas:

http://www.shelldorado.com/articles/mailattachments.html

http://www.sharewareconnection.com/titles/uuencode.htm
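
Both links describe the classic uuencode approach; a minimal sketch of that idea, which could for example be called from an after-job ExecSH subroutine, looks like this (the file name, subject, and address are placeholders):

uuencode /tmp/report.csv report.csv | mailx -s "DataStage output file" someone@example.com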

Sequential file stage File Name Column option

Questions:
In the sequential file stage on the parallel canvas there is an option to specify a File Name column. This, in theory, allows you to read from multiple files (if the Read Method is set to "File Pattern") and to populate the file name in the specified outbound column.

However, if I specify a wildcard in the File Pattern such as C:\Test_Data\*, I don't get the individual file names in the outbound column - I get C:\Test_Data\* for the file name outbound column. That's rather lame and is a defect if you ask me - anyone else experience this? This is on a windows implementation.

Answer
By default the sequential file stage takes all the files returned by the pattern and cats them together reading from one big stream of data, so it is not possible to determine an individual file name for each record.

You can get the individual file names by setting APT_IMPORT_PATTERN_USES_FILESET. This will change the behavior of sequential file stage patterns so it will create a file set with the returned files. This has the advantages of better parallelism depending on configuration and leaves the file names available to populate a file name column.
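
For example, the variable can be added at the project level in the Administrator client or as a job-level environment variable; a boolean-style setting such as the following is the usual form (treat the exact value as an assumption and check your version's documentation):

APT_IMPORT_PATTERN_USES_FILESET=True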

Tuesday, April 6, 2010

Shared library (dsdb2.so) failed to load / IS 8.0 on AIX

Issues
main_program: Fatal Error: Fatal: Shared library (dsdb2.so) failed to load: errno = (2), system message = ( 0509-022 Cannot load module /opt/IBM/InformationServer/Server/DSComponents/bin/dsdb2.so.
0509-150 Dependent module libdb2.a(shr.o) could not be loaded.
0509-022 Cannot load module libdb2.a(shr.o).
0509-026 System error: A file or directory in the path name does not exist.
0509-022 Cannot load module /opt/IBM/InformationServer/Server/DSComponents/bin/dsdb2.so.
0509-150 Dependent module /opt/IBM/InformationServer/Server/DSComponents/bin/dsdb2.so could not be loaded.)

Resolution
It looks like you are using the DB2 plugin. I think you should put the DB2 library path in dsenv. You will need to stop and restart DataStage to pick up the changes in dsenv.
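
As a sketch of what that typically looks like on AIX (the DB2 instance home below is an assumption - use the sqllib path of your own DB2 client, and lib32 or lib64 depending on which libraries dsdb2.so needs), add lines like these to DSEngine/dsenv and then restart the engine from the DSEngine directory:

LIBPATH=$LIBPATH:/home/db2inst1/sqllib/lib32
export LIBPATH

. ./dsenv
bin/uv -admin -stop
bin/uv -admin -start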

SK generator gaps (keys are not created sequentially)

Resolution

1. It depends on what you mean, but I guess the simple answer is: whenever you use the file option with a block size > 1 - which is the second-worst-performing configuration for it.

2. However, IF you do not use the 'start from highest value' option, then the SKG will backfill the 'gaps' in subsequent runs.

Some Important Links (IS Specific) - IBM

Silent installation of DataStage clients

http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r1/topic/com.ibm.swg.im.iis.productization.iisinfsv.install.doc/topics/wsisinst_silent.html?resultof=%22%73%69%6c%65%6e%74%22%20

Installation & Uninstallation of DataStage

http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.productization.iisinfsv.install.doc/topics/wsisinst_uninstall_manual_windows.htmls

Minimum system requirements for Information Server 8.1 are documented here:

http://www.ibm.com/software/data/infosphere/info-server/overview/requirements.html


Thursday, March 4, 2010

DataStage Performance Tuning Tips

Some of the key factors for consideration:

  1. Stage the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files, both for optimum performance and for data recovery in case the job aborts.
  2. Tune the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
  3. Tune the 'Project Tunables' in Administrator for better performance.
  4. Use sorted data for the Aggregator.
  5. Sort the data as much as possible in the database and reduce the use of DS sorts for better job performance.
  6. Remove unused data from the source as early as possible in the job.
  7. Work with the DBA to create appropriate indexes on tables for better performance of DS queries.
  8. Convert some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
  9. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
  10. Before writing a routine or a transform, make sure the required functionality is not already available in one of the standard routines supplied in the SDK or DS utilities categories.
    Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
  11. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate unnecessary records even before joins are made.
  12. Tuning should occur on a job-by-job basis.
  13. Use the power of the DBMS.
  14. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
  15. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
  16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
  17. Minimise the usage of the Transformer (instead use Copy, Modify, Filter, or Row Generator).
  18. Use SQL code while extracting the data.
  19. Handle the nulls.
  20. Minimise the warnings.
  21. Reduce the number of lookups in a job design.
  22. Use not more than 20 stages in a job.
  23. Use the IPC stage between two passive stages; it reduces processing time.
  24. Drop indexes before data loading and recreate them after loading data into tables.
  25. Generally we cannot avoid lookups if our requirements make them compulsory.
  26. There is no hard limit on the number of stages (like 20 or 30), but we can break the job into smaller jobs and use Data Set stages to pass the data between them.
  27. The IPC stage is provided in server jobs, not in parallel jobs.
  28. Check the write cache of the hash file. If the same hash file is used both for lookup and as a target, disable this option.
  29. If the hash file is used only for lookup, then enable 'Preload to memory'. This will improve performance. Also check the order of execution of the routines.
  30. Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
  31. Use the 'Preload to memory' option in the hash file output.
  32. Use 'Write to cache' in the hash file input.
  33. Write into the error tables only after all the Transformer stages.
  34. Reduce the width of the input record - remove the columns that you will not use.
  35. Cache the hash files you are reading from and writing to. Make sure your cache is big enough to hold the hash files.
  36. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.


This would also minimize overflow on the hash file. (For server jobs:)

  1. If possible, break the input into multiple threads and run multiple instances of the job.


Monday, February 22, 2010

DataStage Director Can't open (IS 8.1)

Problem:
The job can be created using DS Designer and compiled as well, but while opening it in the Director, an error message like the following is thrown:
Error calling subroutine DSR_EXEJOB (Action=5); check DataStage is set up correctly in project SBIDEMO (The connection is broken (81002))

Cause:
The Director crashes with an 81002 error when a job is invoked and the log view is selected, even though you can log in to the Designer and create and compile a job. The issue occurs if the jobmon (PX engine job monitor) application crashes or is not running (in our case it crashed with nothing in its log file after running several jobs successfully) and is due to the changes introduced with 8.1 that direct DataStage logging to xmeta instead of the local DSEngine log files.

Resolution:
It is possible to revert to the old 8.0.1 functionality by setting the following project level
parameters in the DSParams file to the following values:

RTLogging=1
ORLogging=0

After making the switch back to DSEngine-level logging, an error will be logged in the job log indicating that the job monitor is not running, but the job will run to completion: main_program: Failed to initialize job monitoring. Monitor information will not be generated.
The job monitor can be started manually via the following steps while in the InformationServer/Server directory:

# . ./DSEngine/dsenv
# ./PXEngine/java/jobmoninit start


Tuesday, January 12, 2010

DataStage logging changes introduced at 8.1

Abstract
DataStage logging was changed at release 8.1 to log job run detail records into the operational repository (xmeta) rather than the local project-level log files (RT_LOGxxx) that were used for each job in prior releases. As a result of this change we have seen the following issues:

Job run times can increase and Information Server client applications may not be as responsive depending on the amount of log data that is generated by the jobs.

In some cases no job log entries are written to the log as viewed via the DataStage Director client or Web Console, even though jobs appear to run and in some cases job log detail entries may be missing or do not show up until later.

Job log purge operations may run slowly or fail depending on the amount of log entries.


Content
These issues can be worked around by reverting to the logging mechanism that was in place prior to release 8.1 by implementing the following project level changes on the DataStage engine server.
Edit the project level DSParams file, typically located in:
/opt/IBM/InformationServer/Server/Projects/%projectName% for Linux/UNIX and C:\IBM\InformationServer\Server\Projects\%projectName% for Windows, and modify the following 2 lines as shown to revert to pre-8.1 logging:

RTLogging=1
ORLogging=0

Keep in mind that newly created projects inherit their settings from the DSParams file that is located in /opt/IBM/InformationServer/Server/Template/DSParams for Linux/UNIX and C:\IBM\InformationServer\Server\Template\DSParams for Windows by default and that it should also be modified to ensure that new projects use the pre 8.1 logging mechanism.
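
To confirm which logging mode a given project is currently using, the two values can be checked directly in that project's DSParams file, for example:

grep -E "^(RTLogging|ORLogging)" /opt/IBM/InformationServer/Server/Projects/%projectName%/DSParams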

After switching to RTLogging, the existing log detail entries in the repository can still be viewed via the Web Console or Server Console, but they will not be accessible using the DataStage Director. These log entries should be purged when they are no longer required by scheduling a background purge of the log entries; up to 10,000 at a time is suggested to minimize the memory requirements and to ensure that we do not run into WebSphere Application Server out-of-memory issues by trying to purge all the log entries as one task.

The following commands will schedule a task to delete up to 10,000 log entries at a time and will loop until all the DataStage entries are purged. This could take some time depending on how many log entries you have to purge. This method has the advantage in that it can be run without stopping other activities.

Use the following command line (from the …InformationServer/ASBServer/bin directory) to create the scheduled auto-purge task (replace <user> and <password> with a suite admin user and password):

For Windows:

LoggingAdmin -user <user> -password <password> -create -schedule -name "DS job event purge task" -frequency -minutes 30 -threshold 10000 -percentage 100 -includeCategories IIS-DSTAGE-RUN

For Linux/UNIX:

./LoggingAdmin.sh -user <user> -password <password> -create -schedule -name "DS job event purge task" -frequency -minutes 30 -threshold 10000 -percentage 100 -includeCategories IIS-DSTAGE-RUN

After all DS log events have been purged, the scheduled task can be deleted with the following command line:

For Windows:

LoggingAdmin -user <user> -password <password> -delete -schedule -name "DS job event purge task"

For Linux/UNIX:

./LoggingAdmin.sh -user <user> -password <password> -delete -schedule -name "DS job event purge task"

APAR JR31806 has been submitted for this issue