Analytics & Information Management: September 2013

Datastage Overview

What Is DataStage?

Sometimes DataStage is sold to and installed in an organization and its IT support staff are expected to maintain it and to solve DataStage users' problems. In some cases IT support is outsourced and may not become aware of DataStage until it has been installed. Then two questions immediately arise: "what is DataStage?" and "how do we support DataStage?".

This white paper addresses the first of those questions, from the point of view of the IT support provider. Manuals, web-based resources and instructor-led training are available to help to answer the second. DataStage is actually two separate things.

Ø In production (and, of course, in development and test environments) DataStage is just another application on the server, an application which connects to data sources and targets and processes ("transforms") the data as they move through the application. Therefore DataStage is classed as an "ETL tool", the initials standing for extract, transform and load respectively. DataStage "jobs", as they are known, can execute on a single server or on multiple machines in a cluster or grid environment. Like all applications, DataStage jobs consume resources: CPU, memory, disk space, I/O bandwidth and network bandwidth.

Ø DataStage also has a set of Windows-based graphical tools that allow ETL processes to be designed, the metadata associated with them managed, and the ETL processes monitored. These client tools connect to the DataStage server because all of the design information and metadata are stored on the server. On the DataStage server, work is organized into one or more "projects". There are also two DataStage engines, the "server engine" and the

"Parallel engine".

The server engine is located in a directory called DSEngine whose location is recorded in a hidden file called /.dshome (that is, a hidden file called .dshome in the root directory) and/or as the value of the environment variable DSHOME. (On Windows-based DataStage servers the folder name is Engine, not DSEngine, and its location is recorded in the Windows registry rather than in /.dshome.)

DataStage Engines

The server engine is the original DataStage engine and, as its name suggests, is restricted to running jobs on the server. The parallel engine results from acquisition of Orchestrate, a parallel execution technology developed by Torrent Systems, in 2003. This technology enables work (and data) to be distributed over multiple logical "processing nodes" whether these are in a single machine or multiple machines in a cluster or grid configuration. It also allows the degree of parallelism to be changed without change to the design of the job

Design-Time Architecture

Let us take a look at the design-time infrastructure. At its simplest, there is a DataStage server and a local area network on which one or more DataStage client machines may be connected. When clients are remote from the server, a wide area network may be used or some form of tunnelling protocol (such as Citrix MetaFrame) may be used instead.

InfoSphere DataStage provides these features and benefits:

• Powerful, scalable ETL platform—supports the collection, integration and transformation of large volumes of data, with data structures ranging from simple to complex.

• Support for big data and Hadoop—enables you to directly access big data on a distributed file system, and helps clients more efficiently leverage new data sources by providing JSON support and a new JDBC connector.

• Near real-time data integration—as well as connectivity between data sources and applications.

• Workload and business rules management—helps you optimize hardware utilization and prioritize mission-critical tasks.

• Ease of use—helps improve speed, flexibility and effectiveness to build, deploy, update and manage your data integration infrastructure.

• Rich support for DB2Z and DB2 for z/OS—including data load optimization for DB2Z and balanced optimization for DB2 on z/OS

Powerful, scalable ETL platform

• Manages data arriving in near real-time as well as data received on a periodic or scheduled basis.

• Provides high-performance processing of very large data volumes.

• Leverages the parallel processing capabilities of multiprocessor hardware platforms to help you manage growing data volumes and shrinking batch windows.

• Supports heterogeneous data sources and targets in a single job including text files, XML, ERP systems, most databases (including partitioned databases), web services, and business intelligence tools.

Support for big data and Hadoop

• Includes support for IBM InfoSphere BigInsights, Cloudera, Apache and Hortonworks Hadoop Distributed File System (HDFS).

• Offers Balanced Optimization for Hadoop capabilities to push processing to the data and improve efficiency.

• Supports big-data governance including features such as impact analysis and data lineage.

Workload and business rules management

• Helps enable policy-driven control of system resources and prioritization of different classes of workloads.

• Helps you optimize hardware utilization and prioritize tasks, control job activities where resources exceed specified thresholds, and assess and reassign the priority of jobs as they are submitted into the queue.

• Integrates with IBM Operational Decision Management (formerly ILOG JRules), allowing you to implement decision logic within IBM InfoSphere Information Server.

Near real-time data integration

• Captures messages from Message Oriented Middleware (MOM) queues using Java Message Services (JMS) or WebSphere MQ adapters, allowing you to combine data into conforming operational and historical analysis perspectives.

• Provides a service-oriented architecture (SOA) for publishing data integration logic as shared services that can be reused over the enterprise.

• Can simultaneously support high-speed, high reliability requirements of transactional processing and the large volume bulk data requirements of batch processing.

Ease of use

• Includes an operations console and interactive debugger for parallel jobs to help you enhance productivity and accelerate problem resolution.

• Helps reduce the development and maintenance cycle for data integration projects by simplifying administration and maximizing development resources.

• Offers operational intelligence capabilities, smart management of metadata and metadata imports, and parallel debugging capabilities to help enhance productivity when working with partitioned data.

Ab Initio Overview

Ab Initio is suite of applications containing the various components, but generally when people name Ab Initio, they mean “Ab Initio Co>operation system”, which is primarily a GUI based ETL Application. It gives user the ability to drag and drop different components and attach them, quite akin to drawing

The strength of Ab Initio-ETL is massively parallel processing which gives it capability of handling large volume of data

Ab Initio software companies are located in :-

Let’s componentize Ab Initio

Co>operation System

EME(Enterprise Meta>Environment)

Additional Tools

Data profiler

Plan-IT etc

Co>operating System is ETL application; it comes packaged with EME (mentioned in next paragraph). This is GUI based application. Quite simple is design due to drag and drop features, most of the features are quite basic and so basic learning curve is quite steep. Now it has further two flavor or sub classes:

Batch Mode

Continuous Flow

The both primarily doing the similar things, but classically different in mode of processing as the name suggested. The “Batch Mode” is primarily used by most of costumer gives the benefit of moving bulk data (daily/multiple times a day).

Continuous mode is more like “Click/Trigger” driven; say when you click on a web page the data flow starts, some of very large web based application run on Ab Initio server using Continuous flow

EME is more like source control for Ab Initio, but it has many additional features like

Meta data management

Business Metadata management

Process metadata management

Impact Analysis

Documentation tools

Run History Tracking

And surely Check-in and check-out

Ab Initio has come up with certain other application to complement the ETL suite; I will not be covering these in details, just one liner

Data profiler – It is data profiling tool, got the features for data quality analysis

Plan-IT – It is primarily a scheduler built by Ab Initio to run Ab Initio jobs. It can be integrated with Ab Initio jobs.

Strength and weakness based on the following

· Cost to purchase

· Total Cost of Ownership

· Platform (OS and DBMS)

· Ease of use (wizards, drag & drop, etc)

· Learning curve

· Performance

· Available expertise

· Ab Initio Support and Other Resources

Cost of Purchase – It is one of the costliest ETL tool in the market, with cost ranging from 500k to 5M, it depends upon the number of servers Ab Initio is installed, number of developer license and type of license, and batch flow is comparatively cheaper than continuous flow.

Comparing it with other major ETL tools like Informatica with similar functionalities, the pricing difference will be clearly evident

Total Cost of Ownership – The cost of ownership comes in 3 parts

Annual maintenance charges

Cost of employing/training Ab initio resources

Development cost

Annual Maintenance charges – It is generally the percentage of initial cost and it is significant due to high initial cost. This number may differ based on the NDA and initial investment. A rough 10% maintenance charges is significant outflow.

Development cost – covered under training and resources

Available Expertise/Ab Initio Resources (training/employing) -

It is high end tool, so the developer community is not massive like many open source application, so employing these resources come with premium price.

Additionally Ab Initio is such a close community, so if you are ETL developer and want to explore/learn Ab Initio generally you will hit a wall and as I recall really there are 2 options

Work for an Organization who own Ab initio

There are only handful of organization who train in Ab Initio, so pay a premium to join the club

Platform – Like most of other ETL tools, it can work in various platforms.

On Database front, it can connect to all the major databases available in the market. So there is nothing to choose between this tool with respect to others. It allows connection to DB either by ODBC client or native mode, I believe some of other ETL tools may not have native mode supported

Ease of use – Being GUI based it is easy to use, simple component s, drag and drop and various indicators if connections are not completely made. In comparison to other tools, there is nothing much to choose in that end.

Creating custom based components and re-using those is one feature, I really liked in Ab Initio

There are certain set of components which are difficult to use and may require bit of scripting experience, but it is okay

Learning Curve – Learning curve is quick to start with; the difficult components can take some time. Ab Initio has designed certain components very cleverly, it takes bit of experience to utilize those optimally and take bit of time. I guess learning time of about 15 man days for a programmer with about 2-3 years of experience will give enough fluency in designing application

Ab Initio Support and other resources –

Covering both topics in one go – Ab Initio is treating their application like a fort/sacred book, with little information and literature available on the market, so as a consequence, there is not enough information material on web.

Not enough resources on web

Hit productivity of team, when struck with a technical/design issue

Unavailability of training material hurt training new employees

Just like any application/scripting language, there is potential of not using the application optimally, I really believe lack of proper training and open discussion has hit Ab Initio developers really hard, where they missing and still groping in dark having following set of problems

No Access or standard set of best practices. Organization tend to have their own best practices if any and under tough external review generally these will fall much short of best

Missing Input from other communities having parallel functionalities and smaller developer pool restricts better ideas/inputs

Though Ab Initio provides training for users (costumers), but it does not cover each and every aspect of Ab Initio and advanced training comes with a cost,

Ab Initio provide support to their customers, it is of decent quality, but it takes generally long turnaround time

Performance –

Used for :-

· Massive data processing, where time is of essence and performance and through put is critical, Ab Initio stands head and shoulder above others

· Available data can be split and processed in parallel giving it huge processing advantage.

· Theoretically, it is possible to design a system using Ab Initio architect where any additional processing power can be achieved by adding additional resources in parallel, thus allowing any scale-up easy and possible

· Ab Initio components like compressed indexed files and similar gives Ab initio an edge when dealing with huge dataset. Though this concept is not unheard of, in past, but Ab Initio implemented it successfully

· Some new scripting features known as PDL (Program definition Language) in Ab Initio allows flexibility, which is quite well received by Ab Initio developers and not easily available in other ETL tools

· Ab Initio has put some effort in component design taking care of small issues like memory management/memory foot print. Though these are not critical essentially, but in time critical system, these provide an edge

HOPE THIS WILL HELP :) by Gururajv007

Analytics & Information Management

Like us on Facebook

No of Viewers

Thursday, 26 September 2013

Datastage

Wednesday, 25 September 2013

Ab Initio