
Thursday, 10 October 2013

Informatica


Informatica Overview
The Informatica Platform Simplifies Your ETL Processes
Initiate ETL Projects Quickly and Cost-Effectively
Serving as the foundation for all data integration projects, the Informatica Platform lets IT organizations initiate the ETL process from virtually any business system, in any format. As part of the Informatica Platform, Informatica PowerCenter delivers robust yet easy-to-use ETL capabilities that simplify the development and deployment of smaller departmental data marts and data warehouses. In addition, the ETL capabilities facilitate reuse from one project to another.
Enhance ETL with Universal Data Access Capabilities
PowerCenter improves the flexibility of your ETL process with the ability to extract more enterprise data-types than any other technology on the market. Complemented by Informatica PowerExchange and PowerCenter Options, PowerCenter delivers successful ETL initiatives with access to virtually any enterprise data-type, including:
  • Structured, unstructured, and semi-structured data
  • Relational, mainframe, file, and standards-based data
  • Message queue data
Automate Most ETL Processes for Fewer Errors and Greater Productivity
PowerCenter makes your ETL developers' jobs easier with cross-functional tools, reusable components, and an enterprise-wide platform that automates many ETL processes. For data warehousing and ETL developers, that means fewer ETL errors and emergency fixes, less risk of rework, faster development time, and greater productivity.
Features
PowerCenter Enterprise forms the foundation for all your data and enterprise integration initiatives—including data governance, data migration, and enterprise data warehousing—setting the standard for high-performance enterprise data integration and quality software. PowerCenter Enterprise scales to support large volumes of disparate data sources and meets demands for security and performance.
Development Agility
Collaborative,
Team-Based Development

PowerCenter provides a variety of graphical development environments designed for data integration developers as well as business users. The shared metadata repository allows groups of users to collaborate on integration projects, enabling rapid iteration cycles that result in significant time savings.
Prototype to Production
with a Click
Users can create virtual prototypes of reports and integration jobs without having to move data from its original sources. They'll be able to profile, integrate, and cleanse data on the fly, prototyping integration work in hours instead of weeks. With a few clicks, prototypes can be converted to a physical integration without recoding.
Automated Test Development
Typically, 30 percent of software development is spent on testing code. Data integration projects are no different. Only PowerCenter provides testing tools that automatically generate test cases—saving anywhere from 50 to 80 percent of testing effort while providing significantly more test coverage than manual approaches.

Management Confidence
Reliability and High Availability
Informatica customers count on PowerCenter Enterprise to run their critical business processes. Our high-availability products provide checkpoint recovery so that on the rare occasion that you do have a failure, we pick up the integration job right where it left off.
Traceability and Lineage
Maintaining code and tracking issues can be a daunting challenge when you're hand coding. With an extensible metadata repository that tracks data lineage for you, PowerCenter Enterprise traces the path of data back to the source or its final destination, providing the detailed documentation required by government regulations.
Proactive Monitoring
Only Informatica protects data integration projects with an automated early warning system that alerts your IT team as soon as processes or data quality deviate from the norm. PowerCenter Enterprise lets IT monitor workflows, sessions, and change control activities, and correlate events across multiple systems.
Informatica PowerCenter architecture
Informatica PowerCenter uses a client-server architecture containing several components, described in general terms below, and illustrated in Figure 1. You may find it useful to familiarize yourself with PowerCenter’s architecture before beginning the installation.
For a detailed description of the components that make up PowerCenter, see Chapter 1, “Product Overview,” in Informatica PowerCenter Getting Started.
Informatica PowerCenter contains the following components licensed for use with AX Datasource:
  • Informatica domain – The primary unit for management and administration of services in PowerCenter. Your license agreement restricts you to a single domain.
  • Node – A logical representation of a machine in a domain. The node that hosts the domain is the master gateway for the domain. Your license agreement restricts you to a single node.
  • Informatica Services – A Windows service that starts the Service Manager on a node.
  • Service Manager – Starts and runs the application services on a machine in a domain.
  • Integration Service – Reads workflow information from the PowerCenter repository, and runs sessions and workflows that extract, transform, and load data.
  • Repository Service – Manages connections to the PowerCenter repository.
  • Informatica Administrator – A Web application for managing the Informatica domain, PowerCenter security, and the PowerCenter repository.
  • Informatica domain configuration database – Stores the information (metadata) related to the configuration of the Informatica domain.
  • PowerCenter repository – Stores the information (metadata) required to extract, transform, and load data. Resides in a relational database.
  • PowerCenter Client, which consists of:
    • Designer – Allows you to define sources and targets, and create mappings with transformation instructions, for use in workflows.
    • Workflow Manager – Allows you to create, schedule, and run workflows.
    • Workflow Monitor – Allows you to monitor scheduled and running workflows.
    • Repository Manager – Allows you to administer the PowerCenter repository: assign permissions to users and groups, manage folders, and view PowerCenter repository metadata.

Name changes

The following name changes have occurred:

Previous name (in Informatica PowerCenter 8.6.1) → New name (in Informatica PowerCenter 9.0.1)
  • PowerCenter Server → PowerCenter
  • PowerCenter Administration Console → Informatica Administrator
  • PowerCenter domain → Informatica domain

Deployment Flexibility
Get the Connectivity You Deserve
Whether it's structured data in a database, unstructured data like emails or PDF files, social media data in the Cloud, or enterprise applications like SAP or Oracle applications, PowerCenter Enterprise has a high-speed connector that makes data integration quick and easy.
Map Once, Deploy Anywhere
Only Informatica provides a single graphical environment that lets developers create data integration and quality mappings that can be implemented across a variety of technologies. Powered by Vibe™, PowerCenter Enterprise gives you the flexibility to deploy virtually, on traditional ETL engines, or even on Hadoop, without any recoding.
Meet Your Data Delivery Needs
Not all data is created equal—and neither are your data delivery needs. PowerCenter Enterprise provides a wide variety of technologies for performing data integration, meeting needs from big data and batch to real-time ultra-messaging for high-speed trading.
  • Advanced XML Data Integration Option – Enables real-time access to hierarchical data otherwise locked in XML files and messages.
  • Data Integration Analyst Option – Available for Informatica PowerCenter and Informatica Data Services; empowers business analysts to perform data integration tasks themselves while IT retains control of the overall data integration process.
  • Data Validation Option – Reduces the time and costs of upgrade testing, data integration project testing, and production data auditing and verification by up to 90%, with no programming skills required.
  • Enterprise Grid Option – Adds PowerCenter's native data integration grid capabilities, including partitioning and high availability, for more cost-effective performance, dynamic scalability, and reliability.
  • High Availability Option – Minimizes service interruptions during hardware and/or software outages.
  • Metadata Exchange Option – Provides access to technical and business metadata from third-party data modeling tools, business intelligence software, and source and target database catalogs.
  • Partitioning Option – Helps IT organizations take advantage of parallel data processing in multiprocessor and grid-based hardware environments.
  • Pushdown Optimization Option – Enables data transformation processing, where appropriate, to be pushed down into relational databases or appliances to improve overall performance and throughput.
  • Unstructured Data Option – Expands PowerCenter's data access capabilities to include unstructured data formats, providing virtually unlimited access to all data formats.
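The pushdown idea behind the Pushdown Optimization Option can be sketched in a few lines. This is an illustrative analogy, not Informatica's implementation: the function and table names (`transform_in_engine`, `sales`) are made up for the example.

```python
# Toy sketch of pushdown optimization: instead of pulling every row out of
# the database and filtering/aggregating inside the ETL engine, the
# transformation is translated to SQL and executed inside the database,
# so far fewer rows cross the network.

def transform_in_engine(rows):
    """ETL-engine approach: every row travels to the engine first."""
    return sum(r["amount"] for r in rows if r["region"] == "EU")

def transform_pushed_down():
    """Pushdown approach: ship the logic to the database as SQL."""
    return "SELECT SUM(amount) FROM sales WHERE region = 'EU'"

rows = [{"region": "EU", "amount": 10},
        {"region": "US", "amount": 99},
        {"region": "EU", "amount": 5}]
print(transform_in_engine(rows))   # 15, computed after moving all rows
print(transform_pushed_down())     # the SQL the database would run instead
```

The trade-off is the same one the option description names: when the target database can do the work, pushing the transformation down avoids moving raw data at all.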

Components under Workflow Manager Tool

Here is a concise idea about the Workflow Manager in Informatica.

1.  Workflow: This is the top-level object; the entire data-loading process has to be defined under a workflow. It is like a mapping in that it integrates different kinds of tasks as a unit.

2.  Task: A task is an individual process that performs a specific activity during data loading. There are 10 different kinds of tasks that can be grouped under a workflow:

     2.1 Session:
This is a compulsory task for data loading.
A session is a running instance of a mapping. For one mapping we can create one or more sessions; generally one session per mapping is enough, but for parallel data loading we may create multiple sessions.

     2.2 Command:
            Executes operating system commands or programs. For example, if we need to inform all users about the data loading process, we can write a shell script at the operating system level and execute it via a Command task just before the session runs.

     2.3 Email:
            Sends emails to users via a mail server (if configured). This job can also be done via a Command task, but the Email task is an integrated part of the Workflow Manager and much simpler to use.

     2.4 Decision:
            Evaluates a condition based on other tasks' values to decide the next course of action. It works like an IF statement.

     2.5 Control:
            Controls the flow of tasks within the workflow. For example, if control should not reach a specific task (such as a Command task) when a condition fails, we can use a Control task.

     2.6 Event Wait:
            Waits for a defined event; when that event fires (activates), the process continues.

     2.7 Event Raise:
            Fires (activates) an event explicitly.

     2.8 Assignment:
            Assigns values to parameters and variables used within the workflow.

     2.9 Timer:
            Specifies a time of execution (or a delay) for a task.

     2.10 Worklet:
            Defines a reusable set of tasks. If we need to execute the same set of tasks again and again under different workflows, it is better to define them as a worklet and reuse it across workflows.
Note:
Three types of tasks (Session, Command, and Email) can be defined as reusable tasks. A reusable task is created as an independent object and then used within a workflow or worklet; a task created directly inside a workflow or worklet is non-reusable.
So if a task is needed only once within a single workflow or worklet, create it as non-reusable; otherwise create it as a reusable (independent) task.

A Worklet can also be defined as a reusable task via a separate menu interface.
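The task types above can be sketched as a tiny Python model. This is an analogy only: workflows are really built in the Workflow Manager GUI, and class names like `Session` and `Decision` here are stand-ins for those objects, not an Informatica API.

```python
# Illustrative model of a workflow: an ordered set of tasks, where a
# Session loads data, a Decision evaluates a condition like an IF, and
# the runner halts the flow (Control-like behaviour) when a task fails.

class Task:
    def __init__(self, name):
        self.name = name
    def run(self, context):
        raise NotImplementedError

class Session(Task):
    """A running instance of a mapping: extract, transform, load."""
    def run(self, context):
        context[self.name] = "loaded"   # pretend the load succeeded
        return True

class Decision(Task):
    """Evaluates a condition over earlier task results, like an IF."""
    def __init__(self, name, condition):
        super().__init__(name)
        self.condition = condition
    def run(self, context):
        return self.condition(context)

def run_workflow(tasks, context=None):
    """Run tasks in order; stop when a task (e.g. a Decision) fails."""
    context = {} if context is None else context
    for task in tasks:
        if not task.run(context):
            break                       # control does not reach later tasks
    return context

wf = [Session("load_customers"),
      Decision("check", lambda ctx: ctx.get("load_customers") == "loaded"),
      Session("load_orders")]
result = run_workflow(wf)
# result holds entries for both sessions because the decision passed
```

If the Decision's condition returned false, `load_orders` would never run, which is exactly the Decision/Control behaviour described above.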

Friday, 4 October 2013

Teradata

Teradata Overview

Teradata is an enterprise software company that develops and sells a relational database management system (RDBMS) of the same name. In February 2011, Gartner ranked Teradata as one of the leading companies in data warehousing and enterprise analytics. Teradata was a division of the NCR Corporation, which acquired Teradata on February 28, 1991. Teradata's revenues in 2005 were almost $1.5 billion with an operating margin of 21%. On January 8, 2007, NCR announced that it would spin off Teradata as an independently traded company, and this spin-off was completed October 1 of the same year, with Teradata trading under the NYSE stock symbol TDC.[6]

The Teradata product is referred to as a "data warehouse system" that stores and manages data. The data warehouses use a "shared nothing" architecture, which means that each server node has its own memory and processing power. Adding more servers and nodes increases the amount of data that can be stored. The database software sits on top of the servers and spreads the workload among them. Teradata sells applications and software to process different types of data. In 2010, Teradata added text analytics to track unstructured data, such as word processor documents, and semi-structured data, such as spreadsheets.
Teradata's product can be used for business analysis. Data warehouses can track company data, such as sales, customer preferences, product placement, etc.

Teradata is made up of the following components –

Processor Chip – The processor is the BRAIN of the Teradata system. It is responsible for all the processing done by the system; all tasks are done according to the direction of the processor.

Memory – The memory is known as the HAND of the Teradata system. Data is retrieved from the hard drives into memory, where the processor manipulates or alters it. Once changes are made in memory, the processor directs the information back to the hard drive for storage.

Hard Drives – These are known as the SPINE of the Teradata system. All the data of the Teradata system is stored on the hard drives; the size of the hard drives reflects the size of the Teradata system.

Teradata has Linear Scalability
One of the most important assets of Teradata is its linear scalability. There is no fixed limit on the size of a Teradata system; we can grow it as much as we want. Any time you want to double the speed of a Teradata system, just double the number of AMPs and PEs. This is better explained with the help of an example:

- Teradata takes every table in the system and spreads its rows evenly among the AMPs. Each AMP works on the portion of records it holds.

- Suppose an EMPLOYEE table has 8 different employee IDs. In a 2-AMP system each AMP will hold 4 rows on its DISK to accommodate the total of 8 rows.

2 AMP SYSTEM

At the time of data retrieval each AMP works on its own DISK and sends its 4 rows to the PE for further processing. If we suppose one AMP takes 1 microsecond (µs) to retrieve 1 row, then the time taken to retrieve 4 rows is 4 µs. And since the AMPs work in parallel, both AMPs together retrieve all 8 records in just 4 µs (4 µs for each AMP).

Now we double the AMPs in our system, for a total of 4 AMPs. As Teradata distributes the records evenly among all AMPs, each AMP will now store 2 records of the table.

4 AMP SYSTEM

Now, on the same time scale, the time taken by each AMP to retrieve 2 records is 2 µs.
So all 4 AMPs, working in parallel, will retrieve the 8 records in just 2 µs, which previously took 4 µs on the 2-AMP system.

Hence we double our speed by doubling the number of AMPs in our system.

This is the power of parallelism in Teradata. It is also known as the 'DIVIDE and CONQUER' approach: we divide the work equally and get the result faster. To achieve the desired speed we can increase the number of AMPs accordingly.
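The arithmetic of the example above can be captured in a few lines. This is a minimal sketch of the parallel timing model only, with the same made-up figure of 1 µs per row; it is not a model of real Teradata internals.

```python
# Divide-and-conquer timing: rows are spread evenly across AMPs and the
# AMPs scan in parallel, so elapsed time is set by the rows on one AMP.

import math

def retrieval_time(total_rows, num_amps, time_per_row=1):
    """Elapsed retrieval time in µs when AMPs work in parallel."""
    rows_per_amp = math.ceil(total_rows / num_amps)  # even distribution
    return rows_per_amp * time_per_row

print(retrieval_time(8, 2))  # 2-AMP system: 4 rows per AMP -> 4 µs
print(retrieval_time(8, 4))  # 4-AMP system: 2 rows per AMP -> 2 µs
```

Doubling `num_amps` halves the elapsed time, which is exactly the linear scalability claim: speed grows with the number of AMPs, not with faster individual components.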

Partition Primary Index – Advantage and Disadvantage

Advantage of Partition Primary Index –
  • The Partitioned Primary Index is one of the unique features of Teradata: it distributes rows into different partitions so that they can be retrieved much faster than with a conventional approach.
  • Maximum partitions allowed by Teradata – 65,535 (this limit may be higher in later releases).
  • It also reduces the overhead of scanning the complete table (a full table scan, or FTS), thus improving performance.
  • In PPI tables a row is hashed normally on the basis of its PI, but the actual storage of the row on the AMP takes place only in its respective partition. In other words, rows are sorted first on the basis of their partition column and then, inside that partition, by their row hash.
  • Usually PPIs are defined on a table in order to increase query efficiency by avoiding full table scans without the overhead and maintenance costs of secondary indexes.
  • Deletes on a PPI table are much faster.
  • For range-based queries we can effectively remove the SI and use the PPI instead, saving the overhead of the SI subtable.

Disadvantage of Partition Primary Index –
  • PPI rows are 2 bytes longer, so the table will use more PERM space.
  • If we have defined an SI on a PPI table, then as usual the size of the SI subtable will also increase by 2 bytes for each referenced row ID.
  • A PI access can be degraded if the partitioning column is not part of the PI. For example, a query specifying a PI value but no value for the PPI column must look in each partition of the table, thus losing the advantage of using the PI in the WHERE clause.
  • Joins of a PPI table to non-partitioned tables may be degraded. If one table is partitioned and the other is not, a sliding-window merge join takes place.
  • The PI can't be defined as UNIQUE when the partitioning columns are not part of the PI.
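The partition-elimination benefit described above can be illustrated with a toy model. This is a deliberately simplified sketch: real Teradata first hashes rows to AMPs and then orders them by partition and row hash, whereas here we only model "a range query touches just the matching partitions"; the class and column names are invented for the example.

```python
# Toy illustration of PPI behaviour: rows are grouped by a partition key,
# so a range query scans only the requested partitions instead of the
# whole table (avoiding a full table scan).

from collections import defaultdict

class PPITable:
    def __init__(self, partition_fn):
        self.partition_fn = partition_fn   # e.g. order row -> sale month
        self.partitions = defaultdict(list)
        self.scanned = 0                   # rows touched by the last query

    def insert(self, row):
        self.partitions[self.partition_fn(row)].append(row)

    def range_query(self, wanted_partitions, predicate):
        """Scan only the requested partitions (partition elimination)."""
        self.scanned = 0
        hits = []
        for p in wanted_partitions:
            for row in self.partitions.get(p, []):
                self.scanned += 1
                if predicate(row):
                    hits.append(row)
        return hits

# 12 order rows, partitioned by month of sale
t = PPITable(partition_fn=lambda row: row["month"])
for m in (1, 2, 3):
    for i in range(4):
        t.insert({"month": m, "amount": 100 * i})

hits = t.range_query([2], lambda r: r["amount"] >= 200)
# Only the 4 rows of month 2 were scanned, not all 12
```

A query that cannot name its partitions (the "PI value but no PPI column value" case above) would have to pass every partition to `range_query`, which is exactly how the advantage is lost.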
Technology and product
Teradata is a massively parallel processing system running a shared-nothing architecture. Its technology consists of hardware, software, database, and consulting. The system moves data into a data warehouse where it can be recalled and analyzed.
The systems can be used as backup for one another during downtime, and in normal operation balance the workload across themselves.
In 2009, Forrester Research issued a report, "The Forrester Wave: Enterprise Data Warehouse Platform," by James Kobielus, rating Teradata the industry's number one enterprise data warehouse platform in the "Current Offering" category.
Marketing research company Gartner Group placed Teradata in the "leaders quadrant" in its 2009, 2010, and 2012 reports, "Magic Quadrant for Data Warehouse Database Management Systems".
Teradata is the most popular data warehouse DBMS in the DB-Engines database ranking.
In 2010, Teradata was listed in Fortune's annual list of Most Admired Companies.

Active enterprise data warehouse

Teradata Active Enterprise Data Warehouse is the platform that runs the Teradata Database, with added data management tools and data mining software.
The data warehouse differentiates between “hot and cold” data – meaning that the warehouse puts data that is not often used in a slower storage section. As of October 2010, Teradata uses Xeon 5600 processors for the server nodes.
Teradata Database 13.10 was announced in 2010 as the company’s database software for storing and processing data.
Teradata Database 14 was sold as the upgrade to 13.10 in 2011 and runs multiple data warehouse workloads at the same time. It includes column-store analyses.
Teradata Integrated Analytics is a set of tools for data analysis that resides inside the data warehouse.

Backup, archive, and restore

BAR is Teradata’s backup and recovery system.
The Teradata Disaster Recovery Solution is automation and tools for data recovery and archiving. Customer data can be stored in an offsite recovery center.

Platform family

Teradata Platform Family is a set of products that includes the Teradata Data Warehouse, Database, and a set of analytic tools. The platform family is marketed as smaller and less expensive than the other Teradata solutions.