Program Training‎ > ‎

Data Warehouse with Pentaho Data Integration

posted Jul 8, 2011, 10:34 AM by Editor KampusBI   [ updated Jul 8, 2011, 10:41 AM ]


Data analysis as part of business intelligence solutions is a growingly demanded needs. Unfortunately in most transactional systems, rarely data is organized in an effective and efficient aggregates. In those cases, producing an analytical report require an exhaustive and time consuming efforts. 
Thus, to successfully build a comprehensive BI solution we need to transform our transactional data into analytical friendly format, i.e. Multi Dimensional Modelling. The transformation is usually stored in a data warehouse. 

Of course, usually in data warehouse process we use ETL (Extract, Transform and Load) to help us in productivity. Pentaho Data Integration (Kettle) is one of the most popular ETL tool and is open sourced. 

This course presents an overview of data warehouse, multi dimensional, ETL concept, the extensive use of Pentaho Data Integration to build a powerful data warehouse solution. The practical approach of this course involved many cases we face in our daily jobs so that we can get a highly valuable skills from attending the course.

Who Should Attend ? 

This  course  is  designed  for  those  new  to  Data  Warehouse  and  ETL  or  need  to 
understand the basics of Pentaho Data Integration. 


At the completion of this course, attendee should be able to : 
  • Understand  the  concepts  and  topics  of  Data  Warehouse,  Dimensional  Modeling,  OLAP and ETL 
  • Use Pentaho Data Integration to build simple jobs / transformations 
  • Consume data from several data sources  
  •  Building and Populating Fact and Dimensional tables
  • Troubleshooting techniques 
  • Schedule job / transformation 

Course Duration 

5 days / 30 hours 

Course Prerequisites 

Basic understanding one of several popular DBMS (Oracle, SQL Server, MySQL, etc) and of Structured Query Language (SQL)  
No understanding of other Pentaho tools is needed 

Course Requirements

  • PC or Laptop with minimum of 2GHz CPU, 1 GB of RAM, DVD Drive and 2 GB of available hard disk space. 
  •  Softwares : 
    • Microsoft Windows XP Pro 
    • Java Runtime Environment (JRE) 
    • Apache Tomcat 
    • MySQL 5.0 Database Server 
    • Pentaho Data Integration 
    • Pentaho Data Analysis (Mondrian) 

Course Outline

  1. Introduction to Data Warehouse 
    • Data Warehouse.
    • Online  Transaction  Processing  (OLTP)  and  Online  Analytical  Processing (OLAP). 
    • Data Warehouse and OLAP. 
    • Delivering Solution with ETL (Extract, Transform, Load) Tool. 
  2. Installation and Configuration 
    • Java Runtime Environment / Java Development Kit. 
    • Pentaho Data Integration. 
    • XAMPP package (Apache HTTP Server and MySQL). 
    • SQLYog – a GUI based mysql client. 
    • Data and Script samples. 
  3. Short Introduction to MySQL 
    • MySQL Storage Engines.
    • Administering MySQL via PHPMyAdmin.
    • PHI-Minimart sample database installation.
  4. Pentaho Data  Integration (Kettle) 
    • Introducing Kettle as Pentaho’s ETL Suite
    • Architecture
    • Components 
      • Spoon : Graphical UI Designer for job / transformation steps
      • Pan : Command line batch script for transformation execution
      • Kitchen : Command line batch script for transformation execution
      • Carte : Cluster server 
    • Job / Transformation 
    • Step and Hop.
    • Row and Meta Data.
    • Relation between job and transformation.
  5. Getting Started with Spoon
    • File system and RDBMS based Repository 
    • Spoon Development Environment 
    • Database Connections
    • Job and Transformation
    • Creating job
    • Creating transformation
    • Calling transformation from job
    • Configuring Log
  6. Multi Dimensional Modelling 
    • Normalized versus Multi Dimensional Model
    • Fact and Dimension Tables
    • Star Schema and Snowflake Schema
    • Tasks :
      • Create a Kettle  transformation  to map PHI-Minimart  transactional database sample to dimensional modeling database.
      • Create logs for each steps.
  7. Change Data Capture (CDC)
    • What is CDC ?
    • Why CDC is so hard that heavily relied on data source ?
    • SQL Server 2008’s CDC feature demonstration. 
    • Tasks :
      • Create a Kettle  transformation  to map PHI-Minimart  transactional database  sample to dimensional modeling database.
      • Create logs for each steps.
  8. Slowly Changing Dimension (SCD)
    • Slowly Changing Dimension to solve master data historical problems. 
    • SCD Types 
    • Use of Kettle’s step to solve several SCD types with several schema : 
      • Insert / Update 
      • Punch Through
  9. Orphan / Late Arrival
    • What is Late Arrival Dimension?
    • Typical Situations where Late Arrival occurs.
    • Best practice of Late Arrival’s handling.
  10. OLAP View of Multidimensional Data (Mondrian / JPivot) 
    • Mondrian Installation
    • Creating scheme based on our fact and dimension tables 
    • View and navigate our Cube using Web Browser
  11. Data Staging 
    • What is Data Staging?
    • Background : Physical I/O  versus In-Memory Processing 
    • Task :
      • Create  a  transformation  to  join  from  3  data  sources  :  text  file,  Excel spreadsheet, and RDBMS
      • Create a currency staging table to solve sequential dependence problem.
  12. Advance Controls
    • Environment Variables.
    • Shared Objects. 
    • Error Handling.
    • Email job results.
    • Task :
      • Create a dynamic tables dump using variable and looping control. 
      • Refining existing transformations to use email alert. 
  13. Automation
    • Using Windows Task Scheduler to schedule ETL running job and transformation.