The Date dimension is a well known construct in general data warehousing. In many cases, the data for a date dimension is generated using a database stored procedure or shell-script.
Another approach to obtain the data for a date dimension is to generate it using an ETL tool like Pentaho Data Integration, a.k.a. Kettle. I think this approach makes sense for a number of reasons:
- When you tend to use a particular ETL tool, you will be able to reuse the date dimension generator over an over, and on different database platforms.
- You won’t need special database privileges beyond the ones you need already. Privileges for creating tables and to perform DML will usually be available, whereas you might need to convince a DBA that you require extra privileges to create and execute stored procedures.
In addition to these general considerations, you can pull a neat little trick with Kettle to localize the data and format of the date attributes. I wouldn’t go as far as to say that this feature is Kettle specific: rather, it relies on the localization support built into the java platform and the way you can put that to use in Kettle transformations.
Prerequisites
In this tip, the steps to create a date dimension are described using Kettle 2.5.1 (Generally available Release) and MySQL 5.1.20 (Beta). You will be able to follow through the example using earlier (and later) versions of both products though - I am not using any functionality that is specific to these particular version of the products. The recipe does not really require that you understand anything about data warehouses or date dimensions, but you will probably appreciate it better if you do ;)
Overview
The transformation to generate the data for the date dimension follows a pretty straightforward design. The graphical representation of the transformation is shown below:
First, the dimension table is created (Prepare). After that, rows are generated to fill it (Input). However, the generated rows are almost empty and barren - we still need to derive and add data to fill the attributes of the date dimension (Transformation). Finally, the data is stored in the date dimension table (Output).
Step-by-Step
The remainder of this article describes in detail how to build this transformation. The majority of steps is probably not very interesting to moderately experienced Kettle users, but may be of use to beginning users.
Note for users that are completely new to Kettle - it is advisable to review the first few chapters of the Spoon user guide (Spoon is the name of Kettle tool you use to design the ETL process). It explains how to start up the tool, create a new transformation, add and connect steps etc. You can find it in the docs/English directory beneath the Kettle home directory.
MySQL JDBC driver: setting the characterEncoding property to UTF8
You need to create a (JDBC) connection to MySQL in the usual, straightforward way:
In addition, you need to set the characterEncoding property of the JDBC driver:
This ensures MySQL will be able to understand the utf8 encoded data that we may produce to generate a date dimension in the, say, Chinese language. Note that you cannot just use a statement like SET NAMES utf8 to do this. This is not specific to Kettle, but has to do with the way the MySQL JDBC driver (Connector/J) handles character sets. Please refer to the “Using character sets and unicode” section of the Connector/J documentation for more information on this topic.
Creating the date dimension table
In this particular case, it seemed convenient to create the dimension table as part of the transformation. This is done using the “Execute SQL Script” step shown below:
The “Execute SQL Script” step executes the following script to create the date dimension table:
DROP TABLE IF EXISTS dim_date ; CREATE TABLE IF NOT EXISTS dim_date ( date_key smallint unsigned NOT NULL, date date NOT NULL, date_short char(12) NOT NULL, date_medium char(16) NOT NULL, date_long char(24) NOT NULL, date_full char(32) NOT NULL, day_in_year smallint unsigned NOT NULL, day_in_month tinyint unsigned NOT NULL, is_first_day_in_month char(10) NOT NULL, is_last_day_in_month char(10) NOT NULL, day_abbreviation char(3) NOT NULL, day_name char(12) NOT NULL, week_in_year tinyint unsigned NOT NULL, week_in_month tinyint unsigned NOT NULL, is_first_day_in_week char(10) NOT NULL, is_last_day_in_week char(10) NOT NULL, month_number tinyint unsigned NOT NULL, month_abbreviation char(3) NOT NULL, month_name char(12) NOT NULL, year2 char(2) NOT NULL, year4 year NOT NULL, quarter_name char(2) NOT NULL, quarter_number tinyint NOT NULL, year_quarter char(7) NOT NULL, year_month_number char(7) NOT NULL, year_month_abbreviation char(8) NOT NULL, PRIMARY KEY(date_key), UNIQUE(date) ) ENGINE=MyISAM DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_unicode_ci
This is by no means a complete date dimension. The most important limitation is that it only contains attributes that are immediately derivable from the calendar. So, attributes to denote business specific periods like the fiscal year and holidays are not included.
Generating 10 years worth of days
The grain of the date dimension is days - a row in the date dimension represents a single day. In this case, the “Generate Rows” is configured to generate 3660 rows, which roughly corresponds with enough days to last 10 years:
In the example, this step is also used to provide parameters to generate the date dimension data. As we’ll see in a moment, the inital_date field effectively specifies the first date that goes into the date dimension. The language_code and country_code fields are used to localize the textual attributes of the date dimension, and the local_yes and local_no fields are used for boolean attributes.
There are other ways to get these parameters into our transformation. For example, we could have used an “Add Constants” step with a similar result. Another possibility would be to get this data from the environment using a “Get Variables” step, and this would allow the parameters to be specified at transformation run-time.
Counting the days
Although we certainly generate enough rows, they are all identical. In order to have each row represent a single distinct day, we need a way to ‘count’ the generated rows. We do this by adding a “Sequence” step:
In this case, we use the “Add sequence” step to generate an incrementing number within the scope of the transformation. As we’ll see later on, we can add this to our initial date to get a series of consecutive dates.
Calculating date dimension Attributes
The previous steps form a basis from which we can derive all of the attributes that currently make up our date dimension. To actually calculate the date attributes, we use a “Modified Java Script Value” step:
Kettle comes with an embedded Rhino javascript engine. The “Modified Java Script Value” step lets you use it to run javascript code to as part of the transformation.
The javascript code is executed for each row that comes out of the previous steps. In the script code, one can reference the values from the input rows, perform some processing on them, and generate new output fields.
One of the fortunate characteristics of the Rhino engine is that it lets us use java classes inside the javascript code. Let’s take a look at the script to see how we can use that to generate the localized data for our data dimension attributes.
Initialization
The first thing we do in the javascript code is to get data from the current input row. In the “Generate Rows” step, we added the language_code and country_code fields to specify a locale. Here, in the script, we use the following piece of code to turn that into a java.util.Locale object:
//Create a Locale according to the specified language code var locale = new java.util.Locale( language_code.getString() //get the ISO639 language_code from the input row , country_code.getString() //get the ISO3166 country_code from the input row );
The java.util.Locale class represents a particular cultural region. It forms a cornerstone of the internationalization support built into the java platform, and provides information to many other classes to generate appropriately localized output.
We will be using the locale on a number of occasions, but first, we use our it to initialize a java.util.Calendar object:
//Create a calendar, use the specified locale var calendar = new java.util.GregorianCalendar(locale);
(Note that the java platform currently only provides one concrete Calender Class: the java.util.GregorianCalendar. Unfortunately, java does not seem to provide a built-in recipe for dealing with, for example, Islamic or Hebrew calendars).
We require the calendar object to obtain an instance of the java.util.Date Class that represents the date corresponding to the current row. To do that, we first set the calendar’s current date using the initial_date field that was specified in the “Generate Rows” step:
//Set the initial date calendar.setTime(initial_date.getDate());
We need this to add the number of days generated by our “Add Sequence” step:
//set the calendar to the current date by adding DaySequence days calendar.add(calendar.DAY_OF_MONTH,DaySequence.getInteger() - 1);
(Note that we substract 1 from the DaySequence value. This is because our sequence starts at 1, and we want the specified initial date to be included in our date dimension).
We conclude the initialization of the script by retrieving a java.util.Date object that represents the date for the current row.
//get the calendar date var date = new java.util.Date(calendar.getTimeInMillis());
This java.util.Date instance is assigned to the date variable in the script, allowing it to be used as an output field of the javascript step. We require this in order to fill the date column of the date dimension table. We will also be using the date variable later on in this script to derive the value of other date dimension attributes.
Getting Text representations of full dates
Our date dimension has a number of attributes to denote a complete date containing day, month and year parts, in various formats: date_short, date_medium, date_long and date_full. These are all generated using the java.text.DateFormat class.
To do that, we first need to create an appropriate DateFormat instance using the static getDateInstance() method, passing our locale object as well as a constant that specifies whether we want to short, medium, long or full format. Then, we can pass the java.util.Date object for which we want to obtain the textual representation to the format method of the newly created java.text.DateFormat instance:
//en-us example: 9/3/07 var date_short = java.text.DateFormat.getDateInstance( java.text.DateFormat.SHORT , locale ).format(date); //en-us example: Sep 3, 2007 var date_medium = java.text.DateFormat.getDateInstance( java.text.DateFormat.MEDIUM , locale ).format(date); //en-us example: September 3, 2007 var date_long = java.text.DateFormat.getDateInstance( java.text.DateFormat.LONG , locale ).format(date); //en-us example: Monday, September 3, 2007 var date_full = java.text.DateFormat.getDateInstance( java.text.DateFormat.FULL , locale ).format(date);
Formatting date parts
Extracting and formatting different date parts is most easily done by applying the format function on a subclass of java.text.Dateformat, the java.text.SimpleDateFormat class. The java.text.SimpleDateFormat class allows formatting of dates based on date and time patterns:
//day in year: 1..366 var simpleDateFormat = java.text.SimpleDateFormat("D",locale); var day_in_year = simpleDateFormat.format(date);
In this example, we pass both the locale and a date pattern to the constructor to create an instance of the java.text.SimpleDateFormat class. The pattern is passed as the string "D", specifying a day-in-year format.
Once we created the java.text.SimpleDateFormat instance, we can apply a new pattern to it using the applyPattern() method. Calling the format method again, we obtain the date in the desired format:
//day in month: 1..31 simpleDateFormat.applyPattern("d"); var day_in_month = simpleDateFormat.format(date); //en-us example: "Monday" simpleDateFormat.applyPattern("EEEE"); var day_name = simpleDateFormat.format(date); //en-us example: "Mon" simpleDateFormat.applyPattern("E"); var day_abbreviation = simpleDateFormat.format(date); //week in year, 1..53 simpleDateFormat.applyPattern("ww"); var week_in_year = simpleDateFormat.format(date); //week in month, 1..5 simpleDateFormat.applyPattern("W"); var week_in_month = simpleDateFormat.format(date); //month number in year, 1..12 simpleDateFormat.applyPattern("MM"); var month_number = simpleDateFormat.format(date); //en-us example: "September" simpleDateFormat.applyPattern("MMMM"); var month_name = simpleDateFormat.format(date); //en-us example: "Sep" simpleDateFormat.applyPattern("MMM"); var month_abbreviation = simpleDateFormat.format(date); //2 digit representation of the year, example: "07" for 2007 simpleDateFormat.applyPattern("y"); var year2 = simpleDateFormat.format(date); //4 digit representation of the year, example: 2007 simpleDateFormat.applyPattern("yyyy"); var year4 = simpleDateFormat.format(date);
Dealing with Quarters
Although the java.text.SimpleDateFormat class is useful, it does not provide any functionality for working with quarters. We do want our date dimension to contain attributes to represent the quarter, so we have to reside to computing these manually:
//handling Quarters is a DIY
var quarter_name = "Q";
var quarter_number;
switch(parseInt(month_number)){
case 1: case 2: case 3: quarter_number = "1"; break;
case 4: case 5: case 6: quarter_number = "2"; break;
case 7: case 8: case 9: quarter_number = "3"; break;
case 10: case 11: case 12: quarter_number = "4"; break;
}
quarter_name += quarter_number;
Although this will do for now, this solution doesn’t really cut it because it does not produce localized output. Anyway, it is better than nothing so we’ll just have to make do with it.
Period demarcation flags
Our date dimension has a few attributes that are used to indicate the start and end of week and month periods. We use simple yes/no type flags, but we allow the actual “yes” and “no” values to be specified by the user in the “Generate Rows” step. We retrieve them with the following piece of code:
//get the local yes/no values var yes = local_yes.getString(); var no = local_no.getString();
We can now use them these to flag the start and end of week and month periods.
The start (and of course, also the end) of the week are subject to the locale. In order to find out if we are dealing with the first day of a week, we use the getFirstDayOfWeek() method of the java.util.Calendar class. By comparing its return value with the day of week of the current row, we can see if we happen to be dealing with the first day of the week:
//initialize for week calculations var first_day_of_week = calendar.getFirstDayOfWeek(); var day_of_week = java.util.Calendar.DAY_OF_WEEK; //find out if this is the first day of the week var is_first_day_in_week; if(first_day_of_week==calendar.get(day_of_week)){ is_first_day_in_week = yes; } else { is_first_day_in_week = no; }
Note that we obtain the current day of the week by passing the value of the DAY_OF_WEEK constant to the get method of the java.util.Calendar object that we initialized at the start of the script.
In order to set the value for the is_last_day_in_week attribute of the date dimension, we simply find out if the next day happens to be the first day of the week. If it is, then by definition, the current row represents the last day of the week:
//calculate the next day
calendar.add(calendar.DAY_OF_MONTH,1);
//get the next calendar date
var next_day = new java.util.Date(calendar.getTimeInMillis());
//find out if this is the first day of the week
var is_last_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
is_last_day_in_week = yes;
} else {
is_last_day_in_week = no;
}
(Note that we have already used similar code to add a day to a date when we added the day sequence to the initial date.)
We can use similar logic to calculate the values for the is_first_day_in_month and is_last_day_in_month indicators. This is actually easier, because the first day in the month is not dependant upon the locale (at least - not within one calendar). So, we only need to find out if the day of month is equal to one:
//find out if this is the first day of the month
var is_first_day_of_month;
if(day_in_month == 1){
is_first_day_in_month = yes;
} else {
is_first_day_in_month = no;
}
//find out if this is the last day in the month
var is_last_day_of_month;
if(java.text.SimpleDateFormat("d",locale).format(next_day)==1){
is_last_day_in_month = yes;
} else {
is_last_day_in_month = no;
}
A few more date attributes
We conclude the computation of the date attributes by adding a few more useful labels:
//a few useful labels var year_quarter = year4 + "-" + quarter_name; var year_month_number = year4 + "-" + month_number; var year_month_abbreviation = year4 + "-" + month_abbreviation;
Like when we calculated the quarters, this is actually not a very good method because the results will not be localized. That said, the result will make sense for many locales, and we don’t really have a better way to deal with it right now.
Defining the step outputs
We just calculated all the required values to fill the attributes of our date dimension. We just need to get them out of the script and into the outputs of the step.
Every variable declared in the javascript (using the var keyword) can be used as an output field of the javascript step. The easiest way to generate the outputs is by hitting the “Get Variables” button at the bottom of the dialog. This simply adds an output field for each variable declared in the script:
By default, the data type for all the outputs added in this way is set to the String type. Although it is good practice to choose a more specific data type, it is almost always unncessary in this case, as all integer type values will be correctly converted implicitly when we insert them into the database. There is one exception in this case, and that is the date output. Inside the script, it is an instance of a java.util.Date class, and we must set the type to “Date” in the output too. Otherwise, the (java) string representation of the java.util.Date object will be sent as output, and this is not automatically recognized as a date by MySQL.
Discarding Fields
We are now almost ready to insert the rows into the date dimension table. We only need to discard all fields in the stream that do not correspond with any of the columns in our date dimension table. We use a “Select Values” step to do that:
We use the “Get Fields To Select” button to pull in all available fields, and after that, simply select and delete each field that we do not need. As a final step, we rename the DaySequence field to date_key to map it to the date_key column in our date dimension table.
Inserting data into the table
In the final step, we add the generated data to the dim_date table we created in the very first step of the transformation:
We only need to specify the connection and the table name here, and the step will then automatically attempt to map the fields of the incoming rows to table columns.
We could have used the “Insert / Update” step, or even the “Execute SQL Script” step too to write the data to the dimension table, but that would require a little bit of extra work.
Running the transformation
After building the transformation, you can run it by hitting the “running man” icon on the toolbar. This will open a dialog where you can set a number of properties for the transformation. Hit “Launch” button there and after that, the transformation will be executed:
Closing Notes
I hope you enjoyed this tip. If you want to, you can download the kettle transformation here, and use it as you see fit.
If you are interested in open source data warehousing, register for the MySQL Enterprise Data Warehousing Seminar, Thursday, September 06, 2007 and hear what Robin Schumacher has to say about that subject. (Note that this is a general MySQL data warehousing seminar - this post and the seminar are unrelated)













Hi,
I am new to this pentaho tool and tried some of basic transformation.. in my transformation i hve created the date dimension table but i want to extract date from the date dimension table based on the given date range... could u plz repy me as soon as possible...
Thanks & Regards
Arunn
Hi Arunn,
Take a look at the penthao forums at http://forums.pentaho.org/ or on the freenode ##pentaho IRC channel. There are many people there that can offer help.
I'm not sure I understand what you mean by "extract date from the date dimension table" but with in Kettle, you can use a "Table Input Step", and there you can enter an arbitrary SELECT statement. You can select a particular range there by setting up an appropriate WHERE condition.
Hi Roland,
Thk u for ur help...but my problem is different,actually am having 2 table input(source table) the 1st table contains start_date & end_date and 2nd table contains date(dimension table).During the tranformation i want to extract the date based on the range(start_date & end_date) and load it on to the target table.
for ex.
1st table
project_id start_date end_date
110 01/01/2008 03/01/2008
2nd table
pk_date_id date
.
.
50 01/01/2008
51 02/01/2008
52 03/01/2008
.
.
target table
project_id start_date end_date expense_date
110 01/01/2008 03/01/2008 01/01/2008
110 01/01/2008 03/01/2008 02/01/2008
110 01/01/2008 03/01/2008 03/01/2008
each row from 1st table is compare to the 2nd table and generate the no. of rows based on the range...
could u plz explain me in a brief manner as soon as possible...
Thanks
Arunn
Hi Arunn,
Please understand that the pentaho forums should be the first place to look for help. As is always the case with voluntary help offered through community channel, you should be patient and not stess too often that you need help urgently. If you really need help "...as soon as possible..." then maybe you should look into getting a pentaho subscription so you can call for support whenever you need it. Thank you for understanding.
Now, as for your problem, there are at least two ways to solve this, but most probably more than that. I bet you can find out for yourself too if you take some time to read the kettle documentation. Anyway,
1) you can simply write the appropriate SQL to join your tables in your table input step.
2) you can use a kettle database join step
kind regards, i hope this helps.
Roland
Hi Roland,
sorry for the urgent expectation... i was tried all the object(sql script,database join etc..) to filter the rows based on range but it was not worked fine after tht i hve created the cartesian product with condition it's work fine.. thk for ur reply....
Thanks
Arunn
Hi Roland,
I hve created 3 tranformation in kettle but i wnt to schedule the transformation in particular interval.i was tried it by using job, it execute the trans. but data is not available...
i want to execute the transformation in particular interval.
Thanks
Arunn
So what's wrong with cron or at?
Hi Roland,
i was tried by using at command even though it was not executed properly...
Ex.
PMA.bat, my batch file contains 2 transformations
pan.bat /rep:"My_Sample_Repository" /trans:"PMA_Project_Transformation" /dir:/ /user:admin /pass:admin /level:Basic
pan.bat /rep:"My_Sample_Repository" /trans:"PMA_Expenditure_Transformation" /dir:/ /user:admin /pass:admin /level:Basic
i want to execute this 2 transformation one after another..
In command prompt i schedule the batch file by using "at" command
ex. at 15:30 "PMA.bat"
Well, I'm sorry but I guess you're going to have to troubleshoot your problem then.
Please allow me to remind you of the pentaho forums (forums.pentaho.org) and the ##pentaho irc channel on freenode. But are an excellent source of community support.
kind regards,
Roland Bouman