Wednesday, March 15, 2017

Slowly Changing Dimensions

I was planning to explain Slowly Changing Dimension (SCDs) since long time so this article is all about SCD(s).

Slowly Changing Dimensions (SCDs) are those dimensions that have data that may change slowly over the period, rather than changing on a time-based regular schedule. For example, a dimension in your Sales database tracks the sales information of your company's salespeople. Creating sales reports are easy until a salesperson is transferred from one department (office) to another. How do you handle such a change in your sales dimension?
 
Here is a scenario:
An Employee "XXX" belongs to Department D1 from DateOfJoining till 01/01/2008, so whatever sales made by this employee added to department D1. After couple of years, that employee moved in to department D2 on 01/01/2010. Now all his sales contribution should be added to D2  from 01/01/2010 onwards and the old sales should still belong to D1.


Types of SCDs:
There are different types of management methodologies referred as Type 0 through 6.

SCD Type 0:
The Type 0 method is a passive approach to managing dimension value changes, in which no action is taken. Values remain as they were at the time the dimension record was first entered.Type 0 provides the least control or no control over managing a slowly changing dimension.

The most common slowly changing dimensions are Types 1, 2, and 3.

SCD Type 1:
This methodology overwrites old data with new data and therefore does not track historical data at all.

Here is an example of a database table that keeps customer's email information:
CustomerID
CustomerName
Email
101
Cust101
abc@xyz.com
102
Cust102
pqr@xyz.com

If Cust101 email changes from abc@xyz.com to def@xyz.com then we need simple update statement to overwrite the old value with new one.
Here is how table will look after update command:
CustomerID
CustomerName
Email
101
Cust101
def@xyz.com
102
Cust102
pqr@xyz.com

The main disadvantage to Type 1 SCDs is that there is no historical record kept in the data warehouse. But an advantage to Type 1 SCDs is that they are very easy to maintain.

SCD Type 2:
The Type 2 SCD tracks historical data by creating multiple records in the dimensional tables with separate surrogate keys. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.

Lets consider above example again. We need to maintain history of email address as well, in that case, we need two more columns (FromDate & ToDate).

CustomerID
CustomerName
Email
FromDate
ToDate
101
Cust101
abc@xyz.com
2011-01-01
9999-12-31
102
Cust102
pqr@xyz.com
2014-10-25
9999-12-31

If Cust101 email changes from abc@xyz.com to def@xyz.com on 2017-02-15, then we need to insert a new record and update the ToDate of previous record
Here is how table will look after update command:

CustomerID
CustomerName
Email
FromDate
ToDate
101
Cust101
abc@xyz.com
2011-01-01
2017-02-14
101
Cust101
abc@xyz.com
2017-02-15
9999-12-31
102
Cust102
pqr@xyz.com
2014-10-25
9999-12-31

SCD Type 3:
The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it’s limited to the number of columns we designate for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 will add additional columns to the tables:

CustomerID
CustomerName
OriginalEmail
FromDate
CurrentEmail
101
Cust101
abc@xyz.com
2017-02-15
abc@xyz.com
102
Cust102
pqr@xyz.com
NULL
NULL

Type 3 is less common because it involves changing the physical tables and is not scalable.

Reference: MSDN