Saturday, September 3, 2011

Incremental Load in SSIS

Problem Description: Perform Incremental Load using SSIS package. There is one Source table with ID (may be Primary Key), CreatedDate and ModifiedDate along with other columns. The requirement is to load the destination table with new records and update the existing records (if any updated records are available).

Soultion:
You can use Lookup Transformation where you compare source and destination data based on some id/code and get the new and updated records, and then use Conditoional Split to select the new and updated rows before loading the table.

However, I don't recommend this approach, specially when destination table is very huge and volume of delta is very high.

You can do it in simple steps:
  1. Find the Maximum ID & Last ModifiedDate from destination and store in package variables.
  2. Pull the new and updated records from source and load to a staging table using above variables.
  3. Insert and Update the records using Execute SQL Task
Here is the an step-by-step example to do this:
 
STEP1:
Create a new Package IncrementalLoad.dtsx and add following package variables:








VariableNameDescriptionExamle
DestinationConnStrConnection string for
destination server/db
Data Source=(local);
Initial Catalog=TestHN;
Provider=SQLNCLI10.1;
Integrated Security=SSPI;
Auto Translate=False;
MaxIDMax ID from
destination table
100
MaxModifiedDateMax Date from
destination table
2010:11:10 11:50:20.003
SourceConnStrConnection string for
source server/db
Data Source=(local);
Initial Catalog=TestHN;
Provider=SQLNCLI10.1;
Integrated Security=SSPI;
Auto Translate=False;



STEP2:
Create Source, Destination, and staging tables using below code:

-- Source Table (create in source DB)
IF OBJECT_ID('dbo.TestSource','U') IS NOT NULL
DROP TABLE dbo.TestSource
GO
CREATE Table dbo.TestSource
(
  [ID] [int] IDENTITY(1,1)
  ,[Code] [varchar](10)
  ,[Description]  [varchar](100)
  ,[CreatedDate]  [datetime] NOT NULL default GETDATE()
  ,[ModifiedDate] [datetime] NOT NULL default GETDATE()
)
GO

-- Destination Table (create in destination DB)
IF OBJECT_ID('dbo.TestDestination','U') IS NOT NULL
DROP TABLE dbo.TestDestination
GO
CREATE Table dbo.TestDestination
(
  [ID] [int]
  ,[Code] [varchar](10)
  ,[Description]  [varchar](100)
  ,[CreatedDate]  [datetime] NOT NULL
  ,[ModifiedDate] [datetime] NOT NULL
)
GO

-- Staging Table (create in destination DB)
IF OBJECT_ID('dbo.TestDestinationSTG','U') IS NOT NULL
DROP TABLE dbo.TestDestinationSTG
GO
CREATE Table dbo.TestDestinationSTG
(
  [ID] [int]
  ,[Code] [varchar](10)
  ,[Description] [varchar](100)
  ,[CreatedDate] [datetime] NOT NULL
  ,[ModifiedDate] [datetime] NOT NULL
)
GO

STEP3:
Create two OLE DB Connection Manager, one for Source Server and another for Destination Server.
In the connection manager properties, set the expression ConnectionString with respective variables as shown below:

STEP4:
Drag and drop Execute SQL Task and name it -
Execute SQL Task - Get Max ID and Last ModifiedDate.
Double click on EST and set  following properties:
ResultSet = Single row
Connection = Destination Server
SQLStatement =
SELECT
   ISNULL(MAX(ID) ,0) MaxID,
   ISNULL(MAX(ModifiedDate),'2000-01-01') MaxModifiedDate
FROM TestDestination (NOLOCK)

Drag and drop another Execute SQL Task to Truncate staging table. Rename the task - Execute SQL Task - Truncate staging Table and set following properties:
ResultSet = None
Connection = Destination Server
SQLStatement = Truncate Table dbo.TestDestinationStg

 
Drag and drop Data Flow Task and name it Data Flow Task - Pull New and Updated rows.
Double click on DFT or click on Data Flow tab.
Now drag and drop OLE DB Source and select Source Server in OLE DB connection manager, select SQL command in Data access mode, and write following T-SQL code in SQL command text:
SELECT [ID],[Code],[Description],
[CreatedDate],[ModifiedDate]
FROM TestSource
WHERE [ID] > ? OR [ModifiedDate] >= ?
 
Click on Columns to generate metadata and set Parameters for above query:
 

Now drag and drop OLE DB Destination task and select Destination Server in OLE DB Connection manager, Table or view - fast load in Data access mode, and dbo.TestDestinationSTG in Name of the table or view.
Now click on Mapping to map the metadat with source columns.

Drag and drop one more Execute SQL Task and rename it to Execute SQL Task - Insert and Updated new rows and set following properties:
ResultSet = None
Connection = Destination Server
SQLStatement =
-- INSERT New Records
INSERT INTO TestDestination
([ID],[Code],[Description],
[CreatedDate],[ModifiedDate])
SELECT
   [ID],[Code],[Description],
   [CreatedDate],[ModifiedDate]
FROM TestDestinationStg
WHERE ID > ?


--UPDATE modified records
UPDATE D
SET D.[ID] = S.[ID]
,D.[Code] = S.[Code]
,D.[Description] = S.[Description]
,D.[CreatedDate] = S.[CreatedDate]
,D.[ModifiedDate] = S.[ModifiedDate]
FROM TestDestination D
JOIN TestDestinationStg S
   ON S.ID = D.ID
WHERE
   S.ID <= ? AND
   S.ModifiedDate > D.ModifiedDate

Click on Parameter Mapping and set the parameters as shown below:

Finally package will look like below snapshot:
 

STEP5:
We are done with package development. Its time to test the package. I will test this package in three steps:
1. Insert few dummy records in Source table while keeping Destination table empty.
--CASE 1: First Time Execution
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('AAA','American Automobile Association')
WAITFOR DELAY '00:00:01.100' -- delay between two rows
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('ABC','Associated Builders & Contractors')
WAITFOR DELAY '00:00:01.150'
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('RR','Road Runner')
GO

Now execute the package and check the destination data. You can add Data viewers in Data Flow to see the result at run time as shown below:

2. Insert few more records in Source table to check whether new records are inserted into Destination table.
--CASE 2: Only New Records
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('TSL','Trina Solar Limited')
WAITFOR DELAY '00:00:01' -- delay between two rows
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('FSLR','First Solar, Inc.')

Now execute the package and check for new records:

3. Insert few more records and update few existing records in Source table and check whether both table are in sync.
--CASE 3 -- New & Updated Records
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('LDK','LDK Solar Co., Ltd')


UPDATE dbo.TestSource
SET [Description] = 'Associated Builders and Contractors',
ModifiedDate = GETDATE()
WHERE [Code] = 'ABC'

Finaly check source and destination tables for match. Since I have both Source & Destination tables in one server, I can use EXCEPT command (for this example) which retunrs no records i.e. both the tables are in sync.


Cheers!!!

5 comments:

  1. Great article with nice example... thanks a lot.

    ReplyDelete
  2. Why do we need to take the max(ID) from destination?? Max(ModifiedDate) From destination is fine ..rt? Then using this Max(ModifiedDate) , retrieve records from Source where the modifiedDate > Max(ModifiedDate) .

    Or if you are using 2008 R2 , you can use CDC feature

    ReplyDelete
  3. There are several ways to do that Manuraj :)
    I agree your solution will work for the incremental load, albeit, you would require to handle which record to insert and which to update, which is kind of similar to your another suggestion of using SCD.

    Preference of the solution should depends on the performance and then simplicity. How about you share some statistics on the performance of different approaches so that everybody get benefited out of that!

    ReplyDelete
  4. HI ,
    You did a fabulous work....well explained.......

    ReplyDelete
  5. Hi Hari

    I have some performance comparisons (and another method using hashing) at http://markgstacey.net/2013/05/11/incremental-loads-in-ssis-using-the-lookup-component-and-fnv1a-hash-in-a-synchronous-script-component/

    ReplyDelete

Note: Only a member of this blog may post a comment.