This is the staging table:
create table SourceWithDuplicates
(SurrogateKey int identity(1,1),
Code varchar(12),
Description varchar(20),
OtherAttribute1 int,
OtherAttribute2 Varchar(10),
TieBreakerColumn Varchar(12))
As a general rule I always add a surrogate key column to my staging tables (that makes some DBAs happy); an in this case it would actually help in the de-duplication process.
Some sample data:
Insert into SourceWithDuplicates
values('Abc', 'Row Description 1', 1, 'Other1','1')
Insert into SourceWithDuplicates
values('Abc', 'Row Description 1', 1, 'Other1', '1')
Insert into SourceWithDuplicates
values('Abc', 'Row Description 1', 1, 'Other2','2')
Insert into SourceWithDuplicates
values('Def', 'Row Description 2', 1, 'Other4','a')
Insert into SourceWithDuplicates
values('Def', 'Row Description 2', 1, 'Other5','a')
Insert into SourceWithDuplicates
values('Ghi', 'Row Description 3', 1, 'Other5','a')
Insert into SourceWithDuplicates
values('Jkl', 'Row Description 4', 1, 'Other5','a')
The requirements:
Code and description values need to be unique in the target table; in case they are duplicates, choose the row with the highest value in the tie breaker column. If the value in the tiebreaker column is also duplicated; then choose any of the rows.
So the data looks like:
SurrogateKey Code Description OtherAttribute1 OtherAttribute2 TieBreaker ------------ ------------ -------------------- --------------- --------------- ----------
1 Abc Row Description 1 1 Other1 1
2 Abc Row Description 1 1 Other1 1
3 Abc Row Description 1 1 Other2 2
4 Def Row Description 2 1 Other4 a
5 Def Row Description 2 1 Other5 a
6 Ghi Row Description 3 1 Other5 a
7 Jkl Row Description 4 1 Other5 a
Following the requirements, only 4 rows kept should be chosen; those with surrogate keys:
2 3, 4 or 5, 6 and 7
This is the query using the Rank() function that resolve our issue:
Select *,
rank() over (Partition by Code, description order by TieBreakerColumn desc, SurrogateKey) MyRank
from SourceWithDuplicates
and this is its output:
Surrogatekey Code Description MyRank
------------ ------------ -------------------- --------------------
3 Abc Row Description 1 1
1 Abc Row Description 1 2
2 Abc Row Description 1 3
4 Def Row Description 2 1
5 Def Row Description 2 2
6 Ghi Row Description 3 1
7 Jkl Row Description 4 1
Notice that if we filter the result from the previous query using WHERE MyRank=1 we will get the rows to be kept.
So, in order to fulfill the original requirement we can create a package with a source component that uses the previous query and then add a Conditional Split transformation to split the pipeline in 2 flows. Rows with MyRank=1 go to the destination table; rows where MyRank>1 go to the error table.
Notice that in SSIS you can use Sort transformation to remove duplicates but you don’t have control over the rows that get removed, not to mention the potential performance issue as this transformation is a fully blocking component.
UPDATE: I had a second though after I first made this post. In most cases it may feel more natural to use Row_Number() function instead of Rank() and that way we don't need to be worry about possible 'ties'. However RANK() should work fine too.
Rows "2, 4 or 5, 6 and 7" are expected but in result rows 3, 4, 6 and 7 are chosen
ReplyDeleteGood catch Stevo, and thanks for pointing it out. I have corrected it in the post.
ReplyDeleteHi,
ReplyDeleteIn my package i would like to check text columns are ranging in a limit say one column is 20 char and other is 30 char etc. if the particular column exceeds the limit then it should be redirected to error files.
Same way numeric / decimal columns --if it contains non numeric values then i need to redirect it to error files.
How to go about it.
Thanks,
Regards
Viji
Viji,
ReplyDeleteThat question is not really related to what is showed in this post. You can however try asking in my favorite SSIS forum: http://social.msdn.microsoft.com/forums/en-US/sqlintegrationservices/threads/
It seems rank() doesn't work in my case, I have to use row_number() instead.
ReplyDeleteIt seems rank() doesn't work in my case, I have to use row_number() instead.
ReplyDeleteThanks Rafael! Just what I was looking for and worked like a charm!
ReplyDelete