Today, I’d like to introduce my new project: a SQL Server to AWS Redshift data migration tool. There’s not much tooling for this out there on the Internet, so I hope this tool is going to be valuable for some of you. It’s going to be written in Python 3.7 and, once finished, it will be published to my GitHub account under an MIT License. I’m going to describe the work here on this blog in two phases.
Phase #1 will be all about SQL Server coding. There I’ll need to:
- extract and filter the data from the SQL Server tables I need to transfer to AWS Redshift
- persist this data into .csv files using dynamically generated BCP commands (the .csv files will be split based on the target Redshift tables)
- store these .csv files on a local hard drive.
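The BCP step above can be sketched in Python. The server name, output directory, and table names below are assumptions for illustration, not details from the actual tool:

```python
# Sketch of Phase #1: build one BCP export command per table that feeds a
# target Redshift table. Server, directory, and table names are hypothetical.

def build_bcp_command(schema: str, table: str, out_dir: str, server: str) -> str:
    """Return a bcp command line that exports one table to a .csv file."""
    out_file = f"{out_dir}\\{schema}_{table}.csv"
    # -c: character mode, -t,: comma field terminator,
    # -S: server name, -T: trusted (Windows) connection
    return (f'bcp "SELECT * FROM {schema}.{table}" queryout '
            f'"{out_file}" -c -t, -S {server} -T')
```

Running the generated command (for example with `subprocess.run`) would produce the per-table .csv exports; `-T` assumes a trusted connection, so swap in `-U`/`-P` for SQL authentication.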
Phase #2 will be about Python and the AWS Boto3 library, wrapping the whole tool together to push the data all the way through to AWS Redshift. That means:
- Upload the .csv files from Phase #1 into an AWS S3 bucket
- Run COPY commands to load these .csv files into the AWS Redshift target tables
- Clean up the files and write log data
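A minimal Python sketch of the first two Phase #2 steps, assuming boto3 and hypothetical bucket, table, and IAM role names (the finished tool may differ):

```python
# Sketch of Phase #2: stage a Phase #1 .csv file in S3 and build the Redshift
# COPY command for it. All names here are illustrative assumptions.

def upload_to_s3(local_path: str, bucket: str, key: str) -> None:
    """Upload one exported .csv file into the staging S3 bucket."""
    import boto3  # local import so the COPY helper below works without boto3
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)

def build_copy_statement(table: str, bucket: str, key: str, iam_role: str) -> str:
    """Build the COPY command that loads one staged .csv into a target table."""
    return (f"COPY {table} FROM 's3://{bucket}/{key}' "
            f"IAM_ROLE '{iam_role}' CSV;")
```

The COPY statement itself still has to be executed against the cluster over a regular database connection (for example with psycopg2); that part is omitted here.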
As soon as I have this initial version out, I’d like to extend the tool to support incremental data loads based on watermarks as well.
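A watermark-based incremental extract could look something like the sketch below; the watermark column and how the last value is stored are assumptions for illustration only:

```python
from datetime import datetime

# Sketch of a watermark-based incremental load: only rows modified since the
# last recorded watermark get extracted. Column and table names are
# hypothetical; a real tool would read the watermark from its config/log data.

def build_incremental_query(schema: str, table: str,
                            watermark_col: str, last_watermark: datetime) -> str:
    """Build an extract query restricted to rows newer than the watermark."""
    return (f"SELECT * FROM {schema}.{table} "
            f"WHERE {watermark_col} > '{last_watermark:%Y-%m-%d %H:%M:%S}'")
```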
Sometimes in an ETL project you might need to move data between stages with explicit column name definitions in the queries. I typically end up using the following code snippet I’ve written for straight moves between stages. You could do this without dynamic SQL as well, but I prefer this approach because I usually run the snippet in an iteration that reloads all tables based on configuration logic evaluated from a config table. Dynamic SQL also makes it easy to append stage-specific values to the column list: replacement values, dynamically evaluated lookups, etc. As you can guess from my previous posts, the next step is creating the INSERT INTO (@COLUMN_LIST) SELECT @COLUMN_LIST statements and kicking them off with the batch size declared as a variable. Simple stuff, but having explicitly defined column names is often a must. Next time I’d like to blog about AWS Batch and event-driven data processing, so stay tuned.
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
DECLARE @STMT_MOVE_PRE_STG_COLS nvarchar(max);
DECLARE @COLUMN_LIST nvarchar(max);
DECLARE @schema_name nvarchar(255);
DECLARE @table_name nvarchar(255);
SET @schema_name = 'Production';
SET @table_name = 'Product';
SET @STMT_MOVE_PRE_STG_COLS = 'SELECT @COLUMN_LIST_IN =
STUFF((SELECT '','' + C.NAME
FROM SYS.COLUMNS C
INNER JOIN SYS.TABLES T ON C.Object_ID = T.Object_ID
INNER JOIN SYS.SCHEMAS S ON S.Schema_ID = T.Schema_ID
WHERE S.NAME = ''' + @schema_name + ''' AND T.NAME = ''' + @table_name + '''
ORDER BY C.NAME
FOR XML PATH ('''')), 1, 1, '''')';
EXECUTE SP_EXECUTESQL @STMT_MOVE_PRE_STG_COLS, N'@COLUMN_LIST_IN NVARCHAR(MAX) OUT', @COLUMN_LIST out;
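The follow-up step, assembling the INSERT INTO ... SELECT from the harvested column list, is plain string concatenation in any language; here is a quick Python illustration with hypothetical stage names:

```python
# Illustration only: assemble the stage-to-stage move statement from a
# comma-separated column list. Stage and table names are made up.

def build_move_statement(column_list: str, source: str, target: str) -> str:
    """Assemble a stage-to-stage INSERT with an explicit column list."""
    return (f"INSERT INTO {target} ({column_list}) "
            f"SELECT {column_list} FROM {source};")
```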
All materials regarding my talk are now published on my GitHub account; you can check them out here:
A couple of days back, I was asked how I would use SQL grouping functions to split a huge data load into separate batches. Below is the code I came up with. The next logical step would be to load the statements into a temp table, iterate through it, and execute the statements with sp_executesql. It should be said that if there are big gaps of missing IDs in the PK you are scanning, this might not be the best or most accurate solution.
DECLARE @BATCHSIZE INT = 10000;
/*CHECK THE MIN AND MAX IDs IN THE SOURCE TABLE*/
SELECT MIN(SalesOrderID) MinID,MAX(SalesOrderID) MaxID FROM [Sales].[SalesOrderHeader];
/*GENERATE ONE INSERT STATEMENT PER BATCH*/
SELECT
'INSERT INTO TARGET(..) SELECT COL1,COL2,.. FROM SOURCE WITH (NOLOCK) WHERE SalesOrderID BETWEEN '+
CAST(MIN(i.SalesOrderID) AS VARCHAR(10)) + ' AND ' +
CAST(MAX(i.SalesOrderID) AS VARCHAR(10)) + '; '
FROM
(SELECT SalesOrderID,
        SalesOrderID / @BATCHSIZE PartitionID
 FROM [Sales].[SalesOrderHeader] WITH (NOLOCK)) i
GROUP BY i.PartitionID;
--ORDER BY i.PartitionID;
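The same partitioning idea, ID divided by batch size, can be illustrated in Python (the sample IDs are made up):

```python
from collections import defaultdict

# Each ID maps to partition id // batch_size; each partition then yields one
# (min, max) range for a BETWEEN clause, mirroring the GROUP BY above.

def batch_ranges(ids, batch_size=10000):
    """Return (min_id, max_id) per integer-division partition, in order."""
    parts = defaultdict(list)
    for i in ids:
        parts[i // batch_size].append(i)
    return [(min(v), max(v)) for _, v in sorted(parts.items())]

print(batch_ranges([1, 2, 9999, 10000, 25000]))
# → [(1, 9999), (10000, 10000), (25000, 25000)]
```

Note how the gap between 10000 and 25000 produces single-row partitions, which is exactly the inaccuracy mentioned above when the PK has large holes.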