Tool for migrating data from MSSQL to AWS Redshift part 2 / 3

As promised, here’s an update on this project. In the MSSQL part, the T-SQL code is pretty much ready, you can check out the installation script here:

https://github.com/datahappy1/mssql_to_redshift_data_transfer_tool/tree/master/install/mssql

So how is this going to work? The Python wrapper will call the Stored Procedure [MSSQL_to_Redshift].[mngmt].[Extract_Filter_BCP] using PyMSSQL module like this:


EXEC [mngmt].[Extract_Filter_BCP]
@DatabaseName = N'AdventureWorksDW2016',
@SchemaName = N'dbo',
@TargetDirectory = N'C:\mssql_to_redshift\files',
@DryRun = 'False'

You’ll provide the Database name and the schema name, that’s the database you’re connecting to and the schema that’s containing the source tables with the data you are about to transfer into AWS Redshift.

You’ll also provide the target directory on your hard drive, that’s the location where the .csv files will be generated using bcp in the xp_cmdshell wrapper inside the Stored Procedure. This target directory will be created for you inside the Python code in the final version. The last parameter is called DryRun. When set to True, BCP extraction query is modified to return 0 rows for each file using “WHERE 1 = 0” pattern.

Once the Python coding part is ready, these Stored Procedure parameters will be internal and you’ll set these values as arguments while running the Python app.

This SP returns a Python-ready “string tuple” with the generated file names from the current run, in the case it succeeded. This tuple will be used further in the Python code to guarantee we pick up and move over to AWS Redshift only the expected set of files.

The main thing here is, that you need to fill out a table called mngmt.ControlTable. In the Github installation script, I loaded this table with the AdventureWorks DataWarehouse 2016 database columns for the demo purposes. So this table holds values like this:

ControlTable The IsActive flag determines, if the column makes it to the generated .csv file created for the corresponding table. Column_id is defining the order of the columns persisted into the .csv file.

The Stored Procedure [MSSQL_to_Redshift].[mngmt].[Extract_Filter_BCP] is writing the logs to a table called [mngmt].[ExecutionLogs] like this:

ExecutionLogs

And that’s all for now. Have a look at the installation build script, that should be pretty self-explanatory.

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s