Tool for migrating data from MSSQL to AWS Redshift part 3 / 3

This tool is now mostly finished. I just need to finish the integration tests using the Pytest testing module. Otherwise, please feel free to clone, fork, or contribute.

You can find it here:

https://github.com/datahappy1/mssql_to_redshift_data_transfer_tool

 


Tool for migrating data from MSSQL to AWS Redshift part 2 / 3

As promised, here’s an update on this project. On the MSSQL side, the T-SQL code is pretty much ready; you can check out the installation script here:

https://github.com/datahappy1/mssql_to_redshift_data_transfer_tool/tree/master/install/mssql

So how is this going to work? The Python wrapper will call the Stored Procedure [MSSQL_to_Redshift].[mngmt].[Extract_Filter_BCP] using the pymssql module, like this:


EXEC [mngmt].[Extract_Filter_BCP]
@DatabaseName = N'AdventureWorksDW2016',
@SchemaName = N'dbo',
@TargetDirectory = N'C:\mssql_to_redshift\files',
@DryRun = 'False'
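
Under the hood, a minimal pymssql call from the Python wrapper could look roughly like the sketch below. The connection details and the way the result set is read are assumptions on my part for illustration, not the final implementation:

# A minimal sketch of calling the Stored Procedure via pymssql.
# Server, credentials and result handling are placeholders, not the tool's final code.
import pymssql

conn = pymssql.connect(server='localhost', user='sa', password='***',
                       database='MSSQL_to_Redshift')
cursor = conn.cursor()

cursor.execute(r"""
    EXEC [mngmt].[Extract_Filter_BCP]
        @DatabaseName = N'AdventureWorksDW2016',
        @SchemaName = N'dbo',
        @TargetDirectory = N'C:\mssql_to_redshift\files',
        @DryRun = 'False'
""")

# The SP is expected to return the file names generated in this run.
row = cursor.fetchone()
print(row)

conn.close()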

You’ll provide the database name and the schema name: the database you’re connecting to and the schema containing the source tables with the data you’re about to transfer into AWS Redshift.

You’ll also provide the target directory on your hard drive; that’s where the .csv files will be generated by bcp via the xp_cmdshell wrapper inside the Stored Procedure. In the final version, this target directory will be created for you by the Python code. The last parameter is called DryRun. When set to True, the BCP extraction query is modified to return 0 rows for each file using the “WHERE 1 = 0” pattern.

Once the Python coding part is ready, these Stored Procedure parameters will become internal, and you’ll set their values as arguments when running the Python app.
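
For illustration, the command-line interface might map arguments to these parameters along these lines; the argument names below are placeholders, not the final interface:

# Hypothetical CLI sketch: argument names are placeholders, not the tool's final interface.
import argparse

parser = argparse.ArgumentParser(description='MSSQL to AWS Redshift data transfer tool')
parser.add_argument('--databasename', required=True, help='source MSSQL database')
parser.add_argument('--schemaname', required=True, help='schema holding the source tables')
parser.add_argument('--targetdirectory', required=True, help='local folder for the generated .csv files')
parser.add_argument('--dryrun', default='False', choices=['True', 'False'],
                    help='when True, BCP extracts 0 rows per file (WHERE 1 = 0)')
args = parser.parse_args()

# These values are then passed through to [mngmt].[Extract_Filter_BCP].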

If it succeeds, this SP returns a Python-ready “string tuple” with the file names generated in the current run. This tuple is used further down in the Python code to guarantee that only the expected set of files gets picked up and moved over to AWS Redshift.
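
Since the exact return format isn’t shown here, the sketch below simply assumes the SP hands back a Python-literal tuple string and parses it with ast.literal_eval; the file names are made up for illustration:

# Assuming the returned value looks like a Python tuple literal in a single column,
# ast.literal_eval can turn it into a real tuple of expected file names.
import ast

returned_value = "('dbo.FactInternetSales.csv', 'dbo.DimDate.csv')"   # illustrative only
expected_files = ast.literal_eval(returned_value)

for file_name in expected_files:
    print(file_name)   # only these files should be uploaded to AWS later on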

The main thing here is that you need to fill out a table called mngmt.ControlTable. In the GitHub installation script, I loaded this table with the AdventureWorksDW2016 database columns for demo purposes. So this table holds values like this:

(screenshot: mngmt.ControlTable) The IsActive flag determines whether the column makes it into the generated .csv file for the corresponding table. Column_id defines the order of the columns persisted in the .csv file.

The Stored Procedure [MSSQL_to_Redshift].[mngmt].[Extract_Filter_BCP] writes its logs to a table called [mngmt].[ExecutionLogs], like this:

(screenshot: [mngmt].[ExecutionLogs] sample rows)

And that’s all for now. Have a look at the installation build script; it should be pretty self-explanatory.

 

Tool for migrating data from MSSQL to AWS Redshift part 1 / 3

Today, I’d like to introduce you to my new project, a SQL Server to AWS Redshift data migration tool. There’s not much tooling for this out there on the Internet, so I hope this tool will be valuable to some of you. It’s being written in Python 3.7 and, once finished, it will be published to my GitHub account under an MIT licence. What I’m currently doing will be described here on this blog in 2 phases.

Phase #1 will be all about SQL Server coding. There I’ll need to:

  • extract and filter the data from the SQL Server tables I need to transfer to AWS Redshift
  • persist this data into .csv files using dynamically generated BCP commands (a rough example follows this list); these .csv files will be split based on the target Redshift tables
  • store these .csv files on a local hard drive
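
To give an idea of what the dynamically generated BCP commands could look like: the real commands are built inside the Stored Procedure and run through xp_cmdshell, so the column list, paths, and server name below are made up for illustration only.

# Illustrative only: roughly the shape of a bcp command the SP might generate per table.
# Column list, file path and server name are placeholders.
columns = ['ProductKey', 'OrderDateKey', 'SalesAmount']   # driven by mngmt.ControlTable
query = "SELECT {} FROM [AdventureWorksDW2016].[dbo].[FactInternetSales]".format(', '.join(columns))
target_file = r'C:\mssql_to_redshift\files\dbo.FactInternetSales.csv'

bcp_command = 'bcp "{}" queryout "{}" -c -t, -T -S localhost'.format(query, target_file)
print(bcp_command)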


Phase #2 will be about Python and the AWS Boto3 library, wrapping this tool all together to push the data all the way through to AWS Redshift. That means:

  • upload the .csv files from Phase #1 into an AWS S3 bucket
  • run the COPY commands to load these .csv files into the AWS Redshift target tables (a rough sketch follows this list)
  • clean up the files and write the log data
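
Here’s a rough sketch of what Phase #2 could look like in code. The bucket name, IAM role, cluster details, target table name, and the use of psycopg2 for running the COPY command are all assumptions on my part at this stage:

# Sketch of Phase #2: upload a .csv to S3 with boto3, then COPY it into Redshift.
# Bucket, IAM role, credentials and the psycopg2 connection are placeholder assumptions.
import boto3
import psycopg2

s3 = boto3.client('s3')
s3.upload_file(r'C:\mssql_to_redshift\files\dbo.FactInternetSales.csv',
               'my-transfer-bucket', 'dbo.FactInternetSales.csv')

conn = psycopg2.connect(host='my-cluster.redshift.amazonaws.com', port=5432,
                        dbname='dev', user='awsuser', password='***')
cursor = conn.cursor()
cursor.execute("""
    COPY fact_internet_sales
    FROM 's3://my-transfer-bucket/dbo.FactInternetSales.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    CSV;
""")
conn.commit()
conn.close()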


As soon as I have this initial version out, I’d also like to extend this tool to be capable of running incremental data loads based on watermarks.