A few thoughts on AWS Batch S3 event-driven usage

AWS Batch is a great service. This is what AWS says about it: AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 and Spot Instances.

What I want to write about in this blog post is how to make AWS Batch work for you in a real-life, event-driven scenario triggered by a file arriving in S3. I use this approach to decouple the file's metadata from the file itself: the metadata from the file arrival event spins up a Batch data-processing job and defines the application logic and the validations that run inside it. Only when all of those succeed does the Batch job pick up the file itself for processing.

Let’s look at the two options I’ve worked with so far:


Scenario #1: A file arrives in an S3 bucket, CloudTrail captures the event and passes it on to CloudWatch Events, and a CloudWatch Events rule triggers the AWS Batch job, since Batch is a valid CloudWatch Events target. Use this scenario if you don’t need heavy logic in the arguments you pass to your Batch job. Typically you would pass just basic metadata such as the S3 key (the “file path”), etc.

*Note: Don’t forget to keep your CloudTrail log files in a different bucket than the one you use for the file arrival event; otherwise the CloudTrail log files themselves can easily keep triggering the Batch job 🙂
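
As a rough illustration of Scenario #1, the rule and its Batch target could be set up with boto3 along these lines. This is a minimal sketch, not a definitive setup: the target id, job name and function names are hypothetical, it assumes CloudTrail data events are enabled for the bucket, and it assumes the job definition declares `S3bucket` and `S3key` parameters.

```python
import json


def build_event_pattern(bucket_name):
    """Match CloudTrail-recorded object uploads into a single bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutObject", "CompleteMultipartUpload"],
            "requestParameters": {"bucketName": [bucket_name]},
        },
    }


def build_batch_target(job_queue_arn, job_definition, role_arn):
    """Describe the Batch job to submit and which event fields to hand over."""
    return {
        "Id": "file-arrival-batch-job",  # hypothetical target id
        "Arn": job_queue_arn,            # for a Batch target, Arn is the job queue
        "RoleArn": role_arn,             # role allowed to submit Batch jobs
        "BatchParameters": {
            "JobDefinition": job_definition,
            "JobName": "file-arrival-job",  # hypothetical job name
        },
        # Pass only basic metadata (bucket and key) as job parameters.
        "InputTransformer": {
            "InputPathsMap": {
                "S3BucketValue": "$.detail.requestParameters.bucketName",
                "S3KeyValue": "$.detail.requestParameters.key",
            },
            "InputTemplate": '{"Parameters": {"S3bucket": <S3BucketValue>, '
                             '"S3key": <S3KeyValue>}}',
        },
    }


def create_rule(rule_name, bucket_name, job_queue_arn, job_definition, role_arn):
    """Create the rule and attach the Batch target (needs AWS credentials)."""
    import boto3  # imported here so the builders above work without AWS access

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_event_pattern(bucket_name)),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[build_batch_target(job_queue_arn, job_definition, role_arn)],
    )
```

The input transformer is the only place where arguments can be shaped here, which is why this scenario suits jobs that need nothing beyond the bucket and key.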

Scenario #2: A file arrives in an S3 bucket, a Lambda function has this event configured as its input, and the Lambda function triggers an AWS Batch job using the standard boto3 library. Use this scenario when you need more logic before triggering the Batch job. Typically you might want to split the S3 “file path”, use the file size, etc., and add some conditional logic for the arguments you provide to the Batch job.
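
A Lambda handler for Scenario #2 might look roughly like the sketch below. It is only an illustration: the environment variable names, the `process-` job name prefix and the parameter names are assumptions, and your job definition would need matching parameter placeholders.

```python
import os
import re
from urllib.parse import unquote_plus


def build_job_request(record, job_queue, job_definition):
    """Turn one S3 event record into keyword arguments for Batch submit_job."""
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
    size = record["s3"]["object"].get("size", 0)
    # Split the "file path" into its folder part and the file name.
    prefix, _, file_name = key.rpartition("/")
    # Batch job names allow only letters, digits, hyphens and underscores.
    job_name = re.sub(r"[^A-Za-z0-9_-]", "-", "process-" + file_name)[:128]
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Batch parameters are a string-to-string map.
        "parameters": {
            "bucket": bucket,
            "key": key,
            "prefix": prefix,
            "fileName": file_name,
            "sizeBytes": str(size),
        },
    }


def lambda_handler(event, context):
    """Submit one Batch job per file in the incoming S3 event."""
    import boto3  # imported lazily so build_job_request can be tested offline

    batch = boto3.client("batch")
    job_ids = []
    for record in event["Records"]:
        request = build_job_request(
            record,
            os.environ["JOB_QUEUE"],       # hypothetical environment variables
            os.environ["JOB_DEFINITION"],
        )
        job_ids.append(batch.submit_job(**request)["jobId"])
    return {"jobIds": job_ids}
```

Any conditional logic (for example, choosing a different job definition for large files) would go into `build_job_request`, which keeps it testable without touching AWS.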

Both of these solutions have a serious downside, though. Solution #1 is weak in that you cannot add more complex conditional logic for the Batch job arguments. Solution #2 is weak in that an AWS Lambda function has a hard timeout (5 minutes at the time of writing; the current maximum is 15 minutes), while the Batch job can run much longer, so you never hear back from the Batch job execution in the context of the Lambda function. Of course, you can follow the Batch job in CloudWatch Logs or in the AWS Batch dashboard, but in this case you might want to try out AWS Step Functions. They let you add orchestration to the Lambda functions that fire the Batch jobs. You can see more about AWS Step Functions running Lambdas firing Batch jobs here.
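
To give an idea of what such orchestration involves, here is a sketch of a small status-check Lambda that a Step Functions wait-and-retry loop could call until the job finishes. The handler's input and output shape (`jobId`, `status`, `done`) is an assumption made for illustration, not something prescribed by Step Functions.

```python
# Batch job states that will not change any more.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}


def job_status(describe_jobs_response, job_id):
    """Pick the status of one job out of a Batch describe_jobs response."""
    for job in describe_jobs_response.get("jobs", []):
        if job["jobId"] == job_id:
            return job["status"]
    raise KeyError(f"job {job_id} not found in response")


def lambda_handler(event, context):
    """Check a Batch job once; the state machine loops while 'done' is False."""
    import boto3  # imported lazily so job_status can be tested offline

    batch = boto3.client("batch")
    response = batch.describe_jobs(jobs=[event["jobId"]])
    status = job_status(response, event["jobId"])
    return {
        "jobId": event["jobId"],
        "status": status,
        "done": status in TERMINAL_STATUSES,
    }
```

Each invocation returns quickly, so the Lambda timeout no longer matters; the waiting happens in the state machine rather than inside the function.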


TSQL Large data loads split by a declared batch size

A couple of days back, I was asked how I would use SQL grouping functions to split a huge data load into separate batches. Below is the code I came up with. The next logical step would be to load the generated statements into a temp table, iterate through it, and execute each statement with sp_executesql. It has to be said that if there are big gaps of missing IDs in the PK you are scanning, this might not be the best or most accurate solution, because the batches will then contain very uneven numbers of rows.

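USE [AdventureWorks2012];

-- Batch size: an assumed example value, the original declaration was lost.
DECLARE @BATCHSIZE INT = 10000;

-- Overall ID range, for reference.
SELECT MIN(SalesOrderID) MinID,MAX(SalesOrderID) MaxID FROM [Sales].[SalesOrderHeader];

-- One generated statement per batch. The statement prefix below is an
-- assumption; substitute whatever load statement you need.
SELECT
--MIN(i.SalesOrderID) MinID,
--MAX(i.SalesOrderID) MaxID,
'SELECT * FROM [Sales].[SalesOrderHeader] WHERE SalesOrderID BETWEEN ' +
CAST(MIN(i.SalesOrderID) AS VARCHAR(10)) + ' AND ' +
CAST(MAX(i.SalesOrderID) AS VARCHAR(10)) + '; ' AS BatchStatement
FROM (
	SELECT SalesOrderID,
	SalesOrderID / @BATCHSIZE PartitionID -- integer division buckets the IDs
	FROM [Sales].[SalesOrderHeader] WITH (NOLOCK)
) i
GROUP BY i.PartitionID
--ORDER BY i.PartitionID;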
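
To see why gaps in the IDs matter, the same integer-division grouping can be sketched in a few lines of Python. This is only an illustration with made-up IDs and a hypothetical function name, assuming non-negative integer keys.

```python
from collections import defaultdict


def batch_ranges(ids, batch_size):
    """Group IDs by integer division and report each group's range and row count."""
    groups = defaultdict(list)
    for value in ids:
        # Same idea as SalesOrderID / @BATCHSIZE for non-negative IDs.
        groups[value // batch_size].append(value)
    return [
        (partition, min(members), max(members), len(members))
        for partition, members in sorted(groups.items())
    ]


# Made-up IDs with gaps: each tuple is (partition, min id, max id, row count).
for row in batch_ranges([1, 2, 3, 10, 11, 25], batch_size=10):
    print(row)
```

With these IDs the three partitions hold 3, 2 and 1 rows: the batch size bounds the width of each ID range, not the number of rows in it, so sparse keys produce uneven batches.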