Tagging AWS S3 objects in a file processing pipeline

This is a quick tip on how to keep your AWS file processing pipelines tied together under one consistent trace in your logging and monitoring platform. This matters whenever you have to investigate an issue that spans several functions. It is also a real-life scenario that comes up often in AWS data processing pipelines. In this case, Lambda A is a file generator (a tool that extracts data from a relational database), and Lambda B runs additional validation logic on the file before it gets sent out. The Lambda functions use Boto3 calls to put and get the S3 object tags.


  1. Lambda function A generates a version 4 UUID to use as the trace_id, starts logging under that trace_id, and writes a CSV file to an S3 bucket
  2. Lambda function A tags the CSV file with the key “trace_id”, its value being the UUID
  3. Lambda function B gets the CSV file
  4. Lambda function B reads the trace_id tag from the CSV file and continues processing the file while logging under the same trace_id
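
The four steps above can be sketched roughly as follows. This is a minimal illustration and not the exact code from the pipeline: the function names, the logger name, and the way the trace_id is written into the log message are assumptions. The `s3` argument is expected to be a Boto3 S3 client; `put_object`, `put_object_tagging`, `get_object` and `get_object_tagging` are the Boto3 client calls used.

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")  # hypothetical logger name

TRACE_TAG_KEY = "trace_id"


def lambda_a_generate(s3, bucket, key, csv_body):
    """Steps 1-2: create a trace_id, write the CSV file, tag it."""
    trace_id = str(uuid.uuid4())  # version 4 UUID
    logger.info("trace_id=%s generating %s", trace_id, key)

    s3.put_object(Bucket=bucket, Key=key, Body=csv_body)
    # put_object_tagging replaces the object's whole tag set with this one.
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": TRACE_TAG_KEY, "Value": trace_id}]},
    )
    logger.info("trace_id=%s tagged %s", trace_id, key)
    return trace_id


def lambda_b_validate(s3, bucket, key):
    """Steps 3-4: get the CSV file, read its trace_id tag, keep logging under it."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    tag_set = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
    tags = {tag["Key"]: tag["Value"] for tag in tag_set}
    trace_id = tags.get(TRACE_TAG_KEY)  # None if the object was never tagged

    logger.info("trace_id=%s validating %s (%d bytes)", trace_id, key, len(body))
    return trace_id, body
```

Inside a real Lambda handler you would create the client once with `boto3.client("s3")` and pass it in. Taking the client as a parameter also makes both functions easy to exercise with a stub in unit tests.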


One side note: steps 3 and 4 could be swapped, depending on your use case. Reading the object's tag first leaves a smaller gap in the trace events, but it might be more complex on the coding side.
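
A tag-first variant of Lambda B might look like the sketch below. The function name and the `"unknown"` fallback for untagged objects are assumptions made for illustration. The point is only that `get_object_tagging` runs before `get_object`, so the download itself is already logged under the trace_id.

```python
import logging

logger = logging.getLogger("pipeline")  # hypothetical logger name


def lambda_b_tag_first(s3, bucket, key, tag_key="trace_id"):
    """Read the trace_id tag before downloading the object."""
    tag_set = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
    trace_id = next(
        (tag["Value"] for tag in tag_set if tag["Key"] == tag_key), None
    )
    if trace_id is None:
        # Assumed fallback: keep going, but make the missing tag visible.
        trace_id = "unknown"
        logger.warning("trace_id=%s no %s tag on %s", trace_id, tag_key, key)

    # From here on, every event, including the download, carries the trace_id.
    logger.info("trace_id=%s fetching %s", trace_id, key)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    logger.info("trace_id=%s fetched %s (%d bytes)", trace_id, key, len(body))
    return trace_id, body
```

Part of the extra complexity is visible here: the missing-tag case has to be handled before any trace-scoped logging can begin.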