Python sudoku game generator and solver

I just finished the final version of my Sudoku game generator and solver, written in Python 3. It is available here: https://github.com/datahappy1/sudoku

One interesting fact: in order to consider this project done, I wanted the sudoku solver to be able to solve the “world's hardest sudoku” with ease. The “world's hardest sudoku” puzzle is described here.


 

And here is the calculated solution my sudoku solver came up with:

8 1 2 7 5 3 6 4 9
9 4 3 6 8 2 1 7 5
6 7 5 4 9 1 2 8 3
1 5 4 2 3 7 8 9 6
3 6 9 8 4 5 7 2 1
2 8 7 1 6 9 5 3 4
5 2 1 9 7 4 3 6 8
4 3 8 5 2 6 9 1 7
7 9 6 3 1 8 4 5 2
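For the curious, this kind of solver boils down to plain backtracking. The sketch below is not the code from the repo, just a minimal illustration of the approach, assuming the grid is a 9x9 list of lists with 0 marking empty cells:

def find_empty(grid):
    # return the row, col of the first empty (0) cell, or None if the grid is full
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                return r, c
    return None

def is_valid(grid, row, col, value):
    # value must not repeat in the row, the column or the 3x3 box
    if value in grid[row]:
        return False
    if value in (grid[r][col] for r in range(9)):
        return False
    box_r, box_c = 3 * (row // 3), 3 * (col // 3)
    return all(grid[r][c] != value
               for r in range(box_r, box_r + 3)
               for c in range(box_c, box_c + 3))

def solve(grid):
    # classic backtracking: try 1-9 in the first empty cell, recurse, undo on failure
    cell = find_empty(grid)
    if cell is None:
        return True
    row, col = cell
    for value in range(1, 10):
        if is_valid(grid, row, col, value):
            grid[row][col] = value
            if solve(grid):
                return True
            grid[row][col] = 0
    return False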

Tagging AWS S3 objects in a file processing pipeline

This is just a quick tip on how to keep your AWS file-processing pipelines tied together under one consistent trace in your logging and monitoring platform. This is critical for investigation purposes, and it is a real-life scenario often used in AWS data-processing operations pipelines. In this case, AWS Lambda A is a file generator (a relational database data extraction tool) and Lambda B processes additional file validation logic before the file gets sent out. Boto3 calls in the Lambda functions are used to put and get the S3 object tags (a minimal sketch of these calls follows the numbered steps below).


  1. Lambda function A generates a version 4 UUID used for the trace_id, starts logging under the trace_id and generates a csv file in an S3 bucket
  2. Lambda function A tags the csv file with a key “trace_id” and its value being the UUID
  3. Lambda function B gets the csv file
  4. Lambda function B reads the csv file tag with the trace_id and continues processing this file further while continuously logging under the same trace_id
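To make the flow above more concrete, here is a minimal Boto3 sketch of the two tagging calls; the bucket and key names are made up for illustration:

import uuid
import boto3

s3_client = boto3.client("s3")
bucket, key = "my-pipeline-bucket", "exports/data.csv"  # placeholder names

# Lambda A: tag the generated csv file with the trace_id
trace_id = str(uuid.uuid4())
s3_client.put_object_tagging(
    Bucket=bucket,
    Key=key,
    Tagging={"TagSet": [{"Key": "trace_id", "Value": trace_id}]},
)

# Lambda B: read the trace_id tag back and keep logging under it
response = s3_client.get_object_tagging(Bucket=bucket, Key=key)
tags = {tag["Key"]: tag["Value"] for tag in response["TagSet"]}
trace_id = tags["trace_id"]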

 

One sidenote here: steps #3 and #4 could be swapped, depending on your use case. Getting the tag of the object first leaves a minimal gap in the trace events, but it might be more complex on the coding side of things.

AWS Glue job in an S3 event-driven scenario

I have been working with PySpark under the hood of the AWS Glue service quite often recently, and I spent some time trying to make such a Glue job S3-file-arrival-event-driven. I succeeded: the Glue job gets triggered on file arrival and I can guarantee that only the file that arrived gets processed. However, the solution is not very straightforward, so this is the 10,000 ft overview:


 

  1. A file gets dropped to an S3 bucket “folder”, which is also set as a Glue table source in the Glue Data Catalog
  2. AWS Lambda gets triggered on this file arrival event; besides some S3 key parsing, logging etc., this Lambda makes the following boto3 call.
    import logging
    import boto3
    from botocore.exceptions import ClientError

    glue_client = boto3.client('glue')

    def lambda_handler(event, context):
        ...
        # parsed_job_name = .. parsed out from the "folder" name in the s3 file arrival event
        # full_path = .. parsed from the key in the s3 file arrival event
        try:
            glue_client.start_job_run(JobName=parsed_job_name, Arguments={'--input_file_path': full_path})
            return 0
        except ClientError as e:
            logging.error("terminating - %s", str(e))
            return 1
  3. The Glue job corresponding to the “folder” name in the file arrival event gets triggered, with the --input_file_path Job parameter set to the full S3 path of the arrived file
  4. The Glue job loads the content of the files from the AWS Glue Data Catalog into a Glue dynamic frame like this:
     datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "your_glue_db", table_name = "your_table_on_top_of_s3", transformation_ctx = "datasource0") 

    It also converts the dynamic frame to a Spark dataframe and appends the source filename to each row, like this:

     from pyspark.sql.functions import input_file_name
    datasource1 = datasource0.toDF().withColumn("input_file_name", input_file_name()) 

    and at last, it converts the dataframe back to a dynamic frame like this:

     datasource2 = datasource0.fromDF(datasource1, glueContext, "datasource2") 
  5. In this step, we filter the dynamic frame to process further only the rows coming from the file related to the S3 file arrival event.
     datasource3 = Filter.apply(frame = datasource2, f = lambda x: x["input_file_name"] == args["input_file_path"]) 

    Let’s print out some metadata to the console for debugging purposes as well:

    print "input_file_path from AWS Lambda:" , args["input_file_path"]
    print "Filtered records count: ", datasource3.count()
    
  6. We can now work with the filtered dynamic frame in the Glue job as needed. You should also consider scheduling a maintenance job or a data retention policy on the file arrival bucket. A consolidated sketch of the whole Glue job script follows this list.
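Putting steps 3 to 6 together, a skeleton of the Glue job script could look roughly like this. It reuses the database, table and argument names from the snippets above and is meant as a sketch, not a drop-in script:

import sys
from awsglue.context import GlueContext
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name

# --input_file_path is passed in by the triggering Lambda (step 2)
args = getResolvedOptions(sys.argv, ["input_file_path"])

glueContext = GlueContext(SparkContext.getOrCreate())

# load the whole catalog table sitting on top of the s3 "folder"
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="your_glue_db",
    table_name="your_table_on_top_of_s3",
    transformation_ctx="datasource0")

# append the source file name to every row
datasource1 = datasource0.toDF().withColumn("input_file_name", input_file_name())

# convert the dataframe back to a dynamic frame
datasource2 = DynamicFrame.fromDF(datasource1, glueContext, "datasource2")

# keep only the rows coming from the file that triggered the event
datasource3 = Filter.apply(
    frame=datasource2,
    f=lambda x: x["input_file_name"] == args["input_file_path"])

print("input_file_path from AWS Lambda:", args["input_file_path"])
print("Filtered records count:", datasource3.count())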

To guarantee that each file gets processed only once and never again (in case it gets dropped to the source bucket multiple times), I would enhance the Lambda function with a logging write / lookup mechanism handling the file name (or the file content hash) in a DynamoDB logger table.
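A minimal sketch of such a mechanism, assuming a hypothetical DynamoDB table called processed_files with the file name as its partition key, could be a conditional write that fails when the item already exists:

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("processed_files")  # hypothetical logger table

def is_first_arrival(file_name):
    # the conditional put fails if the file name was already recorded
    try:
        table.put_item(
            Item={"file_name": file_name},
            ConditionExpression="attribute_not_exists(file_name)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise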

Spinning up AWS locally using Localstack

Recently I came across a GitHub project called Localstack. It allows you to spin up a local AWS environment as a service or as a Docker container. You can use such a tool for integration testing in your CI/CD pipelines without paying a cent for the AWS services used, or for all kinds of “hacking AWS” experiments. I’m pretty sure there are many more usage scenarios. Today I’d like to show you how this awesome stack works.

For this step-by-step tutorial, I will work in my Ubuntu environment and use Pipenv, so make sure to check that out if you haven’t already.

Now let’s get our hands dirty and clone the Localstack Git repo.

git clone https://www.github.com/localstack/localstack localstack_playground

Let’s CD into the folder containing the codebase

cd localstack_playground

and now let’s install the localstack tool into Pipenv, together with its dependencies (npm) and the related awscli-local package:

pipenv --three
pipenv install npm
pipenv install localstack
pipenv install awscli-local

Let’s activate the Pipenv shell

pipenv shell

Let’s start Localstack

localstack start

Now the service is running, and the startup output lists the local port each mocked AWS service is listening on.


Let’s open a new terminal window so we can start hitting the mocked-up AWS services now running locally. We’ll create an S3 bucket called tutorial, list the buckets, change the access control list for this bucket, upload a file we create, and then remove the object and the bucket and list the buckets again to verify that everything worked and the teardown cleanup phase passed. For these S3 calls we’ll use the awslocal CLI wrapper around Localstack, but you can use Boto3 as well (a short sketch follows the CLI commands below).

awslocal s3 mb s3://tutorial
awslocal s3 ls
echo Hello World! >> helloworld.txt
awslocal s3api put-bucket-acl --bucket tutorial --acl public-read
awslocal s3 cp helloworld.txt s3://tutorial
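And as mentioned, the same calls work from Boto3 if you point the client at the local S3 endpoint; a minimal sketch (the credentials are dummy values, Localstack does not validate them):

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4572",  # Localstack mocked S3
    aws_access_key_id="test",
    aws_secret_access_key="test",
    region_name="us-east-1",
)

print(s3.list_buckets()["Buckets"])
# upload the same file under an illustrative key
s3.upload_file("helloworld.txt", "tutorial", "helloworld_boto3.txt")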

Let’s view the S3 objects in the browser (the mocked-up S3 service listens on port 4572 by default):

try this url:  http://localhost:4572/tutorial/


try this url:  http://localhost:4572/tutorial/helloworld.txt


Now let’s remove the object and the bucket, and list the buckets again to see that nothing is left

awslocal s3 rm s3://tutorial/helloworld.txt
awslocal s3 rb s3://tutorial
awslocal s3 ls

Now it’s clear you can easily work with AWS services like S3 locally. The other services on the Localstack list work great as well; for instance, let’s create an SNS topic and publish a message to it.

awslocal sns create-topic --name datahappy_topic
# the topic ARN is returned
awslocal sns publish --topic-arn "arn:aws:sns:us-east-1:123456789012:datahappy_topic" --message "datahappy about local mocked up sns"

Enjoy!

API connection “retry logic with a cooldown period” simulator (Python exercise)

This is a very simple API call “circuit-breaker” style simulator I’ve written in Python. Since it’s a stateless code snippet, you should more likely call it a “retry logic with a cooldown period” simulator. But there are valid use cases where stateless is the desired state type, for example when validation of a dataset against a service can either pass or fail, throw an exception and halt the execution flow. This is typical for data-flow styled apps, where a circuit-open state is not acceptable. Anyway, the goal is to make sure that whenever we get a connection (or timeout) error during the API call, we retry after 10 seconds, then after 20 seconds, then after 30 seconds, and then quit trying. The ConnectionError exception is simulated using the non-routable address 10.255.255.1.

However, in the microservices world, if you want to implement a full-scale stateful circuit-breaker, have a look at this article.

import datetime
import time
import logging
import requests

iterator = 1
attempt = 1

# while cycle to simulate connection error using non-routable IP address
while iterator < 40:
    try:
        # if iterator inside the range to simulate success
        if 1 < iterator < 8:
            r = requests.get('https://data.police.uk/api/crimes-at-location?date=2017-02&location_id=884227')
        # else iterator outside the range to simulate the error event
        else:
            r = requests.get('http://10.255.255.1')
        if r.status_code != requests.codes.ok:
            logging.error('Wrong request status code received, %s', r.status_code)
        r = r.json()
        print(r)
        attempt = 1
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout,
            requests.exceptions.ConnectTimeout, requests.exceptions.ReadTimeout) as conn_err:
        print(f'bingo, ConnectionError, now lets wait {attempt * 10} seconds before retrying', (datetime.datetime.now()))
        time.sleep(attempt * 10)
        attempt = attempt + 1
        if attempt > 3:
            logging.error('Circuit-breaker forced exit')
            raise conn_err
    iterator = iterator + 1

Also, try to avoid Python's time.sleep() in AWS Lambdas, as it is not cost efficient; AWS Step Functions would be much more appropriate.

*Updated March 14th 2019: Considering the cost analysis in the article below, it might actually be OK to have time.sleep() in the Lambda; it depends on your use case though.

https://blog.scottlogic.com/2018/06/19/step-functions.html

Tool for migrating data from MSSQL to AWS Redshift part 2 / 3

As promised, here’s an update on this project. On the MSSQL side, the T-SQL code is pretty much ready; you can check out the installation script here:

https://github.com/datahappy1/mssql_to_redshift_data_transfer_tool/tree/master/install/mssql

So how is this going to work? The Python wrapper will call the stored procedure [MSSQL_to_Redshift].[mngmt].[Extract_Filter_BCP] using the pymssql module, like this:


EXEC [mngmt].[Extract_Filter_BCP]
@DatabaseName = N'AdventureWorksDW2016',
@SchemaName = N'dbo',
@TargetDirectory = N'C:\mssql_to_redshift\files',
@DryRun = 'False'
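For illustration, the Python side of that call via pymssql might look roughly like the sketch below; the connection details are placeholders, and I'm assuming the procedure returns the generated file names as a result set, as described further down:

import pymssql

conn = pymssql.connect(server="localhost", user="sa",
                       password="...", database="MSSQL_to_Redshift")
cursor = conn.cursor()

# call the stored procedure with the same parameters as the EXEC example above
cursor.execute("""
    EXEC [mngmt].[Extract_Filter_BCP]
        @DatabaseName = %s,
        @SchemaName = %s,
        @TargetDirectory = %s,
        @DryRun = %s
""", ("AdventureWorksDW2016", "dbo", r"C:\mssql_to_redshift\files", "False"))

# collect the generated file names returned for this run
generated_files = tuple(row[0] for row in cursor.fetchall())
conn.close()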

You’ll provide the database name and the schema name; that’s the database you’re connecting to and the schema containing the source tables with the data you are about to transfer into AWS Redshift.

You’ll also provide the target directory on your hard drive; that’s the location where the .csv files will be generated using bcp inside an xp_cmdshell wrapper in the stored procedure. This target directory will be created for you by the Python code in the final version. The last parameter is called DryRun. When set to True, the BCP extraction query is modified to return 0 rows for each file using the “WHERE 1 = 0” pattern.

Once the Python coding part is ready, these stored procedure parameters will be internal and you’ll set their values as arguments when running the Python app.

The stored procedure returns a Python-ready “string tuple” with the generated file names from the current run, in case it succeeded. This tuple is used further in the Python code to guarantee that we pick up and move over to AWS Redshift only the expected set of files.

The main thing here is that you need to fill out a table called mngmt.ControlTable. In the GitHub installation script, I loaded this table with the AdventureWorks DataWarehouse 2016 columns for demo purposes, so the table holds values like this:

The IsActive flag determines whether the column makes it into the .csv file generated for the corresponding table. Column_id defines the order of the columns persisted into the .csv file.

The stored procedure [MSSQL_to_Redshift].[mngmt].[Extract_Filter_BCP] writes its logs to a table called [mngmt].[ExecutionLogs].


And that’s all for now. Have a look at the installation build script; it should be pretty self-explanatory.

 

Tool for migrating data from MSSQL to AWS Redshift part 1 / 3

Today I’d like to introduce my new project to you: a SQL Server to AWS Redshift data migration tool. There’s not much tooling for this out there on the Internet, so I hope it is going to be valuable for some of you. It’s going to be written in Python 3.7 and, once finished, it will be published to my GitHub account under an MIT licence. What I’m currently doing will be described here on this blog in two phases.

Phase #1 will be all about SQL Server coding; there I’ll need to:

  • extract and filter the data from the SQL Server tables I need to transfer to AWS Redshift
  • persist this data into .csv files using dynamically generated BCP commands (these .csv files will be split based on the target Redshift tables)
  • store these .csv files on a local hard drive.


Phase #2 will be about Python and the AWS Boto3 library, wrapping this tool all together to push the data through all the way to AWS Redshift (a rough sketch follows the list below). That means:

  • Upload the .csv files from Phase #1 into an AWS S3 bucket
  • Run the COPY commands to load these .csv files into the AWS Redshift target tables
  • Clean up the files and write log data
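A rough sketch of what those Phase #2 steps might boil down to, assuming Boto3 for the S3 upload and a Postgres-compatible driver such as psycopg2 for issuing the COPY command (bucket, table, cluster and IAM role names are placeholders):

import boto3
import psycopg2  # Redshift speaks the Postgres protocol

# 1) upload a generated .csv file to S3
s3 = boto3.client("s3")
s3.upload_file(r"C:\mssql_to_redshift\files\DimCustomer.csv",
               "my-staging-bucket", "staging/DimCustomer.csv")

# 2) run the COPY command against the Redshift target table
copy_sql = """
    COPY target_schema.DimCustomer
    FROM 's3://my-staging-bucket/staging/DimCustomer.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    CSV;
"""
with psycopg2.connect(host="my-cluster.redshift.amazonaws.com",
                      port=5439, dbname="dw", user="admin", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(copy_sql)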


As soon as I have this initial version out, I would like to extend this tool to be capable of running incremental data loads based on watermarks as well.

 

A few thoughts on AWS Batch with S3 event-driven usage scenarios

AWS Batch is a great service. This is what AWS says about it: AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 and Spot Instances.

What I want to write about in this blog post is how to make the AWS Batch service work for you in a real-life S3 file arrival event-driven scenario. I use this approach to decouple the metadata of the arrived file from the file itself: the metadata from the file arrival event defines the application logic and the validations that are processed in the Batch job, and once all of that succeeds, the Batch job picks up the file itself for processing.

Let’s look at the two possible options I’ve worked with so far:


Scenario #1: A file arrives in an S3 bucket, CloudTrail captures the event and forwards it to CloudWatch, and that triggers the AWS Batch job, since Batch is a valid CloudWatch Events target. Use this scenario when you don’t need to involve heavy logic in the arguments you pass to your Batch job; typically you would use just basic metadata like the S3 key, the S3 “file path” etc.

*Note: Don’t forget to keep your CloudTrail log files in another bucket than the bucket you use for the file arrival event, otherwise the CloudTrail log files can easily keep triggering the Batch job 🙂

Scenario #2: A file arrives in an S3 bucket, a Lambda function has this event set as its input, and this Lambda function triggers an AWS Batch job using the standard Boto3 library. Use this scenario when you need more logic before triggering the Batch job; typically you might want to split the S3 “file path”, use the file size etc., and add some conditional logic for the arguments you provide to the Batch job.
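For scenario #2, the Boto3 call from the Lambda is essentially batch.submit_job; a minimal sketch (the job queue and job definition names are placeholders):

import boto3

batch_client = boto3.client("batch")

def lambda_handler(event, context):
    # pull the bucket / key out of the s3 event and pass them on as job parameters
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    batch_client.submit_job(
        jobName="file-processing-job",
        jobQueue="my-batch-queue",          # placeholder
        jobDefinition="my-batch-job-def",   # placeholder
        parameters={"bucket": bucket, "key": key},
    )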

Both of these solutions have a serious downside, though. Solution #1 is weak in that you are not able to add more complex conditional logic for the Batch job arguments. Solution #2 is weak in that an AWS Lambda function has a 15-minute timeout, but the Batch job can run much longer, so you never hear back from the Batch job execution in the context of the Lambda function; you’d have to have another Lambda function acting as a Batch job status poller. Of course, you can keep watching over the Batch job in CloudWatch Logs or in the AWS Batch dashboard, but in this case you might want to try out AWS Step Functions. They allow you to add orchestration to your Lambda functions firing the Batch jobs. You can see more about AWS Step Functions running Lambdas that fire Batch jobs here.

Dummy .csv or flat .txt file generator in Python 3.7

I just finished my dummy csv or flat text file generator written in Python 3.7.

In my opinion, such a project is quite unique. I use this tool to generate large files so I can run performance-testing loads in ETL data-ingestion pipelines without loading production data into Dev / Test environments and without the need to de-identify PII.

Feel free to clone, fork or contribute with new features and feedback.

The project is located here:

https://github.com/datahappy1/dummy_file_generator

 

In the future, I’d also like to turn it into a serverless, event-driven AWS project, so stay tuned 🙂