Spinning up a Docker container with Flask and Python

Imagine you need to replicate an existing web API returning JSON (listing all feeds in some system) on your local machine for further development and possible extensions. Today I'll demonstrate how to achieve this using a Docker container, Python and Flask. Note that this tutorial requires some previous experience with Python and Docker. Have a look at Flask; it's a powerful and easy-to-use Python web framework.

The source web API we'll be replicating returns a valid JSON structure listing all the feeds:

[{
  "feed_name": "feed1",
  "feed_type": "feed type 1",
  "filemasks": [
    "filemask11",
    "filemask12"
  ]
},
{
  "feed_name": "feed2",
  "feed_type": "feed type 2",
  "filemasks": [
    "filemask21",
    "filemask22"
  ]
}]

Let’s save this dummy JSON file as feeds.json on our local file system.

Next we'll set up the environment and start with Docker:

mkdir docker-api

mkdir docker-api/app

mkdir docker-api/feeds

cd docker-api


#1) create Dockerfile:

FROM python:3.6-stretch

COPY . .

RUN pip install -r requirements.txt

WORKDIR /app/

ENTRYPOINT ["python3"]

CMD ["app.py"]


#2) create requirements.txt:

Flask==0.10.1


#3) download the feeds.json file from the website to your local filesystem into docker-api/feeds/

 

Let's move forward with the Python application, which reads the feeds.json file from the Docker image and exposes it through the Flask web API. We won't be stepping into any further actions like PUT or POST; we'll just return the complete JSON file listing all the feeds.

 

I prepared the Python app at docker-api/app/app.py, and it looks like this:

import os
from flask import Flask, Response

app = Flask(__name__)

# resolve the path to feeds/feeds.json relative to this file, so the lookup
# does not depend on the current working directory of the process
FEEDS_PATH = os.path.join(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
    'feeds', 'feeds.json')

@app.route('/api/v1/feeds')
def returner():
    with open(FEEDS_PATH, 'r') as f:
        data = f.read()
    return Response(response=data,
                    status=200,
                    mimetype='application/json')

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0')

The next step is spinning up the Docker container (once we build the image, of course):

cd docker-api/ 

docker build -t feeds . 

docker run -d -p 5000:5000 feeds 

docker container list 

Optionally, run docker container kill (or stop) container_id in case you need to "restart" the container. By the way, docker kill vs. docker stop is an interesting topic in its own right and worth a separate read.

Let's confirm that your project structure looks like this:
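docker-api/
├── Dockerfile
├── requirements.txt
├── app/
│   └── app.py
└── feeds/
    └── feeds.json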

And voilà: after running docker run -d -p 5000:5000 feeds, if you open

localhost:5000/api/v1/feeds

in your web browser, you should get a response with the desired JSON listing all the feeds.

You might also want to check out curl.
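If you prefer scripting that check instead of using the browser or curl, a minimal sketch using only the Python standard library (assuming the container is up and publishing port 5000 on localhost) could look like this:

# quick smoke test of the replicated feeds API
import json
import urllib.request

with urllib.request.urlopen('http://localhost:5000/api/v1/feeds') as response:
    feeds = json.loads(response.read().decode('utf-8'))

for feed in feeds:
    print(feed['feed_name'], feed['filemasks'])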


Fibonacci sequence (Python exercise)

Let's continue with the simple Python exercises I've been messing around with lately. This is a classic question at dev job interviews: the Fibonacci sequence. The idea is to come up with code that, for each member of the sequence, sums up the previous two member values, simply expressed like 1, 2, 3, 5, 8, 13…

Below are my personal takes on this problem.

1: The nice and performant solution

#get fibonacci
import sys

def main(arg):
    seq_len = arg
    seq_len_iterator = 2
    var1 = 1
    var2 = 2
    fibonacci = [var1, var2]

    while seq_len_iterator < seq_len:

        var3 = var1 + var2
        fibonacci.append(var3)

        i = len(fibonacci)
        var1 = fibonacci[i-2]
        var2 = fibonacci[i-1]

        seq_len_iterator = seq_len_iterator + 1

    print(f'Fibonacci sequence for {seq_len} sequence members goes like: {fibonacci}')

if __name__ == '__main__':
    try:
        arg = int(sys.argv[1])
        main(arg)
    except (IndexError, ValueError):
        print('Invalid input, must be an integer!')

Execute it with the needed sequence member count argument, for instance:

python.exe C:/codility/fibonacci/__main__.py 10

 

2: The alternative "nested-iterations" solution (it stops performing above roughly 30 sequence members, as durations grow exponentially, but it is another example of a valid solution and can be useful if you need to warm yourself up during long cold winter nights somewhere outside 🙂 )

#get fibonacci
import sys


def main(arg):
    seq_len = arg
    seq_len_iterator = 2
    iterator = 1
    var1 = 1
    var2 = 2
    fibonacci = [var1, var2]

    while seq_len_iterator < seq_len:
        if iterator == var1 + var2:
            fibonacci.append(iterator)
            var1 = var2
            var2 = iterator
            iterator = iterator + 1
            seq_len_iterator = seq_len_iterator + 1
            #print(f'Fibonacci member found in try #: {iterator}')
        else:
            iterator = iterator + 1

    print(f'Fibonacci sequence for {seq_len} sequence members goes like: {fibonacci}')


if __name__ == '__main__':
    try:
        arg = int(sys.argv[1])
        main(arg)
    except (IndexError, ValueError):
        print('Invalid input, must be an integer!')

Execute it with the needed sequence member count argument, for instance:

python.exe C:/codility/fibonacci/__main__.py 10
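For completeness, the same sequence can also be produced by a much more compact iterative sketch using tuple unpacking, seeded with 1 and 2 just like the two takes above:

# compact iterative sketch, seeded with 1 and 2 like the solutions above
def fibonacci(seq_len):
    var1, var2 = 1, 2
    sequence = [var1, var2]
    while len(sequence) < seq_len:
        var1, var2 = var2, var1 + var2
        sequence.append(var2)
    return sequence

print(fibonacci(10))  # [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]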

Binary gap length (Python exercise)

Sometimes I like to mess around at http://www.codility.com, doing the exercises to keep my development skills fresh. This is my take on the binary gap length problem using basic Python 3. The binary gap length is an exercise where you need to come up with code returning the longest run of zeros in a 16-digit "binary" string. This question also often shows up at developer job interviews.

#get max binary zeros gap
import re
import sys


def get_binary_gap(input_seq):
    iterator = 0
    iterator_zeros = '0'

    while iterator < 16:
        if iterator_zeros in input_seq:
            stack = len(iterator_zeros)
            if stack > iterator:
                output = stack

        elif iterator_zeros not in input_seq and len(iterator_zeros) == 1:
            output = 0

        else:
            pass

        iterator_zeros = iterator_zeros + '0'
        iterator = iterator + 1
    return output


def main(arg):
    input_seq = arg

    if len(input_seq) == 16 and bool(re.match("^[0-1]{1,16}$", input_seq)):
        output = get_binary_gap(input_seq)
        print(f'The max binary gap of zeros in sequence {input_seq} is {output}')

    else:
        print(f'invalid input sequence {input_seq}')


if __name__ == '__main__':
    arg = str(sys.argv[1])
    main(arg)

Execute it with the binary sequence argument, for instance:

python.exe C:/codility/binary_gap/__main__.py 0100000101010100
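As a side note, the same result (leading and trailing runs of zeros count too, exactly as in the function above) can be reached with a much shorter sketch built around str.split:

# alternative sketch: split on '1' and measure the longest remaining run of zeros
def get_binary_gap_alt(input_seq):
    return max(len(run) for run in input_seq.split('1'))

print(get_binary_gap_alt('0100000101010100'))  # prints 5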

Materials for my presentation “Get your ETL flow under statistical process control” presented at Bulgaria Web Summit 2018

  • the presentation slides: BWS018
  • the detailed word doc: BWS2018 talk
  • DummyData you need to insert into the dbo.Logger_History table to start
  • and of course the SQL Server database objects scripted out below:
/* ==Scripting Parameters==

Source Server Version : SQL Server 2017 (14.0.1000)
 Source Database Engine Edition : Microsoft SQL Server Enterprise Edition
 Source Database Engine Type : Standalone SQL Server

Target Server Version : SQL Server 2017
 Target Database Engine Edition : Microsoft SQL Server Enterprise Edition
 Target Database Engine Type : Standalone SQL Server
*/

USE [master]
GO

CREATE DATABASE [Logger]
GO

USE [Logger]
GO
/****** Object: Table [dbo].[Logger_History] Script Date: 4/13/2018 5:32:06 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Logger_History](
 [Logger_History_ID] [int] IDENTITY(1,1) NOT NULL,
 [Execution_UTC_Datetime] [datetime] NULL,
 [Job_Name] [nvarchar](255) NULL,
 [Job_Status] [char](1) NULL,
 [Job_Last_Modified_By] [nvarchar](255) NULL,
 [Source_File_Type] [nvarchar](255) NULL,
 [Source_File_Name] [nvarchar](255) NULL,
 [Customer_Name] [nvarchar](255) NULL,
 [Data_Feed_Type] [nvarchar](255) NULL,
 [Server_Name] [nvarchar](255) NULL,
 [Stage_Name] [nvarchar](255) NULL,
 [Stage_Row_Count] [int] NULL,
 [Processing_Start_Datetime] [datetime] NULL,
 [Processing_End_Datetime] [datetime] NULL,
 [Processing_Duration] [int] NULL,
 [Moved_To_ES] [bit] NULL,
 [IsProcessed] [bit] NULL,
PRIMARY KEY CLUSTERED 
(
 [Logger_History_ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Logger_History] ADD DEFAULT ((0)) FOR [Moved_To_ES]
GO
ALTER TABLE [dbo].[Logger_History] ADD DEFAULT ((0)) FOR [IsProcessed]
GO
/****** Object: StoredProcedure [dbo].[gen_3sigma_event] Script Date: 4/13/2018 5:32:06 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE PROC [dbo].[gen_3sigma_event] AS
BEGIN

SET NOCOUNT ON;

BEGIN TRY
BEGIN TRANSACTION

--DECLARATIONS
 DECLARE @ID int, @gen_stmt nvarchar(2000);

SELECT
 Customer_Name,Data_Feed_Type,Stage_Name,
 AVG([Stage_Row_Count]) AS AVG_Stage_Row_Count,
 SQRT(VAR([Stage_Row_Count])) AS SDEV_Stage_Row_Count,
 AVG([Processing_Duration]) AS AVG_Processing_Duration,
 SQRT(VAR([Processing_Duration])) AS SDEV_Processing_Duration
 INTO #tmp_aggregations
 FROM [dbo].[Logger_History]
 GROUP BY Customer_Name,Data_Feed_Type,Stage_Name;

--CALCULATE SCP OUTLIERS
 SELECT
 i.[Logger_History_ID],i.[Execution_UTC_Datetime],
 i.[Sigma_Status_Stage_Row_Count],i.[Sigma_Status_Processing_Duration],
 i.[Stage_Row_Count],i.[Processing_Duration],
 i.[-3Sigma_Stage_Row_Count],i.[+3Sigma_Stage_Row_Count],
 i.[-3Sigma_Processing_Duration],i.[+3Sigma_Processing_Duration],
 i.Server_Name,i.Data_Feed_Type,i.Customer_Name,i.Stage_Name,i.Job_Name,i.Job_Status,
 i.Source_File_Type,i.Source_File_Name,i.[Processing_Start_Datetime],i.[Processing_End_Datetime]
 INTO #tmp_outliers
 FROM (
 SELECT 
 --Row_Count Outlier calculation:
 CASE 
 WHEN [Stage_Row_Count] >= ta.AVG_Stage_Row_Count-ta.SDEV_Stage_Row_Count and Stage_Row_Count <= ta.AVG_Stage_Row_Count + SDEV_Stage_Row_Count
 THEN 'Fits 1 Sigma'
 WHEN [Stage_Row_Count] >= ta.AVG_Stage_Row_Count-(2*SDEV_Stage_Row_Count) and Stage_Row_Count <= ta.AVG_Stage_Row_Count + (2*SDEV_Stage_Row_Count)
 THEN 'Fits 2 Sigma'
 WHEN [Stage_Row_Count] >= ta.AVG_Stage_Row_Count-(3*SDEV_Stage_Row_Count) and Stage_Row_Count <= ta.AVG_Stage_Row_Count + (3*SDEV_Stage_Row_Count)
 THEN 'Fits 3 Sigma'
 ELSE 'SCP Outlier'
 END AS Sigma_Status_Stage_Row_Count
 --,ta.[AVG_Stage_Row_Count] as MEAN,ta.SDEV_Stage_Row_Count as SDEV,ta.AVG_Stage_Row_Count-ta.SDEV_Stage_Row_Count AS [-1Sigma],ta.AVG_Stage_Row_Count+ta.SDEV_Stage_Row_Count AS [+1Sigma],ta.AVG_Stage_Row_Count-(2*ta.SDEV_Stage_Row_Count) AS [-2Sigma],ta.AVG_Stage_Row_Count+(2*ta.SDEV_Stage_Row_Count) AS [+2Sigma]
 ,ta.AVG_Stage_Row_Count - (3*SDEV_Stage_Row_Count) AS [-3Sigma_Stage_Row_Count]
 ,ta.AVG_Stage_Row_Count + (3*SDEV_Stage_Row_Count) AS [+3Sigma_Stage_Row_Count]

--Processing_Duration Outlier calculation:
 ,CASE 
 WHEN [Processing_Duration] >= ta.AVG_Processing_Duration-ta.SDEV_Processing_Duration and Processing_Duration <= ta.AVG_Processing_Duration + SDEV_Processing_Duration
 THEN 'Fits 1 Sigma'
 WHEN [Processing_Duration] >= ta.AVG_Processing_Duration-(2*SDEV_Processing_Duration) and Processing_Duration <= ta.AVG_Processing_Duration + (2*SDEV_Processing_Duration)
 THEN 'Fits 2 Sigma'
 WHEN [Processing_Duration] >= ta.AVG_Processing_Duration-(3*SDEV_Processing_Duration) and Processing_Duration <= ta.AVG_Processing_Duration + (3*SDEV_Processing_Duration)
 THEN 'Fits 3 Sigma'
 ELSE 'SCP Outlier'
 END AS Sigma_Status_Processing_Duration
 --,ta.[AVG_Stage_Row_Count] as MEAN,ta.SDEV_Stage_Row_Count as SDEV,ta.AVG_Stage_Row_Count-ta.SDEV_Stage_Row_Count AS [-1Sigma],ta.AVG_Stage_Row_Count+ta.SDEV_Stage_Row_Count AS [+1Sigma],ta.AVG_Stage_Row_Count-(2*ta.SDEV_Stage_Row_Count) AS [-2Sigma],ta.AVG_Stage_Row_Count+(2*ta.SDEV_Stage_Row_Count) AS [+2Sigma]
 ,ta.AVG_Processing_Duration - (3*SDEV_Processing_Duration) AS [-3Sigma_Processing_Duration]
 ,ta.AVG_Processing_Duration + (3*SDEV_Processing_Duration) AS [+3Sigma_Processing_Duration]

,lm.*
 FROM [dbo].[Logger_History] lm
 INNER JOIN #tmp_aggregations ta 
 ON lm.Customer_Name = ta.Customer_Name
 AND lm.Data_Feed_Type = ta.Data_Feed_Type 
 AND lm.Stage_Name = ta.Stage_Name
 WHERE 
 IsProcessed = 0 AND
 Moved_To_ES = 0
 ) i
 WHERE i.Sigma_Status_Stage_Row_Count = 'SCP Outlier' 
 OR i.Sigma_Status_Processing_Duration = 'SCP Outlier' ;
 
 UPDATE [dbo].[Logger_History]
 SET [IsProcessed] = 1
 WHERE [IsProcessed] = 0;

--LOG SCP OUTLIERS TO WINDOWS EVENT LOG EVENTS
 WHILE EXISTS (SELECT * FROM #tmp_outliers)
 BEGIN
 
 SELECT TOP(1)
 @ID = [Logger_History_ID]
 FROM #tmp_outliers
 ORDER BY [Logger_History_ID];

SELECT @gen_stmt = 
 'EXEC xp_logevent 60000,'+
 '''MESSAGE: !This is a SCP 3 Sigma Outlier based on '+
 CASE 
 WHEN #tmp_outliers.[Sigma_Status_Stage_Row_Count] = 'SCP Outlier'
 AND #tmp_outliers.[Sigma_Status_Processing_Duration] = 'SCP Outlier'
 THEN 'Stage Row Count AND Processing Duration'
 WHEN #tmp_outliers.[Sigma_Status_Stage_Row_Count] = 'SCP Outlier' 
 THEN 'Stage Row Count'
 WHEN #tmp_outliers.[Sigma_Status_Processing_Duration] = 'SCP Outlier'
 THEN 'Processing Duration'
 END +';'+CHAR(10)+
 'Customer: '+#tmp_outliers.[Customer_Name]+' ;'+CHAR(10)+
 'Job Name: '+#tmp_outliers.[Job_Name]+' ;'+CHAR(10)+
 'Job Status: '+#tmp_outliers.[Job_Status]+' ;'+CHAR(10)+
 'Execution UTC Datetime: '+CAST(#tmp_outliers.[Execution_UTC_Datetime] AS VARCHAR(20))+' ;'+CHAR(10)+
 'Source File Type: '+#tmp_outliers.[Source_File_Type]+' ;'+CHAR(10)+
 'Source File Name: '+#tmp_outliers.[Source_File_Name]+' ;'+CHAR(10)+
 'Data Feed Type: '+#tmp_outliers.[Data_Feed_Type]+' ;'+CHAR(10)+
 'Server Name: '+#tmp_outliers.[Server_Name]+' ;'+CHAR(10)+
 'Stage Name: '+#tmp_outliers.[Stage_Name]+' ;'+CHAR(10)+
 'Stage_Row_Count: '+CAST(#tmp_outliers.[Stage_Row_Count] AS VARCHAR(10))+' ;'+CHAR(10)+
 'Processing_Duration: '+CAST(#tmp_outliers.[Processing_Duration] AS VARCHAR(10))+' ;'+CHAR(10)+
 ''', informational;'
 FROM #tmp_outliers
 WHERE @ID = [Logger_History_ID];

EXEC (@gen_stmt);
 --PRINT @gen_stmt;

UPDATE [dbo].[Logger_History]
 SET [Moved_To_ES] = 1
 WHERE @ID = [Logger_History_ID];

DELETE FROM #tmp_outliers WHERE [Logger_History_ID] = @ID;

END

COMMIT TRANSACTION
END TRY

BEGIN CATCH

IF @@TRANCOUNT > 0
 ROLLBACK TRANSACTION;
 
 DECLARE @ErrorNumber INT = ERROR_NUMBER();
 DECLARE @ErrorLine INT = ERROR_LINE();
 DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();
 DECLARE @ErrorSeverity INT = ERROR_SEVERITY();
 DECLARE @ErrorState INT = ERROR_STATE();
 
 PRINT 'Actual error number: ' + CAST(@ErrorNumber AS VARCHAR(10));
 PRINT 'Actual line number: ' + CAST(@ErrorLine AS VARCHAR(10));
 
 RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState);

END CATCH

END
GO

 

TSQL code to list all columns in a DB containing only NULLs

A colleague of mine recently asked me if I knew a quick way to get a list of all the columns in a database that contain only NULLs. Well, here's a script I came up with. I'm pretty sure there are other (even better) options out there, but for some reason I liked this one, so I decided to share the code here.

DECLARE @ID int;
DECLARE @query1 nvarchar(MAX);
DECLARE @TableName nvarchar(255);
DECLARE @ColumnName nvarchar(255);
DECLARE @SchemaName nvarchar(255);

USE AdventureWorksDW2014; --USE THE DB YOU ARE INTERESTED IN
SET NOCOUNT ON;

IF OBJECT_ID('tempdb..#tmp_counts') IS NOT NULL
DROP TABLE #tmp_counts;

IF OBJECT_ID('tempdb..#RESULTSET') IS NOT NULL
DROP TABLE #RESULTSET;

CREATE TABLE #tmp_counts (
ID int IDENTITY (1,1) PRIMARY KEY,
SchemaName nvarchar(255),
TableName nvarchar(255),
ColumnName nvarchar(255));

CREATE TABLE #RESULTSET (
SchemaName nvarchar(255),
TableName NVARCHAR(255),
ColumnName NVARCHAR(255),
[CountOfRows] int,
[CountOfRows_NULL] int);

INSERT INTO #tmp_counts
SELECT
s.name [Schema],
t.name [TableName],
c.name [ColumnName]
FROM sys.tables t
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
INNER JOIN sys.columns c ON c.object_id = t.object_id
WHERE c.Is_Nullable = 1
ORDER BY 1,2;
WHILE EXISTS (SELECT * FROM #tmp_counts)

BEGIN

SELECT
TOP(1) @ID = ID,
@SchemaName = Schemaname,
@TableName = TableName,
@ColumnName = ColumnName
FROM #tmp_counts
ORDER BY ID;

SELECT @query1 =
'INSERT INTO #RESULTSET  SELECT '''+@SchemaName+''','''+ @TableName+ ''' AS TableName,'''+@ColumnName+ ''' AS ColumnName,
(SELECT COUNT(*) AS CountOfRows FROM ['+@SchemaName+'].['+@TableName+'] WITH (NOLOCK)),
(SELECT COUNT(*) AS CountOfRows_NULL FROM ['+@SchemaName+'].['+@TableName+'] WITH (NOLOCK) WHERE ['+@ColumnName+'] IS NULL);';

EXEC (@query1);

DELETE FROM #tmp_counts WHERE ID = @ID;
END

--YOUR FINAL QUERY IS HERE:
SELECT 'Column with all rows NULL' [Check Description], *
FROM #RESULTSET
WHERE CountOfRows > 0 AND
CountOfRows = CountOfRows_NULL;

SSRS cascading report parameters using MDX queries

Cascading SSRS report parameters is a regular usability requirement. In this tutorial, I will demonstrate how to set it up using MDX queries. The background is that the default queries generated by the SSRS wizards are far below the standard we wish to deliver. Let's dive in using the famous multidimensional AdventureWorks DW OLAP project.

Let's start by creating a Dataset for the first parameter in the cascade. Start in the Dataset Query Designer as shown below.

[screenshot: QueryDesigner1]

The MDX query I used looks like this:

WITH
MEMBER [Measures].[ParCaption] AS [Product].[Category].CURRENTMEMBER.MEMBER_CAPTION
MEMBER [Measures].[ParValue] AS [Product].[Category].CURRENTMEMBER.UNIQUENAME

SELECT
{[Measures].[ParCaption], [Measures].[ParValue]} ON COLUMNS,
[Product].[Category].ALLMEMBERS ON ROWS
FROM [Adventure Works]

The next step is actually creating the report parameters; the first parameter, P_ProductCategory, should be set like this:

[screenshot: Parameter_1]

[screenshot: Parameter_2]

The second parameter needs to be created exactly the same way once you prepare its Dataset as described in the next step.

Continue by creating another SSRS Dataset used for the second parameter P_ProductSubcategory in the cascade. This parameter value gets calculated on the fly as you pick the first parameter value.

WITH
MEMBER [Measures].[ParCaption] AS [Product].[Subcategory].CURRENTMEMBER.NAME
MEMBER [Measures].[ParValue] AS [Product].[Subcategory].CURRENTMEMBER.UNIQUENAME

SELECT
{[Measures].[ParCaption], [Measures].[ParValue]} ON COLUMNS,
[Product].[Subcategory].[Subcategory] ON ROWS
FROM [Adventure Works]
WHERE STRTOSET(@P_ProductCategory)

Notice the STRTOSET function. If the parameter allowed only a single member value, we could use STRTOMEMBER instead. If we were passing multiple parameters, we would write WHERE ( STRTOSET(@P_ProductCategory), STRTOMEMBER(@ProductBooleanParameter) ).

To make this work, we need to set the parameters of the second Dataset like this:

[screenshot: Dataset_Report]

Notice that you might run into an error (actually a VS bug) when writing the MDX query for the Dataset in the query editor, saying "The query contains the XXXXXName parameter, which is not declared." The solution is rather quick. Spoiler: look for the Query Parameters icon in the top menu (highlighted with an orange box in the Query Designer screenshot above) and set your parameters manually for the first time, with some default value as well; that should make things work here.

The next step is creating the result dataset for the SSRS report matrix. The query I used is trivial and looks like this:

SELECT (
[Product].[Category].[Category],
[Product].[Subcategory].[Subcategory],
[Product].[Product].[Product]) ON ROWS,
[Measures].[Order Count] ON COLUMNS
FROM (
SELECT (STRTOSET(@P_Product_Category), STRTOSET(@P_Product_Subcategory)) ON COLUMNS
FROM [Adventure Works]
)

Notice here that in MDX you cannot use the same dimension hierarchy more than once, so you cannot use it in the SELECT and in the WHERE clause at the same time. This is the reason I decided to go for a sub-select, but there are many other options you can easily find on the internet. And here you go: after choosing Bikes and Clothing in the Product Category parameter, you get only the relevant Product Subcategories. Below are a few screenshots of the simple SSRS report:

[screenshot: Parameter_3]

[screenshot: Report]

TSQL histogram

Sometimes I like to fiddle around with TSQL. I'm not sure how useful this trick might be, but here is code I came up with that delivers histograms based on your data and a few variables you define. The dataset for this script is the well-known AdventureWorks DW dbo.FactInternetSales. You can define the bucket count and the bar chart width variables to fine-tune your output. This histogram splits the dataset into the declared buckets based on the ProductKey FK column values.

SET NOCOUNT ON;

USE AdventureWorksDW2014;

DECLARE @BucketCount DECIMAL(8,3) = 10;
DECLARE @BarChartWidth INT = 100;
DECLARE @iKeyCount DECIMAL(8,3) = (SELECT COUNT(DISTINCT ProductKey) FROM FactInternetSales);
DECLARE @iBucketSize DECIMAL(8,3) = @iKeyCount / @BucketCount;

SELECT
i.*,
REPLICATE('=', @BarChartWidth * i.SumSalesAmount / (SELECT SUM(SalesAmount) FROM factInternetSales)) AS [BarChart]
FROM (
     SELECT
     ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) Bucket_ID,
     MIN(ii.ProductKey) Bucket_Range_From,
     MAX(ii.ProductKey) Bucket_Range_To,
     COUNT(ii.Bucket_ProductKeys_Count) Bucket_ProductKeys_Count,
     SUM(ii.SumSalesAmount) SumSalesAmount
     FROM (
          SELECT
          ProductKey,
          CEILING(CAST((ROW_NUMBER() OVER (ORDER BY ProductKey) )/@iBucketSize AS DECIMAL(8,3))) iBucket_ID,
          COUNT(*) Bucket_ProductKeys_Count,
          SUM(SalesAmount) SumSalesAmount
          FROM FactInternetSales
          GROUP BY ProductKey
          ) ii
      GROUP BY ii.iBucket_ID
      ) i;

And the result may look like this:

[screenshot: histogram1]

But I would also like to see a more accurate solution. So, digging deeper, I came up with code that splits the dataset into the declared buckets based on the composite PK SalesOrderNumber, SalesOrderLineNumber (I also added precision to the decimal datatype). This code still feels quite straightforward to me, but it gets a little more complex.

SET NOCOUNT ON;

USE AdventureWorksDW2014;

DECLARE @BarChartWidth INT = 100;
DECLARE @BucketCount DECIMAL(38,18) = 10;
DECLARE @iCount DECIMAL(38,18) = (SELECT COUNT(*) FROM FactInternetSales);
DECLARE @iBucketSize DECIMAL(38,18) = @iCount / @BucketCount;

;WITH CTE AS
(
     SELECT
     i.iKey,
     i.iID,
     i.iBucket_ID,
     i.SumSalesAmount
     FROM (
          SELECT
          TOP 100 PERCENT
          CAST([SalesOrderNumber] AS VARCHAR) + '-' + CAST([SalesOrderLineNumber] AS VARCHAR) iKey,
          ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) iID,
          CEILING(CAST((ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) ) / @iBucketSize AS DECIMAL(38,18))) iBucket_ID,
          [SalesOrderNumber],
          [SalesOrderLineNumber],
          SUM(SalesAmount) SumSalesAmount
          FROM FactInternetSales
          GROUP BY [SalesOrderNumber],[SalesOrderLineNumber]
          ORDER BY CAST(REPLACE([SalesOrderNumber],'SO','') AS INT),[SalesOrderLineNumber]
          ) i
)

SELECT
i.iBucket_ID,
i.Bucket_Range_From,
i.Bucket_Range_To,
(SELECT Cfrom.iKey FROM CTE Cfrom WHERE Cfrom.iID = i.Bucket_Range_From) Bucket_Range_SalesOrderKey_From,
(SELECT Cto.iKey FROM CTE Cto WHERE Cto.iID = i.Bucket_Range_To) Bucket_Range_SalesOrderKey_To,
i.Bucket_Row_Count,
i.Bucket_Sales_Amount,
REPLICATE('=', @BarChartWidth * i.Bucket_Sales_Amount / (SELECT SUM(SalesAmount) FROM factInternetSales)) AS [BarChart]
FROM (
     SELECT
     TOP 100 PERCENT
     ii.iBucket_ID,
     MIN(ii.iID) Bucket_Range_From,
     MAX(ii.iID) Bucket_Range_To,
     COUNT(ii.iID) Bucket_Row_Count,
     SUM(ii.SumSalesAmount) Bucket_Sales_Amount
     FROM
     CTE ii
     GROUP BY ii.iBucket_ID
     ORDER BY ii.iBucket_ID
     ) i
ORDER BY iBucket_ID;

The final result is as expected (check the Bucket_Row_Count column) and can look like this:

[screenshot: histogram2]

 

Compare multiple DB schemas

Sometimes you may need to quickly examine how databases that are supposed to share the same schema actually differ. This quick query lets you find the outlying and the intersecting columns across multiple databases. If you are looking only for the outlying columns and the DBs they appear in, use the code below as it is. If you are looking for the intersection of multiple DBs, just switch HAVING COUNT(0) <> to HAVING COUNT(0) = and remove the list of databases the columns are in.

In my example, I created three DBs, Test1 to Test3, with some differences between their schemas. The key determining a unique column is defined in the Compound_Key column.

[screenshot: schema_compare]

IF OBJECT_ID('tempdb..#DB_Schema_Compare') IS NOT NULL
DROP TABLE #DB_Schema_Compare;
CREATE TABLE #DB_Schema_Compare
( [DB_Name] NVARCHAR(100),
[Compound_Key] NVARCHAR(Max)
)

USE Test3;

INSERT INTO #DB_Schema_Compare
SELECT
'Test3',
--s.name SchemaName,
--t.name TableName,
--c.name ColumnName,
--c.is_computed ColumnIsComputed,
--c.is_identity ColumnIsIdentity,
--c.is_nullable ColumnIsNullable,
--c.collation_name ColumnCollationName,
--c.max_length ColumnMaxLength,
CAST(
CAST(s.name AS VARCHAR(200)) + '^' +
CAST(t.name AS VARCHAR(200)) + '^' +
CAST(c.name AS VARCHAR(200)) + '^' +
CAST(c.is_computed AS VARCHAR(1)) + '^' +
CAST(c.is_identity AS VARCHAR(1)) + '^' +
CAST(c.is_nullable AS VARCHAR(1)) + '^' +
ISNULL(CAST(c.collation_name AS VARCHAR(200)) + '^','^') +
CAST(c.max_length AS VARCHAR(10)) + '^' +
CAST(ty.name AS VARCHAR(200))
AS VARCHAR(MAX)) CompoundKey
FROM sys.columns c
inner join sys.tables t on c.object_id = t.object_id
inner join sys.schemas s on s.schema_id = t.schema_id
inner join sys.types ty on ty.user_type_id = c.user_type_id; -- joining on user_type_id avoids duplicate rows for aliased types such as sysname

SELECT
i.*,
STUFF(( SELECT ',' + SUB.[DB_Name] AS [text()]
FROM
#DB_Schema_Compare SUB
WHERE
SUB.Compound_Key = i.Compound_Key
ORDER BY SUB.[DB_Name]
FOR XML PATH('')
), 1, 1, '' )
AS [DBs_we_HAVE_this_compound_key_in]
FROM (
     SELECT
     COUNT(0) CNT,
     Compound_Key
     FROM
     #DB_Schema_Compare
     GROUP BY [Compound_Key]
     HAVING COUNT(0) <> (SELECT COUNT(DISTINCT [DB_Name]) FROM #DB_Schema_Compare)
) i;

Waterfall chart in Excel 2013 and older

Excel 2016 delivers a waterfall chart type, but what about older versions? This is a neat and nice workaround I actually learned from the business folks. The yellow values come from a data model or any other data source, except for the Measure Bridge End, which is a SUM of cell C11 and the range D12:D16. The rest of the values are calculated as shown in the comments. Hopefully all the steps taken are understandable from the full-size screenshot below.

[screenshot: capture]
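The screenshot carries the exact cell formulas, but the core of this kind of workaround is typically the floating-bar arithmetic: every bridge member is drawn as a visible bar stacked on top of an invisible base equal to the running total so far. As a rough, purely hypothetical illustration of that arithmetic (the numbers below are made up, not taken from the workbook):

# hypothetical illustration of the floating-bar arithmetic behind a manual waterfall chart
start = 100
deltas = [30, -20, 15, -40, 10]   # made-up bridge members

running = start
for delta in deltas:
    base = running if delta >= 0 else running + delta   # bottom of the visible bar
    print(f'base={base:>4}  bar={abs(delta):>3}')
    running += delta

print(f'bridge end = {running}')   # plays the role of the summed Bridge End measure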

TSQL Large data loads split by a declared batch size

A couple of days back, I was asked how I would use SQL grouping functions to split a huge data load into separate batches. Below is the code I came up with. The next logical step would be to load the statements into a temp table, iterate through it and execute the statements with sp_executesql. It needs to be said that if there are big gaps of missing IDs in the PK you are scanning, this might not be the best and most accurate solution.


USE [AdventureWorks2012];
GO

DECLARE @BATCHSIZE INT = 10000;

/*CHECK THE MIN AND MAX IDs IN THE SOURCE TABLE*/
SELECT MIN(SalesOrderID) MinID,MAX(SalesOrderID) MaxID FROM [Sales].[SalesOrderHeader];
/**/

SELECT
--i.PartitionID,
--MIN(i.SalesOrderID) MinID,
--MAX(i.SalesOrderID) MaxID,
'INSERT INTO TARGET(..) SELECT COL1,COL2,.. FROM SOURCE WITH (NOLOCK) WHERE SalesOrderID BETWEEN '+
CAST(MIN(i.SalesOrderID) AS VARCHAR(10)) + ' AND ' +
CAST(MAX(i.SalesOrderID) AS VARCHAR(10)) + '; '
AS STMT
FROM
(
	SELECT
	SalesOrderID,
	SalesOrderID / @BATCHSIZE PartitionID
	FROM [Sales].[SalesOrderHeader] WITH (NOLOCK)
) i
GROUP BY i.PartitionID
--ORDER BY i.PartitionID;
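The statements above are only generated, not executed. As a hedged sketch of the next logical step mentioned earlier (driven here from Python instead of a temp table and sp_executesql; the pyodbc dependency, the connection string and the finished column lists are all assumptions), the loop could look roughly like this:

# hypothetical driver: fetch the generated batch statements and run them one by one
# (assumes pyodbc, a reachable SQL Server instance, and that the INSERT/SELECT column
#  lists in the generated statements have been replaced with real columns)
import pyodbc

conn = pyodbc.connect(
    'Driver={ODBC Driver 17 for SQL Server};'
    'Server=localhost;Database=AdventureWorks2012;Trusted_Connection=yes;'
)
cursor = conn.cursor()

batch_sql = """
SELECT
'INSERT INTO TARGET(..) SELECT COL1,COL2,.. FROM SOURCE WITH (NOLOCK) WHERE SalesOrderID BETWEEN ' +
CAST(MIN(i.SalesOrderID) AS VARCHAR(10)) + ' AND ' +
CAST(MAX(i.SalesOrderID) AS VARCHAR(10)) + '; ' AS STMT
FROM (
    SELECT SalesOrderID, SalesOrderID / 10000 PartitionID
    FROM Sales.SalesOrderHeader WITH (NOLOCK)
) i
GROUP BY i.PartitionID;
"""

statements = [row.STMT for row in cursor.execute(batch_sql).fetchall()]

for stmt in statements:
    cursor.execute(stmt)   # executing one batch at a time keeps every transaction small
    conn.commit()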