All materials regarding my talk are now published in my github account, you can check this out here:
All materials regarding my talk are now published in my github account, you can check this out here:
And here we are , at the final part of this tutorial. This third part is going to be mainly about the Data quality issue fixing and possible reporting layers. I have been working with this framework in an environment, where we’ve had SSRS reports embedded in a .NET web application. The links in the SSRS subreports for each Data quality rule would navigate you onto the specific business objects in the application and there you would fix the errors. That’s the part where I find this solution really straightforward and powerful. If you’re in a completely different setup environment, you can possibly write your own simple web page for the error fixing module or you can come up with any other possible UI you can think of. The same with the reporting module. I have used a simple SSAS Tabular model with Excel sheets pushed onto the internal Sharepoint site, but you can really go in any possible direction you can think of.
So once you have the DQ schema tables filled with rules and their results, you can easily create an SSRS report dashboard ( dataset joining dql.ValidationResult and dql.ValidationRun tables ) looking something like this for example:
The datasets for this report are a piece of cake to come up with when you think about the data model we have from the second part of this tutorial. You can use for example:
DECLARE @CurrValidationRun_ID INT; SELECT @CurrValidationRun_ID = ISNULL(MAX(ValidationRun_ID),0) FROM dql.ValidationRun WHERE ValidationRunEnd IS NOT NULL; DECLARE @PrevValidationRun_ID INT = @CurrValidationRun_ID - 1; SELECT VR.ValidationRule_ID, VR.ValidationRuleName, (SELECT VRP.ValidationErrorCount FROM dql.ValidationResult VRP WHERE VRP.ValidationRun_ID = @PrevValidationRun_ID AND VRP.ValidationRule_ID = VR.ValidationRule_ID) AS PrevErrorCount, (SELECT VRP.ValidationErrorCount FROM dql.ValidationResult VRC WHERE VRC.ValidationRun_ID = @CurrValidationRun_ID AND VRC.ValidationRule_ID = VR.ValidationRule_ID) AS CurrentErrorCount FROM dql.ValidationRule VR;
or you can use CROSS APPLY or what ever technique you are the most comfortable with. The Data quality rule column in the table should be a link to a subreport containing the specific rule errors with links to the CRM application business objects.
When it comes to the drill down into the subreports, you need to query the dql.ValidationResultItem table but you are facing one issue here. As you’ve seen, I have used 6 ResultCol columns for the results and in the Contact with multiple invoicing addresses example you have 6 columns filled with data. But what if you had used less than 6 columns in some rules? So how can you prepare the report column layout in the specific rule results subreport? The same problem is with the SSRS matrix / table column headers. I leave this part up to your imagination.
You can go with a simplistic approach, show all 6 columns in the table report layout and have the column visibility and column header value based on some Data quality rule – specific configuration and solve this issue with an lookup expression. You could also choose a more technical approach and build dynamic SQL statements using the Data quality query metadata and prepare a dataset for a matrix style report with some pivoting / unpivoting. I would not go into more details here as everyone is working in an different environment and this is really the fun part where you can spend some time evaluating the best available options. I leave the reporting for Management up to you as well. It is really the fun part where sky is the limit and the solution is totally based on your preferences.
This lightweight Data quality framework can be considered pretty easy to build yet very powerful. You can further expand the functionality in many ways. You can even make this framework call available Datasets used for example for addresses cleaning and have some auto corrections being done. You should not forget about row-level security , so the users cannot touch each others owned business entities. But the main thought behind this is to provide the users an easy to use interface through which they can easily fix the Data quality issues that they are responsible for. Just remember to keep it simple and well performing!
The database back-end part of this simple framework will be made of these parts:
schema dql – containing all the Data quality objects mentioned below
table dql.ValidationRule – containing a list of Data quality rules with their meta information
table dql.ValidationRun – containing information about the validation runs
table dql.ValidationResult – containing aggregated information with the results needed for your reporting
table dql.ValidationRuleItem – containing the lowest level info related to the business object specific errors, ie. a list of the Contacts having more than 1 invoicing address filled in SQL_VARIANT datatype columns
stored procedure dql.ValidationRunner – responsible for the Data quality rules evaluation, maintenance of data and aggregations for reporting
stored procedure dql.ValidationRuleAPI – responsible for the Data quality rules management
SQL Agent job DataQuality – responsible for scheduled execution of the stored procedure dql.ValidationRunner
Here is a relational schema to give you a better idea:
And how do we bring the Data Quality evaluation to life? First we need to fill the Data quality rules and their meta information into the dql.ValidationRule table.
INSERT INTO dql.ValidationRule ( ValidationRuleName , ValidationRuleDescription , ValidationRuleQuery , ValidationRuleColumns , IsActive ) SELECT 'Contacts with multiple invoicing Address', 'List of all contacts with their owners having multiple invoicing addresses in the CRM', 'SELECT Contact_ID, ContactName, ContactOwner_ID, ContactOwnerName, AddressType, COUNT(*) AS [InvoicingAddressCount] FROM dbo.Contact WITH (NOLOCK) WHERE AddressType = ''Invoicing'' GROUP BY Contact_ID, ContactName, ContactOwner_ID, ContactOwnerName, AddressType HAVING COUNT(*) > 1', 'Contact_ID,ContactName,ContactOwner_ID,ContactOwnerName,AddressType,InvoicingAddressCount', 1 UNION ...
At this phase you can build your simple stored procedure API. Just prepare a stored procedure called dql.ValidationRuleAPI that can either insert, update or delete a Data quality rule based on the provided parameters.
Next you need to create the stored procedure dql.ValidationRunner which will take care of the main part. It has to create a row in the dql.ValidationRun table to mark the validation run has started. Then it needs to iterate through the table dql.ValidationRule, use the value from the ValidationRuleQuery column, build a dynamic SQL statement and fill the result table dql.ValidationResultItem with each Data quality query results. ( No need for cursors here, you can refer to this link to use much more lightweight solution using a temp table )
DECLARE @statement NVARCHAR(MAX); SET @statement = N'YOUR DATA QUALITY QUERY SELECT.. GOES HERE'; INSERT INTO dql.ValidationResultItem( ValidationRun_ID , ValidationRule_ID , ResultCol1 , ResultCol2 , ResultCol3 , ResultCol4 , ResultCol5 , ResultCol6 ) EXEC sp_executesql @statement;
**Note: In case you’ll have different result column count for each query, you need to generate the INSERT INTO… statement also in a dynamic SQL manner. You can use some simple logic based on the column count metadata for the specific rule. The result would look like EXEC sp_executesql @statement where @statement is set something like : ‘INSERT INTO dql.ValidationResultItem (…) SELECT DATA QUALITY QUERY SELECT GOES HERE’…
After each rule being finished, this SP needs also to write the number of found results for the rule into the dql.ValidationResult table. Once all the rules are finished and all the results are inserted into the result table, this SP needs to mark the end of the validation run back into the dql.ValidationRun table and do some maintenance so the result table does not grow out of control. This table could easily become really huge, being considered the job is scheduled once in a hour, evaluating in average 50 rules with about approx. 100 results per rule. At any cost avoid using DELETE statement here, use TRUNCATE. I have spilled the results for the last 2 runs into a temp db table, truncated the result item table and inserted the spilled results back. I am pretty much sure you can come up with a more elegant solution.
Here is a simple UML diagram explaining the dql.ValidationRunner stored procedure workflow.
One important note. DO make sure, all the Data quality queries have NOLOCK hints. You don’t want to stop your production CRM application with your queries. Also I strongly advise to have the framework run on a different server then your CRM database production.
This time I would like to talk about building your own simple Data quality framework using T-SQL. This framework will be implemented as a SQL Agent Job evaluating the Data quality queries against your clients CRM Database and persisting the results from the last 2 runs in a separate DB schema with some front end for Data quality issues fixing and reporting. This solution will also be generic ( not dependent on column names ) so you can combine all kinds of Data quality rules the business can think of.
This tutorial will have 3 parts. In the first part, I would like to give an high perspective overview of what we are building and how this Data quality tool can work. The second part will focus on a high level overview of DB back-end steps needed and the third part will focus on the Data quality fixing and reporting part.
Let me start with saying, that Data quality is absolutely critical when it comes to a group of people entering data into a system. I have a lot of experience from an environment, where there was an internal CRM / ERP system being developed. Issues in Data quality would really be business-stoppers back there ( and probably anywhere else ). Imagine that you would like to invoice your client, but someone from the back-office team had inserted incomplete address or you have multiple addresses assigned to the client and you don’t know which one should be used for invoicing. And trust me, managers who cannot invoice their clients are not the right people to mess with. I am pretty sure that you can come up with a plenty of issues like this.
With the solution I am about to present, you don’t have to buy an pricey 3rd party application to solve these kinds of issues. So lets go through the possible scenarios.
1.Business Analyst knows exactly what is needed to ensure the required data quality at the clients side and manages these rules written in SQL language in an API he is provided with. This API could be implemented as a simple stored procedure used for insert / delete / update based on the parameters provided.
2.CRM Manager comes up with rules needed to have the required data quality in the client application. Let’s suppose this person does not know how to write SQL code, so these rules are handed over to the Business Analyst as verbally described definitions ( i.e. 1 contact can have multiple addresses but only one address can be used for invoicing ) and the Business Analyst translates these rules into the SQL language for example like this:
SELECT Contact_ID, ContactName, ContactOwner_ID, ContactOwnerName, AddressType, COUNT(*) AS [InvoicingAddressCount] FROM dbo.Contact WITH (NOLOCK) WHERE AddressType = 'Invoicing' GROUP BY Contact_ID, ContactName, ContactOwner_ID, ContactOwnerName, AddressType HAVING COUNT(*) > 1
and passes them into the API.
3.You could also have the DB developer come up with these rules and implementing them. ( definitely not my preferred scenario )
4.Once the rules are in place, the DB job is scheduled and started and the rules are already being evaluated against the CRM Database, there will be a need for 2 front ends on the other side. One front end will be used for the Data quality issues fixing by the CRM users and the second front end will be used for reporting to Management. Seeing how well do your employees manage their business objects Data quality is an interesting KPI.
Next time, I will talk more about the DB project implementation.