Fuzzy is an adjective and it means “having shapes that do not have clear edges”.
Some time ago, I have been asked by my boss to lookup our company clients in the Panama Papers database extract, recently published on the internet. The challenge seemed pretty clear. Hundreds of thousands of company clients from the region located in CRM MSSQL database table on one side and about 1000+ clients on the Panama Papers side in a .xls file. Problem was, that there were quite many company type abbreviations i.e. Company XXX AG, Company XXX GMBH and you had only a very limited time to come up with some results.
The first thing that came across my mind was a SSIS fuzzy lookup data flow task, but this was a one-off and a in-a-hurry task, so there was no time for setting up a full size SSIS fuzzy matching project.
The second thought I had pointed me to this Excel add-in I am going to write about today.
Here is the link to the Microsoft download site where you can download this add-in.
I am pretty sure, that this is no rocket science and you will find using this tool pretty straightforward, but let me describe my personal experience.
So first, you need to install the setup.exe file and run Excel again and then you need to go into File -> Options -> Add-Ins -> Manage Excel Add-Ins and enable this Add-in. Once you have this Add-in in the tabs section, you are ready for some fuzzy matching magic.
So at this phase you need to copy these 2 datasets into two separate sheets and you also need to create table objects on the datasets. ( Insert -> Table 🙂 )
Next step is to choose the columns to be matched and the number of matched results per 1 row from the left table.
What I suggest is, to start with a small dataset and try to find the perfect Similarity Threshold between this small dataset and the set containing the results we are looking for, in this case the Panama Papers dataset. Once you are happy with the results, you can apply this setup on the whole dataset of your clients. As you will notice, when the matches are being processed, your Excel might freeze or stop responding. No worries, just let it run and keep on waiting for the results!
In my case, threshold 85 was the perfect choice. I got also some false-positive results, but at least I was sure I am not missing out some clients and deleting these results is quite quick anyway.
Please notice, that you have to setup your Windows Locale settings like language and formats and area to US, otherwise this add-in will not deliver any results and Excel will very likely crash.