What exactly is dirty data?

Well, the truth is that it can mean different things to different people, working with different types of data, but at its most basic level, dirty data is anything incorrect.

But what does that look like in the real world I hear you cry? Well, in the next 1,000 words or so, I am going to try and answer that question in some detail…

Misspelt Names

This happens more than you think. If it’s supplier names, it could be a simple switch of letters from ABC Printing to ABC Printign, a missing letter such as T Shoemit instead of T Shoesmith, or something much more subtle like AT Jones, instead of TA Jones which may not be easily picked up.

If you’re dealing with personal information, it’s doubly important to get the name right because of data protection regulations such as GDPR. Very recently I received a piece of mail for my new limited company ‘The Classification Guru Ltd’, the address was correct, and my first and middle names were correct, but I had someone else’s surname and a business name that wasn’t mine.

When I checked on Companies House, I could see that the surname and the business name were related to one person – everything else was my information. What I suspect happened in this instance is one of several things: firstly that the list of names for mailings was in Excel, and someone possibly hadn’t filtered all columns and the information was mixed up. Secondly, it could have been that some lines of data were removed, which caused some of the information in certain columns to shift up or down and misalign.

Or, it could have been something as simple as a cut and paste error that caused the problem. This could have easily been rectified by applying some spot checks to the data before it was used as a mailing list. I’ll cover this further later.

Incorrect or Misleading Descriptions

In the course of the work I do, I see this a lot in invoice or PO descriptions. It could be something as simple as “services” in the description, and the person’s name as the supplier. Well, who are they? The copywriter, the lawyer, another consultant of some sort? It can be very tricky to find this out and so, more often than not, this will end up being classified under ‘Professional Services’. But what if it’s actually plumbing or electrical services, and should sit under ‘Facilities’? It might be a small value, but what if it’s not and is a large amount of spend that is not being accounted for correctly?

Misleading descriptions in spend data can happen easily if the data is not viewed in context. For example, if you only look at the information in the invoice or PO line description column, but not the supplier name, this can lead to misclassification. You might have cleaning as a description, but the supplier is Dell or IBM. This completely changes the context of the information from janitorial services to data or computer or data cleaning services.

Missing or Incorrect Codes

This can be a real issue in the manufacturing and supply chain industries. There are several reasons why a product code might be missing. If it’s an older product then historically it might never have been assigned a code. Or perhaps the code wasn’t available when the product was set up, but no one followed up to add it in once the code had been created.

And then there’s the “can’t be bothered” aspect. We don’t like to think about it, but some people just can’t be bothered to find out the information they need, and if they’re not being monitored and know they can get away with it, they’ll continue to set up products with missing information, and it could be wider than just the product code, it could be dimensions and weights, which are critical to many different areas of the business.

And just as harmful to the business is incorrect codes, it could be that the code has been mistyped with some numbers mixed up, or perhaps one number or letter missing, or it could be something more subtle like a zero being replaced with the letter O. This can all result in duplicate records for these products, the wrong items being ordered, shipped, manufactured or number inaccurately reported in inventory, etc. costing the business unnecessary expenditure.

No Standard Formats for Addresses

I see this A LOT in both supplier and personal data. There are multiple ways that an address can be recorded, sometimes it’s all in one cell, sometimes split over a number of columns, and then I’ve seen cities in the county or state column, or the postal or zip code in the city or county column.

It’s a mess and is there to some degree in nearly every data set.

Then you have abbreviations. Terrace could be Terr, Place – Plc, Road – Rd, Street – St, etc… and this could lead to near-duplicates, multiple records, and information split between these multiple records, leading to incorrect information and reporting being used in the business.

No Standard Units of Measure

This can cause a lot of issues, especially if you are trying to analyse or report on a specific product.

It’s even the little things like whether you decide to have a space between the number and the unit of measure that can cause near duplications, it’s much better to be clear and specific with your team to avoid multiple versions of the same items.

Currency Issues

This has certainly caused me issues when trying to report values to match, if you are not aware that the values you are working with are in multiple currencies, then you could spend hours trying to get figures to match up, trust me I’ve been there! In particular, when working with something like Swedish Krona versus GBP or USD, the values are significantly higher, so it could end up looking like you’ve spent £500 on a taxi…

Incorrect/Partially Classified Spend Data

For me, this is worse than not having classified data at all. When I have clients that come to me for help, if they already have classified data, I immediately disregard everything and start again from scratch.

Firstly, they wouldn’t be using my services if there wasn’t an issue with their classified data, and secondly in terms of time and being efficient, I find it’s far easier to start again with a clean slate, and I can apply my own standards which will be consistent and accurate.

Duplicates

Aaah, the dreaded duplicates.

These can appear in many forms, from duplicate invoices to customer/supplier records, to orders, to products, and much much more. They cause multiple records which could mean the information is split between the two, resulting in you only seeing part of the picture. And then there are the near duplicates, don’t even get me started on those. In business this could be PWC, P.W.C, or with personal information, this could be Robert Smith and Bob Smith.

The consequences of dirty data

So, why does it matter your data has some of these issues? Who’s going to notice? Well, like any problem, it’s manageable when it’s small, but gone unnoticed or left to fester, it can become a really big issue.

What if your car started making a rattling noise? Then the check engine light came on… you wouldn’t leave that to deteriorate, would you? You definitely wouldn’t decide to go on a long road trip with it and risk being stranded in the middle of nowhere… and you shouldn’t be making big business decisions based on unclassified, poor quality data.

And, I hate to break it to you, but there’s no quick fix, magic button, or special software out there that can magically fix your data. It requires hard work, patience, and a commitment to better data accuracy.

Yes, there are some software solutions out there that can help you manage and clean your data, but you will need a person to verify this, and this only works on certain data, such as address and name verification and formatting, this is not going to help with the classification of spend data or normalisation.

Want to know more about dirty data and how you can fix it for yourself? Susan’s book ‘Between the Spreadsheets: Classifying and Fixing Dirty Data’ is out now.

You can also get in touch with Susan with any enquires at susan@theclassificationguru.com.

Joins us at the housing Data and Analytics Summit where Susan will discuss the Dangers of Dirty Data.

Back to basics: what exactly is dirty data?