You can access a wealth of marketing-related data – from web analytics and customer journey behavior to competitive analysis and product usage.
However, if the data is not clean, you cannot truly leverage its value. Or worse, you could steer your marketing in the wrong direction and see diminishing returns.
James Hunt, principal consultant at Vivanti, says data cleaning and modeling are essential to extract value from the information and gain knowledge and wisdom. In his talk at the Marketing Analytics and Data Science conference, he details why this is necessary, covers the basics of data cleaning, and explains the role of governance and observability.
What is data modeling?
Data models transform data into something useful, and you need to understand data modeling to choose the best cleaning options. James explains that data modeling involves three parts: additive, context, and domain.
Additive means letting the machines figure out how to standardize the data. You don’t manually “correct” the data by, for example, lowercasing the sporadic capital letters in a table. That would actually be data destruction because, as James says, “We humans are really bad at doing the same thing twice.”
Context organizes the data to tell a story. It doesn’t add any new information; it works with the existing data. For example, the context of a sales transaction might include the marketing emails the buyer saw, the social media content the buyer interacted with, and the other products they viewed.
Domain is the set of all possible data values for a given element. It can be qualitative or quantitative. James points out these five common domain types:
- Identity – a value that uniquely and discretely identifies a person, e.g., an email address, a social security number, or a customer number
- Nominative – a complementary identity that is not strong enough to stand alone, such as a person’s full name or a product name
- Categorical – a grouping along arbitrary boundaries, for example by customer type or industry; often used for cohort segmentation
- Monetary – currency values that can be compared, ordered, aggregated, and disaggregated, e.g., order total or unit price
- Temporal – a point or range of dates and times, e.g., registration date, date of last purchase, or loyalty period
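One way to make these domain types concrete is to tag each column of a data set with its domain before deciding how to clean it. The following is a minimal sketch, not something shown in the talk; the column names (`email`, `full_name`, `industry`, `order_total`, `signup_date`) and the `columns_in` helper are hypothetical.

```python
from enum import Enum


class Domain(Enum):
    """The five common domain types described above."""
    IDENTITY = "identity"
    NOMINATIVE = "nominative"
    CATEGORICAL = "categorical"
    MONETARY = "monetary"
    TEMPORAL = "temporal"


# Hypothetical column names, chosen only for illustration.
COLUMN_DOMAINS = {
    "email": Domain.IDENTITY,
    "full_name": Domain.NOMINATIVE,
    "industry": Domain.CATEGORICAL,
    "order_total": Domain.MONETARY,
    "signup_date": Domain.TEMPORAL,
}


def columns_in(domain: Domain) -> list[str]:
    """Return the column names tagged with the given domain, sorted."""
    return sorted(col for col, d in COLUMN_DOMAINS.items() if d is domain)


print(columns_in(Domain.MONETARY))
```

A tagging like this could then drive which kind of cleaning is considered for each column.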
With this basic understanding of modeling, you are ready to learn data cleaning.
What types of data cleansing are there?
James describes the three types of data cleaning – mechanical, explicit mapping, and patterns and rules:
With mechanical cleaning, the data is cleaned without changing the meaning of the information, e.g., by normalizing the capitalization of names and removing unnecessary spaces. “These are all things that I, as a data engineer, can do on my own and no one gets upset about,” says James. “No one says, ‘Well, you took the spaces out of his first name, so it’s a different person.’”
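As a minimal sketch of mechanical cleaning, the snippet below trims and collapses whitespace and normalizes capitalization. The `mechanical_clean` function and the sample names are hypothetical, and title-casing is an assumption: it mishandles names such as "McDonald", so a real pipeline might normalize to lowercase for comparison instead.

```python
import re


def mechanical_clean(name: str) -> str:
    """Trim the ends, collapse runs of whitespace, and title-case the result."""
    collapsed = re.sub(r"\s+", " ", name.strip())
    return collapsed.title()


# Hypothetical raw values that all refer to the same person.
raw_names = ["  jAMES   hunt ", "James Hunt", "JAMES HUNT"]
cleaned = [mechanical_clean(n) for n in raw_names]
print(cleaned)
```

All three raw values end up as the same string, which is the point: the meaning of the data has not changed, only its form.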
Explicit mapping uses an activity called cardinality reduction to reduce the number of unique values associated with an attribute. It simplifies the data set by grouping values while retaining relevant information. The resulting data sets are more manageable and can improve model performance.
For example, James says a customer status field might have started with two values – active and inactive. Over time, the field expanded to include suspended, deferred, and potential options. An explicit mapping cleanup could map the customer status “suspended” to the value “active.”
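The status example can be sketched as a lookup table. Only the "suspended" to "active" mapping comes from the example above; where "deferred" and "potential" land is an assumption made purely for illustration, and `STATUS_MAP` and `map_status` are hypothetical names.

```python
# Explicit mapping: every known raw status is assigned a target value.
STATUS_MAP = {
    "active": "active",
    "inactive": "inactive",
    "suspended": "active",    # the mapping described in the example
    "deferred": "inactive",   # assumption, for illustration only
    "potential": "inactive",  # assumption, for illustration only
}


def map_status(raw: str) -> str:
    """Map a raw status to its reduced value; unknown statuses raise an error."""
    key = raw.strip().lower()
    if key not in STATUS_MAP:
        raise ValueError(f"Unmapped customer status: {raw!r}")
    return STATUS_MAP[key]


raw_statuses = ["active", "Inactive", "suspended", "deferred", "potential"]
reduced = sorted({map_status(s) for s in raw_statuses})
print(reduced)  # five raw values reduced to two
```

Raising an error on an unmapped value, rather than guessing, keeps each mapping an explicit decision.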
Cleaning with patterns and rules identifies and corrects inconsistencies, inaccuracies, or errors in data based on identifiable structures (i.e., patterns) and constraints (i.e., rules).
Standard patterns apply to data such as email addresses, date strings, and phone numbers. Deviations from the expected structure indicate data that needs to be cleaned.
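A pattern check can be sketched with a regular expression. The pattern below is a deliberately simple assumption (something, then `@`, then something containing a dot); it is not a full email validator, and the sample addresses are made up.

```python
import re

# Simplified email shape: local part, "@", domain containing a dot.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def deviates_from_pattern(value: str) -> bool:
    """Return True when the value does not match the expected email shape."""
    return EMAIL_PATTERN.fullmatch(value) is None


emails = ["ana@example.com", "bob(at)example.com", "carol@example", "dan@example.org"]
flagged = [e for e in emails if deviates_from_pattern(e)]
print(flagged)
```

Values in `flagged` are candidates for cleaning, not automatically wrong.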
Rules refer to logical conditions or restrictions. For example, if the monetary data on an insurance policy exceeds its maximum value, the entry must be cleaned up.
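A rule of this kind is just a logical condition evaluated per record. In the sketch below, the `Policy` fields, the sample records, and the helper name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    policy_id: str
    amount: float     # monetary value recorded on the policy
    max_value: float  # maximum value the policy allows


def violates_max_rule(policy: Policy) -> bool:
    """Rule: the recorded amount must not exceed the policy's maximum value."""
    return policy.amount > policy.max_value


policies = [
    Policy("P-001", amount=40_000, max_value=50_000),
    Policy("P-002", amount=75_000, max_value=50_000),
]
to_clean = [p.policy_id for p in policies if violates_max_rule(p)]
print(to_clean)
```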
James says you can also set rules and patterns to map the customer journey. Let’s say a brand doesn’t care how many times a person opens and clicks on its email. Instead, it wants to find out who is receptive to purchasing through an email marketing campaign. To achieve this goal, rules for cleaning the data could be set.
For example, all emails sent would be marked as “E” and all clicks would be marked as “C”, while an order would be recognized as “O”. These rules reduce data so that it is most helpful to the brand and its marketing goals.
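The E/C/O idea can be sketched as follows. The event names (`email_sent`, `email_click`, `order`, `email_open`) are hypothetical, and collapsing repeated events is an assumption based on the brand not caring how many times someone opens or clicks.

```python
from itertools import groupby

# Only the events the brand cares about get a code; everything else is dropped.
EVENT_CODES = {"email_sent": "E", "email_click": "C", "order": "O"}


def encode_journey(events: list[str]) -> str:
    """Encode a raw event stream as a compact string such as 'ECO'."""
    codes = [EVENT_CODES[e] for e in events if e in EVENT_CODES]
    # Collapse consecutive repeats: how many times an event occurred is irrelevant.
    return "".join(code for code, _ in groupby(codes))


buyer = encode_journey(
    ["email_sent", "email_open", "email_click", "email_click", "order"]
)
non_buyer = encode_journey(["email_sent", "email_sent", "email_open"])
print(buyer, non_buyer)
```

With journeys reduced to short strings, finding people who ordered after clicking an email becomes a simple string match.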
What role does governance play in data cleansing?
“Every time you clean data, you make a decision. You decide what is relevant; you decide what is important. You decide what to keep and what to bring to the surface,” says James.
You must document these data cleaning decisions in an internal repository, such as a spreadsheet, or in a version control system, such as the open-source Git.
Every decision should answer these four questions:
- What decision was made?
- When was it made? This point-in-time reference helps with historical analysis.
- Who made the decision?
- Why was this decision made? The reason helps inform future actions. For example, if the decision was made because of a government update, it probably won’t be possible to reverse it. But if the decision was made because the data team thought it was a better way, reversing course might remain a realistic option, James says.
Let’s go back to the example where the customer status fields are collapsed so that the “Suspended” status has been grouped into “Active” customers. This is how this decision could be recorded:
“Customers with ‘suspended status’ are still considered active as of October 22, 2024. The decision was made by James Hunt because mapping analysis showed that customer behavior was best assessed by active or inactive status.”
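The same record could also be kept in a structured form so it can be stored in a spreadsheet or committed to version control. This is a minimal sketch; the `CleaningDecision` class and its field names are hypothetical, chosen to mirror the four questions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class CleaningDecision:
    what: str   # What decision was made?
    when: date  # When was it made?
    who: str    # Who made the decision?
    why: str    # Why was this decision made?


decision = CleaningDecision(
    what="Customers with 'suspended' status are considered active",
    when=date(2024, 10, 22),
    who="James Hunt",
    why="Mapping analysis showed that customer behavior was best assessed "
        "by active or inactive status",
)

# Convert to a plain dict and serialize the date so the record is JSON-friendly.
record = asdict(decision)
record["when"] = decision.when.isoformat()
print(json.dumps(record, indent=2))
```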
People are crucial to the governance process, says James. Computer-generated algorithms can suggest data cleaning steps, but a human should be in the loop to review the suggestions and approve or reject them.
What is observability?
Even after you set up rules and patterns to ensure clean data, some data will conflict with these parameters. Instead of letting this data pass through or automatically cleaning it, focus on observability, which James says is ten times more important than governance.
Surfacing your data cleaning metadata might look like this example from one of James’ clients. The data cleaning rules set a lower limit on policy size to catch bad data. That worked well for about six months, until a policy entered the system with a size below the floor set in the rules.
James highlighted this record and then asked the customer, “Would you like us to adjust the limit?” The customer said “yes” and the data floor rule was updated.
“We captured that through the observability loop by saying, ‘This is what we expect the data to look like.’ When we cleaned it, it didn’t look like this. We were uncomfortable making this decision (without the customer’s input). And that’s what observability brings to you,” says James.
He notes that proper observability practices can save you hours, days, weeks, months, and a whole lot of embarrassment.
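The policy-floor story can be sketched as a rule that neither passes nor silently fixes a violating record, but sets it aside for a person to review. The `FloorRule` class, the policy IDs, and the amounts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FloorRule:
    minimum: float
    review_queue: list = field(default_factory=list)

    def check(self, policy_id: str, size: float) -> bool:
        """Return True if the record passes; otherwise queue it for human review."""
        if size < self.minimum:
            self.review_queue.append((policy_id, size))
            return False
        return True


rule = FloorRule(minimum=10_000)
for policy_id, size in [("P-1", 25_000), ("P-2", 7_500)]:
    rule.check(policy_id, size)

# The flagged record is surfaced to the customer instead of being auto-cleaned.
flagged = list(rule.review_queue)
print(flagged)

# If the customer approves, the floor is adjusted (a decision worth documenting).
rule.minimum = 7_500
```

The key design choice is that the rule reports what it could not decide; changing the floor stays a human decision.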
Are you ready for data cleansing?
Now that you’ve learned about data modeling, cleaning, governance, and observability, you can apply them to your marketing when you have:
- Data sets where the integrity of the data is not pristine or perfect
- Datasets with a high number of unique values (i.e. where cardinality reduction can help in processing and analysis)
Where would you find this data? It could come from a variety of sources such as:
- CRM platforms
- Customer contact records
- Customer surveys and feedback forms
- Survey responses
- Web analytics
- Customer behavior
- Product or platform information
- Competitor analysis
Start with those that would benefit most from one or more of the three types of data cleansing, proper governance, and observability. You can then decide whether you want to bring in data teams within your organization to help.
Cover image by Joseph Kalinowski/Content Marketing Institute