What if my data isn't clean?

No company has the data quality it wants. None. But that doesn't mean you can't do anything with what you have.

We begin by recognizing that in the middle of the bad data there is good-enough data, and that some bad data is easier to fix than the rest.

What we do is build something called a "Domain data product", which is a fancy way of saying that we build a table (or a few) that holds all the trustworthy data, so that it can be used while more data is made trustworthy.

What is a Domain data product? Why build one?

Traditionally, companies have normalized their tables: if many tables repeat the same information (the address of a client, for example), the repetition is eliminated and the information is stored only once, in a lookup table that can be joined with other information.

Normalized information is easier to manage and cheaper to store and process… but anyone who needs to use it has to know which table to join with which, using which filters, while taking care of the details.

A domain data product is a single denormalized table (or a few) designed to eliminate joins, improve query speed, and simplify analysis. Optimized for modern cloud warehouses like BigQuery, it trades storage redundancy for faster performance and easier access, making it well suited to analytics and AI-driven insights.
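To make the trade-off concrete, here is a minimal sketch of denormalization using Python's built-in sqlite3 module as a stand-in for a cloud warehouse. The schema, the table names (such as customer_orders_wide) and the sample rows are all hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical normalized schema: orders, customers and addresses live apart.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE addresses (address_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, address_id INTEGER);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);

    INSERT INTO addresses VALUES (1, 'Madrid'), (2, 'Bogota');
    INSERT INTO customers VALUES (10, 'Ana', 1), (11, 'Luis', 2);
    INSERT INTO orders VALUES (100, 10, 25.0), (101, 10, 40.0), (102, 11, 15.0);

    -- Denormalize once: every order row carries its customer name and city.
    CREATE TABLE customer_orders_wide AS
    SELECT o.order_id, o.amount, c.customer_id, c.name AS customer_name, a.city
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN addresses a ON a.address_id = c.address_id;
""")

def revenue_by_city(connection):
    """Query the wide table directly, with no joins to remember."""
    rows = connection.execute(
        "SELECT city, SUM(amount) FROM customer_orders_wide GROUP BY city ORDER BY city"
    ).fetchall()
    return dict(rows)

print(revenue_by_city(conn))
```

The joins are written once, when the wide table is built. The city is repeated on every order row, which is the storage redundancy being traded for simpler queries.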

But simplicity and ease of access are just one part. Having an intermediate table makes it simpler to separate credible data from untrustworthy data, and to create data cleaning processes that may be imperfect but are still useful.

For example, when working with customer information, we jokingly say that instead of the 360° view of the customer that never materializes, we actually build the 270° view of the customer. Not perfect, but it adds massive value.

This approach to quality is based on Google's One Big Table (OBT) methodology, usually known in Spanish as "Tabla Ancha" or "Tablón".
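As a rough illustration of separating credible from untrustworthy records, the sketch below applies a cheap, imperfect cleaning step and then routes each record either to a trusted set or to a quarantine. The rules (a non-empty name and a plausible email) and all function names are assumptions made for this example, not a prescribed rule set.

```python
import re

# Hypothetical validation rule; real rules would be agreed with the business.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def clean_record(record):
    """Apply cheap, imperfect fixes: trim whitespace and lowercase the email."""
    cleaned = dict(record)
    cleaned["name"] = (record.get("name") or "").strip()
    cleaned["email"] = (record.get("email") or "").strip().lower()
    return cleaned

def is_trustworthy(record):
    """A record is good enough when it has a name and a plausible email."""
    return bool(record["name"]) and bool(EMAIL_PATTERN.match(record["email"]))

def split_by_trust(records):
    """Return (trusted, quarantined) lists after cleaning each record."""
    trusted, quarantined = [], []
    for record in records:
        cleaned = clean_record(record)
        (trusted if is_trustworthy(cleaned) else quarantined).append(cleaned)
    return trusted, quarantined

raw_customers = [
    {"name": " Ana ", "email": "ANA@Example.com "},
    {"name": "Luis", "email": "not-an-email"},
    {"name": "", "email": "ghost@example.com"},
]
trusted, quarantined = split_by_trust(raw_customers)
print(len(trusted), "trusted,", len(quarantined), "quarantined")
```

Only the trusted records would feed the domain data product. The quarantined ones stay available for later fixes instead of blocking the use case.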

How are Domain data products built?

The process to create domain data products has five steps and involves the Business, BI or Planning team, Data Science and IT.

Why not just wait for the quality of data to improve? Why do this "patch"?

The traditional approach to data quality, now known as "left to right", addresses quality issues at each stage of the data pipeline, starting from the source (left) and moving toward consumption (right). This progressive strategy focuses on fixing problems as early as possible in the data lifecycle rather than attempting to clean data at the end.

The traditional approach is tidy and future-proof, but it almost always runs into roadblocks, delays and changes in priority, which commonly result in low stakeholder satisfaction and reduced budgets. In short… it seldom actually works.

Right to left (R2L) starts from the use case. Its biggest differences are that:
a. It limits the data to be governed and improved to what is strictly necessary for that specific use case
b. It prioritizes data that is already good enough or that can be easily improved
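
The two points above can be sketched as a small prioritization routine: look only at the fields the use case requires, then sort them by how usable they already are. Completeness is used here as a simple stand-in for quality, and the thresholds, field names and the email campaign use case are all illustrative assumptions.

```python
def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)

def prioritize_fields(rows, required_fields, good_enough=0.9, fixable=0.6):
    """Classify only the fields the use case needs; thresholds are illustrative."""
    plan = {"use_now": [], "fix_first": [], "defer": []}
    for field in required_fields:
        score = completeness(rows, field)
        if score >= good_enough:
            plan["use_now"].append(field)
        elif score >= fixable:
            plan["fix_first"].append(field)
        else:
            plan["defer"].append(field)
    return plan

rows = [
    {"customer_id": 1, "email": "a@example.com", "phone": "", "birthdate": None},
    {"customer_id": 2, "email": "b@example.com", "phone": "555-0100", "birthdate": None},
    {"customer_id": 3, "email": "", "phone": "555-0101", "birthdate": "1990-01-01"},
]
# Hypothetical use case: an email campaign that needs only three fields.
plan = prioritize_fields(rows, ["customer_id", "email", "phone"])
print(plan)
```

The birthdate field is poorly filled, but it is never assessed because the use case does not need it. That is the point of limiting the scope.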

R2L has proven effective at creating a positive reinforcement loop, increasing engagement and encouraging further investment in data quality.