Data Cleaning
The first mile of our data transformation process
How We Deliver Clean Data
The first layer of our data transformation process focuses on data cleanliness and consistency. Through modeling techniques and automated testing, we proactively identify and rectify inconsistencies in the data we process, laying a strong foundation for efficient downstream modeling. This page outlines some of the key data cleaning methods we employ.
Schema Alignment
Schema alignment involves integrating and standardizing data structures from various sources, such as advertising platforms, into a unified format. This process ensures that data across different systems can be compared and analyzed holistically. By identifying equivalent data fields, harmonizing data formats, and resolving discrepancies, we deliver clean, consistent, and unified datasets that increase the speed and quality of downstream modeling and BI development.
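The sketch below illustrates the idea in Python with pandas. The platform names, column names, and field mappings are hypothetical stand-ins, not the actual source schemas we process; the point is simply that equivalent fields are renamed into one shared structure before anything downstream touches them.

```python
import pandas as pd

# Hypothetical column mappings for two ad platforms; the real source schemas
# and field names are not specified on this page.
PLATFORM_A_MAP = {"campaignId": "campaign_id", "spend_usd": "spend", "clicks": "clicks"}
PLATFORM_B_MAP = {"campaign": "campaign_id", "amount_spent": "spend", "link_clicks": "clicks"}

def align_schema(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    """Rename source-specific columns to the unified schema and keep only mapped fields."""
    return df.rename(columns=column_map)[list(column_map.values())]

platform_a = pd.DataFrame({"campaignId": ["123"], "spend_usd": [125.0], "clicks": [40]})
platform_b = pd.DataFrame({"campaign": ["456"], "amount_spent": [98.5], "link_clicks": [31]})

# Both frames now share one structure and can be stacked and compared directly.
unified = pd.concat(
    [align_schema(platform_a, PLATFORM_A_MAP), align_schema(platform_b, PLATFORM_B_MAP)],
    ignore_index=True,
)
print(unified)
```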
Data Type Consistency
Maintaining consistency in data types, especially for key identifiers like IDs or UUIDs, is crucial for data integrity. We enforce data type consistency for similar fields across source systems to avoid errors during data processing or downstream analysis.
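As a minimal sketch of what type enforcement buys you, the example below casts assumed identifier columns to strings before joining two sources. Without the cast, the same ID stored as the integer 123 in one system and the string "123" in another would silently fail to match.

```python
import pandas as pd

# Identifier column names here are illustrative assumptions.
ID_COLUMNS = ["campaign_id", "ad_group_id"]

def enforce_id_types(df: pd.DataFrame, id_columns=ID_COLUMNS) -> pd.DataFrame:
    out = df.copy()
    for col in id_columns:
        if col in out.columns:
            # Cast identifiers to a string dtype so joins never mix 123 with "123".
            out[col] = out[col].astype("string")
    return out

source_a = pd.DataFrame({"campaign_id": [123, 456], "spend": [10.0, 20.0]})
source_b = pd.DataFrame({"campaign_id": ["123", "789"], "spend": [5.0, 7.5]})

# After casting, the join matches the row that would otherwise be missed.
joined = enforce_id_types(source_a).merge(enforce_id_types(source_b), on="campaign_id")
print(joined)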
Negative Value Conversion
When values represent deductions or subtractions in calculations, we convert them to negative numbers to maintain logical consistency. This ensures that the computational logic accurately reflects the actual financial or quantitative operations.
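The following sketch shows the pattern; the transaction types treated as deductions here are assumed for illustration rather than taken from this page. Once deduction rows carry a negative sign, a plain sum produces the correct net figure.

```python
import pandas as pd

# Assumed categories that represent deductions; illustrative only.
DEDUCTION_TYPES = {"refund", "chargeback", "discount"}

def sign_amounts(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    is_deduction = out["transaction_type"].str.lower().isin(DEDUCTION_TYPES)
    # Force deduction rows negative so aggregation reflects the real operation.
    out.loc[is_deduction, "amount"] = -out.loc[is_deduction, "amount"].abs()
    return out

transactions = pd.DataFrame({
    "transaction_type": ["sale", "refund", "sale"],
    "amount": [100.0, 20.0, 50.0],
})
print(sign_amounts(transactions)["amount"].sum())  # 130.0 net, rather than 170.0
```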
Implementing Automated Rules
We run thousands of automated tests several times per day to ensure the data we deliver meets predefined standards, such as validating the range of acceptable values or ensuring that entries in a unique identifier field are distinct.
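The two checks below are a simplified sketch of the kinds of rules involved, written as standalone pandas functions with invented column names and thresholds; the actual test suite and tooling are not described on this page.

```python
import pandas as pd

def values_out_of_range(df: pd.DataFrame, column: str, low: float, high: float) -> pd.DataFrame:
    """Rows whose values fall outside the accepted range."""
    return df[~df[column].between(low, high)]

def duplicate_ids(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Rows whose supposedly unique identifier appears more than once."""
    return df[df[column].duplicated(keep=False)]

orders = pd.DataFrame({"order_id": ["a1", "a2", "a2"], "discount_pct": [10, 150, 5]})

# A scheduled run would alert or fail whenever either check returns rows.
print(values_out_of_range(orders, "discount_pct", 0, 100))  # the 150% row
print(duplicate_ids(orders, "order_id"))                    # the two "a2" rows
```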
Standardizing Dates and Times
Date and time standardization ensures that all temporal data across datasets adhere to a consistent format, facilitating easier analysis and comparison of time-based information. While most source systems standardize on UTC, we have encountered several data sources that report only in Eastern Time (ET), which can create unexpected data quality issues if the offset is not handled during cleaning.
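A minimal sketch of the conversion is shown below, assuming a hypothetical "event_time" column of naive timestamps reported in Eastern Time: the values are first localized to the source time zone and then converted to UTC, which also keeps daylight saving transitions correct.

```python
import pandas as pd

def to_utc(df: pd.DataFrame, column: str = "event_time",
           source_tz: str = "America/New_York") -> pd.DataFrame:
    out = df.copy()
    out[column] = (
        pd.to_datetime(out[column])
        .dt.tz_localize(source_tz)  # interpret the naive timestamps as Eastern Time
        .dt.tz_convert("UTC")       # then express them in UTC
    )
    return out

events = pd.DataFrame({"event_time": ["2024-01-15 09:00:00", "2024-07-15 09:00:00"]})
print(to_utc(events))  # 14:00 UTC in winter (EST), 13:00 UTC in summer (EDT)
```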
Lowercasing Inconsistent Values
For textual data, especially computer-generated values, standardizing capitalization conventions by converting all entries to lowercase can reduce complexity and improve data uniformity. This method is particularly useful for data fields that are prone to inconsistent capitalization, ensuring that comparisons and searches are case-insensitive and therefore more reliable.
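As a small illustration with assumed column names, the sketch below lowercases machine-generated text fields so that variants such as "iOS", "IOS", and "ios" collapse into a single value before grouping or filtering.

```python
import pandas as pd

TEXT_COLUMNS = ["device_type", "utm_source"]  # hypothetical machine-generated fields

def lowercase_text(df: pd.DataFrame, columns=TEXT_COLUMNS) -> pd.DataFrame:
    out = df.copy()
    for col in columns:
        if col in out.columns:
            out[col] = out[col].str.lower()
    return out

events = pd.DataFrame({"device_type": ["iOS", "IOS", "ios", "Android"]})
# After lowercasing, grouping yields one "ios" bucket instead of three.
print(lowercase_text(events)["device_type"].value_counts())
```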