US Household Data Cleaning Project
- Damian Owiredu
- Jun 13
- 1 min read
In this project, I walk through the process used for cleaning the raw household data.
Link to code: https://github.com/gl1tchdDev/SQL-Portfolio/blob/main/US%20Household%20Income%20(Data%20Cleaning).sql
🎯 Purpose
To clean and prepare a large dataset of U.S. household income data for analysis. The goal was to fix inconsistencies, remove duplicates, and ensure data accuracy before it could be used in any reporting or BI tools.
Lets take a look at the data

🛠️ Key Steps & Code Evidence
✅ Fixed malformed column name
ALTER TABLE us_project.us_household_income_statistics RENAME COLUMN `id` TO `id`;
✅ Identified and removed duplicate IDs
SELECT id, COUNT(id)
FROM us_project.us_household_income
GROUP BY id
HAVING COUNT(id) > 1;

This screenshot of my code shows that the result of this query is blank. This is because the duplicates that were previously present in the data have been removed by the following query.
DELETE FROM us_household_income
WHERE row_id IN (
SELECT row_id
FROM (
SELECT row_id, id,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY id) row_num
FROM us_project.us_household_income
) duplicates
WHERE row_num > 1
);
✅ Corrected spelling and category inconsistencies
UPDATE us_project.us_household_income
SET State_Name = "Georgia"
WHERE State_Name = "georia";

This shows that the query above has corrected the spelling error and it has been applied to the table.
UPDATE us_household_income
SET Type = "Borough"
WHERE Type = "Boroughs";
💡 Impact
Created a clean, reliable dataset ready for EDA and dashboard use.
Improved trust in data-driven insights by resolving integrity issues.
Comentários