US Household Data Cleaning Project

  • Writer: Damian Owiredu
  • Jun 13
  • 1 min read

In this project, I walk through the process I used to clean the raw U.S. household income data.






🎯 Purpose

To clean and prepare a large dataset of U.S. household income data for analysis. The goal was to fix inconsistencies, remove duplicates, and ensure data accuracy before it could be used in any reporting or BI tools.


Let's take a look at the data:

[Screenshot: preview of the raw data]

🛠️ Key Steps & Code Evidence

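  • Fixed a malformed column name (the imported id column carried stray leading characters, likely from the source file's encoding)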

ALTER TABLE us_project.us_household_income_statistics RENAME COLUMN `id` TO `id`;
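To see what a malformed column name actually contains before renaming it, one option is to list the column names next to their byte values. This is only a sketch: it assumes MySQL's information_schema is readable and reuses the schema and table names from the statement above; the alias name_hex is a hypothetical name chosen for illustration.

```sql
-- List each column name beside its hexadecimal bytes so hidden characters become visible.
-- Assumption: schema and table names match the ALTER TABLE statement above.
SELECT COLUMN_NAME,
       HEX(COLUMN_NAME) AS name_hex   -- a clean 'id' shows as 6964
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'us_project'
  AND TABLE_NAME = 'us_household_income_statistics'
ORDER BY ORDINAL_POSITION;
```

Any bytes that appear in front of 6964 on the id row would be the stray characters that the rename is meant to remove.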

  • Identified and removed duplicate IDs

SELECT id, COUNT(id) 
FROM us_project.us_household_income
GROUP BY id
HAVING COUNT(id) > 1;

[Screenshot: the duplicate check returns an empty result]

The screenshot shows that this query now returns no rows: the duplicates that were previously in the data have already been removed by the following query.





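-- Number the rows within each id; every row after the first is a duplicate.
DELETE FROM us_project.us_household_income
WHERE row_id IN (
	SELECT row_id
	FROM (
		SELECT row_id, id,
		-- ORDER BY row_id keeps the lowest row_id in each group
		ROW_NUMBER() OVER (PARTITION BY id ORDER BY row_id) AS row_num
		FROM us_project.us_household_income
	) AS duplicates
	WHERE row_num > 1
);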
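Before running a delete like this, it can help to preview which rows the ROW_NUMBER() filter selects. The sketch below does that on a small scratch table so it can be tried without the real data. It assumes MySQL 8.0 or later (for window functions); the table demo_income and its sample values are hypothetical, and rows are ordered by row_id so that the lowest row_id in each group is numbered 1.

```sql
-- Hypothetical scratch table; demo_income and its values are illustrative only.
CREATE TEMPORARY TABLE demo_income (
    row_id INT PRIMARY KEY,
    id     INT
);

INSERT INTO demo_income (row_id, id) VALUES
    (1, 100), (2, 100), (3, 200), (4, 300), (5, 300);

-- Preview: rows with row_num > 1 are the extra copies a DELETE would remove.
SELECT row_id, id, row_num
FROM (
    SELECT row_id, id,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY row_id) AS row_num
    FROM demo_income
) AS numbered
WHERE row_num > 1;
-- Returns row_id 2 (id 100) and row_id 5 (id 300), each with row_num = 2.
```

Running the same SELECT against the real table first gives a list of row_id values to sanity-check before anything is deleted.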
  • Corrected spelling and category inconsistencies

UPDATE us_project.us_household_income
SET State_Name = 'Georgia'
WHERE State_Name = 'georia';

[Screenshot: the table after the update]

The screenshot shows that the query above corrected the spelling error and that the change has been applied to the table.
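One way to spot misspellings like this in the first place is to count rows per distinct spelling and look for outliers. This is a sketch rather than the exact query used in the project; the alias row_count is a hypothetical name, and it assumes the column's default case-insensitive collation, under which spellings that differ only in letter case are grouped together.

```sql
-- Count rows per distinct state name; misspellings tend to appear as low-count outliers.
SELECT State_Name, COUNT(*) AS row_count
FROM us_project.us_household_income
GROUP BY State_Name
ORDER BY row_count ASC, State_Name;
```

A name that appears only once or twice next to a correctly spelled state with many rows is a likely candidate for an UPDATE.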







UPDATE us_project.us_household_income
SET Type = 'Borough'
WHERE Type = 'Boroughs';
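A cautious way to apply a category fix like this is to run it inside a transaction and check the result before making it permanent. The sketch below assumes the table uses a transactional storage engine such as InnoDB (MySQL's default); the alias row_count is a hypothetical name.

```sql
START TRANSACTION;

UPDATE us_project.us_household_income
SET Type = 'Borough'
WHERE Type = 'Boroughs';

-- Check the category counts before deciding whether to keep the change.
SELECT Type, COUNT(*) AS row_count
FROM us_project.us_household_income
GROUP BY Type
ORDER BY Type;

-- Run exactly one of the following:
-- COMMIT;    -- keep the change
ROLLBACK;     -- undo the change
```

If 'Boroughs' no longer appears in the counts and nothing else looks off, replace the ROLLBACK with COMMIT.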

💡 Impact

  • Created a clean, reliable dataset ready for EDA and dashboard use.

  • Improved trust in data-driven insights by resolving integrity issues.

 
 
 



CONTACT ME

  • Email: damianowiredu4@gmail.com

  • Phone Number: +447482555448

  • LinkedIn

  • GitHub
