This project involves cleaning and standardizing the Layoffs 2022 dataset from Kaggle. The dataset includes information about layoffs in various companies globally, and the goal was to clean the data to make it usable for further analysis.
- Source: Kaggle - Layoffs 2022 Dataset
- Description: The dataset contains fields such as `company`, `industry`, `total_laid_off`, `percentage_laid_off`, `date`, `location`, `stage`, `country`, and `funds_raised_millions`.
The objective of this project was to perform a comprehensive data cleaning process on the layoffs dataset. The key tasks included:
- Removing duplicates
- Standardizing data
- Handling null values
- Preparing the dataset for further analysis
Created a staging table (`layoffs_staging`) to work on data cleaning while keeping the raw data intact for reference.
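A minimal sketch of this step, assuming the raw data was imported into a table named `layoffs` (that name is an assumption; only `layoffs_staging` is named in this README):

```sql
-- Assumption: the raw import lives in a table called `layoffs`.
-- Copy its column definitions first, then its rows,
-- so every later cleaning step runs against the copy.
CREATE TABLE layoffs_staging LIKE layoffs;

INSERT INTO layoffs_staging
SELECT *
FROM layoffs;
```

Working on a copy means any cleaning step can be redone from scratch by dropping and recreating the staging table.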
Checked for duplicates using the `ROW_NUMBER()` window function with `PARTITION BY`. Duplicates were identified and removed from the staging table.
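One way this check could look in MySQL 8.0 or later is sketched below. It partitions on every column, so only fully identical rows count as duplicates. The helper table `layoffs_dedup` and the column name `row_num` are hypothetical names chosen for illustration, and the actual script may remove duplicates differently.

```sql
-- Preview: any row with row_num > 1 repeats an earlier identical row.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry, total_laid_off,
                            percentage_laid_off, `date`, stage, country,
                            funds_raised_millions
           ) AS row_num
    FROM layoffs_staging
)
SELECT *
FROM ranked
WHERE row_num > 1;

-- Keep only the first row of each group in a helper table (hypothetical name).
CREATE TABLE layoffs_dedup AS
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry, total_laid_off,
                            percentage_laid_off, `date`, stage, country,
                            funds_raised_millions
           ) AS row_num
    FROM layoffs_staging
) AS ranked
WHERE row_num = 1;

-- Swap the helper table in so the staging table keeps its name.
-- The temporary row_num column remains and can be dropped later.
DROP TABLE layoffs_staging;
RENAME TABLE layoffs_dedup TO layoffs_staging;
```

The helper table is used because MySQL does not allow deleting rows through a CTE, and the staging table has no key that would identify which copy of a duplicate to remove.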
Standardized various columns, including:
- Industry: Consolidated multiple entries of the same industry (e.g., "Crypto Currency" and "CryptoCurrency" were standardized to "Crypto").
- Country: Corrected inconsistencies in country names (e.g., "United States." to "United States").
- Date: Converted string-formatted dates to the `DATE` data type.
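The three fixes above might be written roughly as follows. This is a sketch, not necessarily the statements in the script: the `LIKE` patterns are guesses at how the variants were matched, and the `'%m/%d/%Y'` format string assumes the dates were stored as month/day/year text.

```sql
-- MySQL Workbench's safe update mode may reject UPDATEs whose WHERE clause
-- does not use a key column; SET SQL_SAFE_UPDATES = 0; disables it for the session.

-- Industry: collapse every variant that starts with "Crypto" into one label.
UPDATE layoffs_staging
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Country: strip the trailing period from "United States.".
UPDATE layoffs_staging
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';

-- Date: parse the text values (assumed format: month/day/year) ...
UPDATE layoffs_staging
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- ... then change the column type so the values are stored as real dates.
ALTER TABLE layoffs_staging
MODIFY COLUMN `date` DATE;
```

Running the `UPDATE` before the `ALTER TABLE` matters: the column can only be converted to `DATE` once every value is already in a form MySQL recognizes as a date.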
Identified null values and handled them accordingly. For example, populated null values in the `industry` column by referencing non-null values for the same company. Kept null values in key columns like `total_laid_off` for further analysis.
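Filling `industry` from other rows of the same company can be sketched as a self-join update. The aliases `t1` and `t2` are illustrative, and matching on `company` alone is an assumption.

```sql
-- For each row with a missing industry, look for another row of the same
-- company that has one, and copy that value across.
UPDATE layoffs_staging AS t1
JOIN layoffs_staging AS t2
    ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;
```

Companies that appear only once, or whose rows all lack an industry, stay null after this step because there is no value to copy from.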
Removed rows with no useful data (e.g., where both `total_laid_off` and `percentage_laid_off` were null) and dropped temporary columns.
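A sketch of this final cleanup, assuming the temporary column is a row-number helper called `row_num` (a hypothetical name; the README does not name the temporary columns):

```sql
-- Rows with neither a layoff count nor a percentage carry no usable signal.
DELETE FROM layoffs_staging
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;

-- Assumption: row_num is the temporary helper column left over from deduplication.
ALTER TABLE layoffs_staging
DROP COLUMN row_num;
```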
The dataset was cleaned, standardized, and prepared for exploratory data analysis (EDA) and modeling. The cleaned dataset can now be used to analyze layoff trends.
- SQL: MySQL
- Tools: MySQL Workbench
- Clone this repository.
- Import the raw dataset into your MySQL environment.
- Execute the SQL scripts in the `data_cleaning.sql` file to clean the data.
- Once cleaned, you can proceed with your analysis or export the cleaned data for further processing.
- Dataset sourced from Kaggle.