Clean and preprocess the house price dataset from an Excel file, preparing it for further analysis or modeling by removing invalid, missing, or non-numeric data.
- Input: `HousePricePrediction.xlsx` (Excel file, Sheet1)
- Output: `Cleaned_HousePricePrediction.xlsx` (cleaned and saved Excel file)
- Initial Exploration: Display dataset info and summary statistics.
- Filter Invalid Records: Remove rows with zero or negative values in the critical columns `LotArea`, `YearBuilt`, `YearRemodAdd`, `TotalBsmtSF`, and `SalePrice`.
- Handle Missing Data: Drop rows containing any NaN values.
- Keep Numeric Columns Only: Remove all non-numeric columns so that only numeric data remains for modeling.
- Save Cleaned Dataset: Export the processed data to `Cleaned_HousePricePrediction.xlsx`.
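
The steps above can be sketched roughly as follows. This is a minimal illustration and not necessarily how `house_price_cleaning.py` is written: the function names `clean_house_prices` and `clean_file` and the constant `CRITICAL_COLUMNS` are assumptions made for this example, and the sketch assumes the critical columns are already stored as numbers.

```python
import pandas as pd

# Columns listed above as critical; their values must be strictly positive.
CRITICAL_COLUMNS = ["LotArea", "YearBuilt", "YearRemodAdd", "TotalBsmtSF", "SalePrice"]


def clean_house_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the three filtering steps to a DataFrame."""
    # Filter invalid records: keep rows where every critical column is > 0.
    # A NaN compared with 0 is False, so rows missing a critical value go too.
    # Assumes the critical columns are numeric; text values would raise here.
    positive = (df[CRITICAL_COLUMNS] > 0).all(axis=1)
    cleaned = df[positive]
    # Handle missing data: drop rows containing any NaN value in any column.
    cleaned = cleaned.dropna()
    # Keep numeric columns only.
    cleaned = cleaned.select_dtypes(include="number")
    return cleaned.reset_index(drop=True)


def clean_file(file_path: str,
               output_path: str = "Cleaned_HousePricePrediction.xlsx") -> pd.DataFrame:
    """Hypothetical wrapper: read the workbook, clean it, report, and save."""
    df = pd.read_excel(file_path, sheet_name="Sheet1")
    # Initial exploration: dataset info and summary statistics.
    df.info()
    print(df.describe())
    cleaned = clean_house_prices(df)
    print(f"Records remaining after cleaning: {len(cleaned)} of {len(df)}")
    # Writing .xlsx files requires openpyxl to be installed.
    cleaned.to_excel(output_path, index=False)
    return cleaned
```

Under these assumptions, calling `clean_file("HousePricePrediction.xlsx")` would run the whole sequence. Keeping the filtering in `clean_house_prices` separate from the file I/O makes it possible to try the logic on a small in-memory DataFrame first.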
- Place your input Excel file at the path the script reads from (`file_path`).
- Install dependencies: `pip install pandas openpyxl`
- Run the cleaning script: `python house_price_cleaning.py`
- A clean, preprocessed Excel file: `Cleaned_HousePricePrediction.xlsx`.
- Console report showing the count of records remaining after cleaning.
- Automates tedious data cleaning steps.
- Ensures dataset integrity by removing invalid or missing data.
- Produces a numeric-only dataset with no missing values, ready for machine learning models or further analysis.
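
One way to confirm the integrity claim on a cleaned file is a small check like the one below. It is only a sketch: `find_integrity_problems` is a hypothetical helper, not part of the cleaning script, and it tests just the three properties described in this README (no missing values, numeric columns only, positive values in the critical columns).

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Same critical columns as listed in the cleaning steps.
CRITICAL_COLUMNS = ["LotArea", "YearBuilt", "YearRemodAdd", "TotalBsmtSF", "SalePrice"]


def find_integrity_problems(df: pd.DataFrame) -> list[str]:
    """Hypothetical check: return a list of problems; empty means none found."""
    problems = []
    # No cell should be missing after cleaning.
    if df.isna().any().any():
        problems.append("missing values present")
    # Every remaining column should be numeric.
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    # Critical columns should exist and contain only positive values.
    for col in CRITICAL_COLUMNS:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif is_numeric_dtype(df[col]) and (df[col] <= 0).any():
            problems.append(f"non-positive values in {col}")
    return problems
```

For example, `find_integrity_problems(pd.read_excel("Cleaned_HousePricePrediction.xlsx"))` should return an empty list if the file was cleaned as described.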