This repository aims to provide a curated list of well-known and useful datasets for practicing and learning regression modeling. Each dataset includes a brief description, its characteristics, suitable regression types, complexity, and potential applications.
- Source: World Health Organization and United Nations
- Download Link: Kaggle - Life Expectancy (WHO)
- Characteristics: Comprehensive dataset with 2938 rows and 22 columns. Includes factors affecting life expectancy such as country, year, developing status, adult mortality, infant deaths, alcohol consumption, health expenditure, immunization coverage, BMI, deaths due to HIV/AIDS, GDP, population, body condition, income, and education.
- Type of Regression: Primarily suitable for Multiple Linear Regression due to multiple independent variables influencing life expectancy. Can also be used for time-series regression if analyzed over time.
- Complexity: Medium to High. The dataset size is moderate, but the presence of 22 features, including both numerical and categorical data, adds complexity. Handling missing values and feature engineering for categorical variables would be key challenges.
- Applications: Ideal for predicting life expectancy based on various health and socio-economic indicators, and for public health policy analysis.
- Source: Commonly found on platforms like Kaggle.
- Download Link: Kaggle - Fish Market
- Characteristics: Contains detailed metrics for various fish species, including weight, length, height, and width.
- Type of Regression: Suitable for Multiple Linear Regression to predict one fish metric based on others. Can also be used for simple linear regression if focusing on two variables.
- Complexity: Low to Medium. Relatively small number of features, making it good for beginners. Potential for non-linear relationships might increase complexity.
- Applications: Excellent for multiple linear regression and multivariate analysis to understand relationships between fish dimensions and market prices, or environmental factors on fish size.
- Source: cancer.gov, clinicaltrials.gov, and American Community Survey
- Download Link: Kaggle - OLS Regression Challenge
- Characteristics: Predicts cancer mortality rates for US counties. Includes death rates, reported cases, county name, income per county, population, demographics, and more.
- Type of Regression: Primarily Multiple Linear Regression due to multiple demographic and health-related features. Can involve geographical regression if location data is used.
- Complexity: Medium to High. Involves a mix of numerical and categorical data, and potentially spatial data. Data cleaning and feature selection could be significant.
- Applications: Suitable for predicting cancer mortality rates and understanding contributing factors, and for public health research.
- Source: UCI Machine Learning Repository
- Download Link: UCI Machine Learning Repository - Wine Quality or Kaggle - Red Wine Quality
- Characteristics: Provides information about the chemical properties of different types of wine (acidity, residual sugar, pH, alcohol content) and their correlation with overall quality.
- Type of Regression: Ideal for Multiple Linear Regression to predict wine quality. Can also be used for polynomial regression if non-linear relationships are suspected.
- Complexity: Low to Medium. A relatively clean dataset with mostly numerical features, making it accessible for learning regression techniques.
- Applications: Can be used for regression modeling to predict wine quality based on chemical profiles, and for classification tasks.
- Source: CarDekho.com
- Download Link: Kaggle - CarDekho Used Car Dataset
- Characteristics: Contains information about cars and motorcycles, including model, year, selling price, showroom price, kilometres driven, fuel type, seller type, transmission, and number of previous owners.
- Type of Regression: Primarily Multiple Linear Regression for price prediction. May involve handling categorical features (fuel type, seller type, transmission) requiring one-hot encoding.
- Complexity: Medium. A good mix of numerical and categorical features. Feature engineering for 'year' and 'kilometres driven' might be necessary.
- Applications: Ideal for price prediction of vehicles, and for understanding factors influencing vehicle resale value.
- Source: cancer.gov
- Download Link: Kaggle - Cancer Mortality Rates (This is the same as OLS Regression Challenge, as it's a common dataset for this purpose)
- Characteristics: Provides a comprehensive picture of cancer-related mortality in the United States. Accompanied by detailed walkthroughs for data sourcing, preparation, exploratory study, model selection, diagnostics, and interpretation.
- Type of Regression: Primarily Multiple Linear Regression. Similar to the OLS Regression Challenge, but with potentially more detailed guidance for analysis.
- Complexity: Medium to High. Requires careful data preparation and understanding of medical and demographic factors.
- Applications: Suitable for understanding cancer mortality trends, identifying risk factors, and developing predictive models for public health policy.
- Source: A common dataset for regression analysis.
- Download Link: Kaggle - Real Estate Price Prediction
- Characteristics: Includes data on date of purchase, house age, location, distance to nearest MRT station, and house price per unit area.
- Type of Regression: Excellent for Multiple Linear Regression. Can also be used for geographical regression if specific coordinates are available.
- Complexity: Medium. Involves both numerical and potentially spatial features. Feature engineering for 'date of purchase' and 'location' could be important.
- Applications: Designed for regression analysis, linear regression, multiple regression, and building prediction models for real estate prices.
- Source: Inspired by Brett Lantz’s book Machine Learning with R
- Download Link: Kaggle - Medical Cost Personal Datasets
- Characteristics: Contains medical information and costs billed by health insurance companies (1338 rows). Includes age, gender, BMI, children, smoker, region, and insurance charges.
- Type of Regression: Primarily Multiple Linear Regression. Involves a mix of numerical and categorical features (gender, smoker, region).
- Complexity: Medium. Requires handling categorical variables and understanding their impact on insurance charges.
- Applications: Useful for predicting medical insurance costs based on individual characteristics, and for actuarial analysis.
- Source: Likely from financial data providers.
- Download Link: Kaggle - New York Stock Exchange (NYSE)
- Characteristics: Extensive historical data meticulously organized into four distinct CSV files: prices, prices-split-adjusted, securities, and fundamentals. Spans significant periods.
- Type of Regression: Primarily Time Series Regression for predicting stock prices. Can also involve multiple linear regression for fundamental analysis.
- Complexity: High. Involves large volumes of time-series data, requiring specialized techniques for handling temporal dependencies and financial market volatility. Feature engineering from financial statements is also complex.
- Applications: A rich resource for analyzing market dynamics, corporate performance, and predicting stock prices, and for financial modeling.