Famous and Good Datasets for Regression Analysis

This repository aims to provide a curated list of well-known and useful datasets for practicing and learning regression modeling. Each dataset includes a brief description, its characteristics, suitable regression types, complexity, and potential applications.

Datasets:

1. WHO Statistics on Life Expectancy

Source: World Health Organization and United Nations
Download Link: Kaggle - Life Expectancy (WHO)
Characteristics: Comprehensive dataset with 2938 rows and 22 columns. Includes factors affecting life expectancy such as country, year, developing status, adult mortality, infant deaths, alcohol consumption, health expenditure, immunization coverage, BMI, deaths due to HIV/AIDS, GDP, population, body condition, income, and education.
Type of Regression: Primarily suitable for Multiple Linear Regression due to multiple independent variables influencing life expectancy. Can also be used for time-series regression if analyzed over time.
Complexity: Medium to High. The dataset size is moderate, but the presence of 22 features, including both numerical and categorical data, adds complexity. Handling missing values and feature engineering for categorical variables would be key challenges.
Applications: Ideal for predicting life expectancy based on various health and socio-economic indicators, and for public health policy analysis.

2. Fish Market Dataset

Source: Commonly found on platforms like Kaggle.
Download Link: Kaggle - Fish Market
Characteristics: Contains detailed metrics for various fish species, including weight, length, height, and width.
Type of Regression: Suitable for Multiple Linear Regression to predict one fish metric based on others. Can also be used for simple linear regression if focusing on two variables.
Complexity: Low to Medium. Relatively small number of features, making it good for beginners. Potential for non-linear relationships might increase complexity.
Applications: Excellent for multiple linear regression and multivariate analysis to understand relationships between fish dimensions and market prices, or environmental factors on fish size.

3. OLS Regression Challenge (Cancer Mortality Rates)

Source: cancer.gov, clinicaltrials.gov, and American Community Survey
Download Link: Kaggle - OLS Regression Challenge
Characteristics: Predicts cancer mortality rates for US counties. Includes death rates, reported cases, county name, income per county, population, demographics, and more.
Type of Regression: Primarily Multiple Linear Regression due to multiple demographic and health-related features. Can involve geographical regression if location data is used.
Complexity: Medium to High. Involves a mix of numerical and categorical data, and potentially spatial data. Data cleaning and feature selection could be significant.
Applications: Suitable for predicting cancer mortality rates and understanding contributing factors, and for public health research.

4. Red Wine Quality Dataset

Source: UCI Machine Learning Repository
Download Link: UCI Machine Learning Repository - Wine Quality or Kaggle - Red Wine Quality
Characteristics: Provides information about the chemical properties of different types of wine (acidity, residual sugar, pH, alcohol content) and their correlation with overall quality.
Type of Regression: Ideal for Multiple Linear Regression to predict wine quality. Can also be used for polynomial regression if non-linear relationships are suspected.
Complexity: Low to Medium. A relatively clean dataset with mostly numerical features, making it accessible for learning regression techniques.
Applications: Can be used for regression modeling to predict wine quality based on chemical profiles, and for classification tasks.

5. Vehicle Dataset from CarDekho

Source: CarDekho.com
Download Link: Kaggle - CarDekho Used Car Dataset
Characteristics: Contains information about cars and motorcycles, including model, year, selling price, showroom price, kilometres driven, fuel type, seller type, transmission, and number of previous owners.
Type of Regression: Primarily Multiple Linear Regression for price prediction. May involve handling categorical features (fuel type, seller type, transmission) requiring one-hot encoding.
Complexity: Medium. A good mix of numerical and categorical features. Feature engineering for 'year' and 'kilometres driven' might be necessary.
Applications: Ideal for price prediction of vehicles, and for understanding factors influencing vehicle resale value.

6. Cancer Linear Regression Dataset

Source: cancer.gov
Download Link: Kaggle - Cancer Mortality Rates (This is the same as OLS Regression Challenge, as it's a common dataset for this purpose)
Characteristics: Provides a comprehensive picture of cancer-related mortality in the United States. Accompanied by detailed walkthroughs for data sourcing, preparation, exploratory study, model selection, diagnostics, and interpretation.
Type of Regression: Primarily Multiple Linear Regression. Similar to the OLS Regression Challenge, but with potentially more detailed guidance for analysis.
Complexity: Medium to High. Requires careful data preparation and understanding of medical and demographic factors.
Applications: Suitable for understanding cancer mortality trends, identifying risk factors, and developing predictive models for public health policy.

7. Real Estate Price Prediction Dataset

Source: A common dataset for regression analysis.
Download Link: Kaggle - Real Estate Price Prediction
Characteristics: Includes data on date of purchase, house age, location, distance to nearest MRT station, and house price per unit area.
Type of Regression: Excellent for Multiple Linear Regression. Can also be used for geographical regression if specific coordinates are available.
Complexity: Medium. Involves both numerical and potentially spatial features. Feature engineering for 'date of purchase' and 'location' could be important.
Applications: Designed for regression analysis, linear regression, multiple regression, and building prediction models for real estate prices.

8. Medical Insurance Costs Dataset

Source: Inspired by Brett Lantz’s book Machine Learning with R
Download Link: Kaggle - Medical Cost Personal Datasets
Characteristics: Contains medical information and costs billed by health insurance companies (1338 rows). Includes age, gender, BMI, children, smoker, region, and insurance charges.
Type of Regression: Primarily Multiple Linear Regression. Involves a mix of numerical and categorical features (gender, smoker, region).
Complexity: Medium. Requires handling categorical variables and understanding their impact on insurance charges.
Applications: Useful for predicting medical insurance costs based on individual characteristics, and for actuarial analysis.

9. New York Stock Exchange Dataset

Source: Likely from financial data providers.
Download Link: Kaggle - New York Stock Exchange (NYSE)
Characteristics: Extensive historical data meticulously organized into four distinct CSV files: prices, prices-split-adjusted, securities, and fundamentals. Spans significant periods.
Type of Regression: Primarily Time Series Regression for predicting stock prices. Can also involve multiple linear regression for fundamental analysis.
Complexity: High. Involves large volumes of time-series data, requiring specialized techniques for handling temporal dependencies and financial market volatility. Feature engineering from financial statements is also complex.
Applications: A rich resource for analyzing market dynamics, corporate performance, and predicting stock prices, and for financial modeling.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Famous and Good Datasets for Regression Analysis

Datasets:

1. WHO Statistics on Life Expectancy

2. Fish Market Dataset

3. OLS Regression Challenge (Cancer Mortality Rates)

4. Red Wine Quality Dataset

5. Vehicle Dataset from CarDekho

6. Cancer Linear Regression Dataset

7. Real Estate Price Prediction Dataset

8. Medical Insurance Costs Dataset

9. New York Stock Exchange Dataset

About

Uh oh!

Releases

Packages

sarfraspc/Famous-Regression-Datasets

Folders and files

Latest commit

History

Repository files navigation

Famous and Good Datasets for Regression Analysis

Datasets:

1. WHO Statistics on Life Expectancy

2. Fish Market Dataset

3. OLS Regression Challenge (Cancer Mortality Rates)

4. Red Wine Quality Dataset

5. Vehicle Dataset from CarDekho

6. Cancer Linear Regression Dataset

7. Real Estate Price Prediction Dataset

8. Medical Insurance Costs Dataset

9. New York Stock Exchange Dataset

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages