Data-analytics

Python code for machine learning, statistics, bioinformatics, and epidemiology


Overview

This repository contains Python scripts and notebooks for analyzing large-scale datasets, with a focus on applications in:

  • Machine learning
  • Statistics
  • Data cleaning and structuring
  • Visualization
  • Bioinformatics
  • Infectious diseases and epidemiology

The goal is to provide a resource for researchers, data scientists, and students working in computational biology, genomics, molecular diagnostics, and related fields.


Features

  • Machine Learning & Modeling: Scripts for supervised and unsupervised learning, feature engineering, model evaluation, and explainability (including SHAP, feature importance, and more).

  • Bioinformatics: Tools for processing sequencing data, variant analysis, resistance gene identification, and genomics analytics.

  • Statistics & Data Cleaning: Utilities for data wrangling, normalization, deduplication, and advanced statistical analysis.

  • Epidemiological Analysis: Code for cohort analysis, prevalence and incidence calculations, interaction and association testing, and public health surveillance analytics.

  • Visualization: Publication-ready plots and multi-panel figures using Matplotlib, Seaborn, and other libraries.
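As an illustration of the prevalence and incidence calculations mentioned under Epidemiological Analysis, here is a minimal sketch using only the Python standard library. It is not taken from the repository's scripts: the function names (`prevalence`, `incidence_rate`, `wilson_interval`) and the choice of a Wilson score interval are assumptions made for this example, and the numbers in the usage lines are made up.

```python
import math


def prevalence(cases: int, population: int) -> float:
    """Point prevalence: existing cases divided by the population examined."""
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0 <= cases <= population:
        raise ValueError("cases must be between 0 and population")
    return cases / population


def incidence_rate(new_cases: int, person_time: float) -> float:
    """Incidence rate: new cases per unit of person-time at risk."""
    if person_time <= 0:
        raise ValueError("person_time must be positive")
    if new_cases < 0:
        raise ValueError("new_cases must be non-negative")
    return new_cases / person_time


def wilson_interval(cases: int, population: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = prevalence(cases, population)
    denom = 1 + z**2 / population
    centre = (p + z**2 / (2 * population)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / population + z**2 / (4 * population**2)
    ) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    print(f"Prevalence: {prevalence(25, 500):.3f}")
    print(f"Incidence per 1,000 person-years: {incidence_rate(12, 4000.0) * 1000:.1f}")
    low, high = wilson_interval(25, 500)
    print(f"Approx. 95% CI for prevalence: {low:.3f} to {high:.3f}")
```

For tabular data, the same arithmetic could be applied to grouped counts in pandas; the pure-Python version is shown here so it runs without any dependencies.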


Use Cases

  • Infectious disease diagnostics and surveillance
  • Antimicrobial resistance analytics
  • Genomic and molecular biology data analysis
  • Epidemiological modeling
  • General-purpose data analytics in biomedical research

Repository Structure

  • /scripts/ – Core Python scripts and utility functions
  • /notebooks/ – Example Jupyter notebooks and analyses
  • /data/ – Sample or test datasets (de-identified or simulated)
  • /figures/ – Example output plots and visualization templates

Folders will be updated as the repository grows.


Getting Started

  1. Clone this repository:

    git clone https://github.com/joseiky/Data-analytics.git
    cd Data-analytics
  2. Install the required dependencies. Most scripts require pandas, numpy, scipy, scikit-learn, matplotlib, seaborn, and jupyter. Install them with:

    pip install -r requirements.txt

    (A sample requirements.txt will be provided soon. Until then, install the packages listed above directly with pip.)

  3. Run scripts or notebooks:

    • Navigate to the relevant folder
    • Open Jupyter notebooks or run .py files as needed
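Before running anything, it can help to confirm that the packages named in step 2 are importable. The following is a small sketch, not a script shipped with this repository: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names, and the import names (for example `sklearn` for scikit-learn, and `jupyter_core` as a stand-in for jupyter) are assumptions about how those packages are usually installed.

```python
from importlib.util import find_spec

# Maps the pip package name to the top-level module it is assumed to install.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "jupyter": "jupyter_core",  # assumption: pulled in as a jupyter dependency
}


def missing_packages(required: dict[str, str] = REQUIRED) -> list[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
        print("Try: pip install " + " ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

The check only looks up each module without importing it, so it is quick and has no side effects.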

License

This repository is licensed under the MIT License. See LICENSE for details.


About

Created and maintained by Dr. John Osei, Extraordinary Professor, Medical Microbiology & Bioinformatics. Contact: jod14139@yahoo.com


Keywords

machine-learning, bioinformatics, genomics, statistics, epidemiology, infectious-diseases, antibiotic-resistance, visualization, data-cleaning


Feel free to contribute, open issues, or fork the repository!