Commit a29a5f7

🏴‍☠️ Compress data files for GitHub compatibility
- Replace large CSV files (151MB) with compressed zip (20MB)
- Add automatic data extraction on first run
- Update documentation for zip-based data distribution
- Maintain backward compatibility with existing code
- All tests passing with new data handling
1 parent 74bd74e commit a29a5f7

8 files changed: +55 -184 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -49,6 +49,9 @@ ehthumbs.db
 Thumbs.db
 
 # Data files (these be large treasures that shouldn't go in git)
+data/kaggle_so_2023/
+# But keep the zip file for distribution
+!data/kaggle_so_2023_data.zip
 data/*.csv
 data/*.json
 data/*.xlsx

README.md

Lines changed: 26 additions & 5 deletions
@@ -56,12 +56,33 @@ This application is designed specifically for **data analysts** who need:
 
 ## 🏴‍☠️ Setup Instructions
 
-### 1. Data Setup (Already Done!)
+### 1. Data Setup
 
-The Stack Overflow 2023 survey data is already available in the `data/kaggle_so_2023/` directory with:
-- `survey_results_public.csv` - Main survey responses
-- `survey_results_schema.csv` - Data schema and column descriptions
-- Additional documentation files
+The Stack Overflow 2023 survey data is provided as a compressed zip file to keep the repository size manageable:
+
+**Option A: Automatic Extraction (Recommended)**
+- The application will automatically extract `data/kaggle_so_2023_data.zip` when first run
+- No manual action needed - just start the server!
+
+**Option B: Manual Extraction**
+```bash
+# Navigate to the data directory
+cd data
+
+# Extract the zip file
+unzip kaggle_so_2023_data.zip
+
+# This creates the kaggle_so_2023/ directory with:
+# - survey_results_public.csv (151MB - main survey responses)
+# - survey_results_schema.csv (data schema and column descriptions)
+# - Additional documentation files
+```
+
+**Data Contents:**
+- `survey_results_public.csv` - Main survey responses (151MB)
+- `survey_results_schema.csv` - Data schema and column descriptions
+- `so_survey_2023.pdf` - Survey documentation
+- `README_2023.txt` - Additional information
 
 ### 2. Install Dependencies
 
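Options A and B in the README hunk only produce the same directory tree if the archive layout cooperates: `unzip` run inside `data/` reproduces whatever top-level paths the archive stores, while `extractall(extract_dir)` in `app/data_config.py` places every member under `kaggle_so_2023/`. The commit does not show the archive's layout, so a quick check may be worthwhile. The helper name `top_level_entries` below is hypothetical and not part of the commit; this is a minimal sketch using only the standard library.

```python
import zipfile
from pathlib import Path, PurePosixPath


def top_level_entries(zip_path: Path) -> set[str]:
    """Return the distinct first path components of all archive members.

    Hypothetical helper for illustration; it is not part of the commit.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Zip member names always use forward slashes, hence PurePosixPath.
        return {PurePosixPath(name).parts[0] for name in archive.namelist() if name}


if __name__ == "__main__":
    # Path taken from the README hunk; adjust if the repository layout differs.
    zip_path = Path("data/kaggle_so_2023_data.zip")
    if zip_path.exists():
        print(sorted(top_level_entries(zip_path)))
```

If the result is a single entry named `kaggle_so_2023`, manual `unzip` yields `data/kaggle_so_2023/`, but automatic extraction would nest it one level deeper. If the result is the list of CSV and documentation files, automatic extraction yields `data/kaggle_so_2023/`, but manual `unzip` would place the files directly in `data/`.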

app/data_config.py

Lines changed: 20 additions & 0 deletions
@@ -6,6 +6,7 @@
 
 import os
 import pandas as pd
+import zipfile
 from typing import Dict, List, Optional, Any
 from dataclasses import dataclass
 from pathlib import Path
@@ -42,8 +43,27 @@ class DataManager:
     def __init__(self, base_data_path: str):
         self.base_data_path = Path(base_data_path)
         self.data_sources = {}
+        self._ensure_data_extracted()
         self._setup_default_sources()
 
+    def _ensure_data_extracted(self):
+        """
+        Yarr! Make sure the data treasure be extracted from zip if needed
+        """
+        zip_file_path = self.base_data_path / "kaggle_so_2023_data.zip"
+        extract_dir = self.base_data_path / "kaggle_so_2023"
+
+        # If zip exists but extracted directory doesn't, extract it
+        if zip_file_path.exists() and not extract_dir.exists():
+            print("🏴‍☠️ Ahoy! Extracting data treasure from zip file...")
+            try:
+                with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
+                    zip_ref.extractall(extract_dir)
+                print("⚓ Data successfully extracted, matey!")
+            except Exception as e:
+                print(f"🚨 Blimey! Error extracting data: {e}")
+                raise RuntimeError(f"Failed to extract data from zip file: {e}")
+
     def _setup_default_sources(self):
         """Set up the default data sources we know about"""
 
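The check in `_ensure_data_extracted` skips extraction whenever `kaggle_so_2023/` exists, so an extraction that fails partway can leave a partial directory that is never repaired on later runs. One possible variant, sketched below, stages the files in a temporary sibling directory and renames it into place only after extraction succeeds. The function name `extract_once` and its boolean return value are assumptions made for illustration, not the commit's code.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_once(zip_path: Path, extract_dir: Path) -> bool:
    """Extract zip_path into extract_dir unless extract_dir already exists.

    Hypothetical alternative to _ensure_data_extracted. Returns True if
    extraction ran and False if it was skipped.
    """
    if extract_dir.exists() or not zip_path.exists():
        return False

    extract_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=extract_dir.name + ".", dir=extract_dir.parent))
    try:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(staging)
        # Only a fully extracted tree ever appears under the final name.
        staging.rename(extract_dir)
    except Exception as exc:
        shutil.rmtree(staging, ignore_errors=True)
        # "from exc" keeps the original traceback attached to the new error.
        raise RuntimeError(f"Failed to extract data from {zip_path}") from exc
    return True
```

Inside `DataManager`, this could be called as `extract_once(self.base_data_path / "kaggle_so_2023_data.zip", self.base_data_path / "kaggle_so_2023")`. Note that `tempfile.mkdtemp` creates the directory readable only by the current user, which may or may not suit a shared deployment.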

data/kaggle_so_2023/README_2023.txt

Lines changed: 0 additions & 31 deletions
This file was deleted.
-1.3 MB
Binary file not shown.
