@@ -56,27 +56,69 @@ This application is designed specifically for **data analysts** who need:

## 🏴‍☠️ Setup Instructions

- ### 1. Data Setup
-
- The Stack Overflow 2023 survey data is provided as a compressed zip file to keep the repository size manageable:
-
- **Option A: Automatic Extraction (Recommended)**
- - The application will automatically extract `data/kaggle_so_2023_data.zip` when first run
- - No manual action needed - just start the server!
-
- **Option B: Manual Extraction**
- ```bash
- # Navigate to the data directory
- cd data
-
- # Extract the zip file
- unzip kaggle_so_2023_data.zip
-
- # This creates the kaggle_so_2023/ directory with:
- # - survey_results_public.csv (151MB - main survey responses)
- # - survey_results_schema.csv (data schema and column descriptions)
- # - Additional documentation files
+ ### 1. Data Setup (Smart Zip Management System)
+
+ The application features an **intelligent data management system** designed for data analysts who work with multiple datasets:
+
+ #### 🤖 Automatic Data Source Detection
+ - **Zero Configuration**: Drop any survey data zip file into the `data/` folder
+ - **Auto-Extraction**: Zip files are extracted automatically on application startup
+ - **Smart Detection**: CSV files are discovered and configured automatically
+ - **Technology Analysis**: Columns with semicolon-separated tech lists are auto-detected
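
The auto-extraction step described above could look roughly like the following. This is a minimal sketch, not the application's actual code: the helper name `extract_new_archives` is hypothetical, and it assumes each archive is unpacked into a directory named after the zip file (for example `data/kaggle_so_2023.zip` into `data/kaggle_so_2023/`) and skipped if that directory already exists.

```python
from __future__ import annotations

import zipfile
from pathlib import Path


def extract_new_archives(data_dir: Path) -> list[Path]:
    """Extract every zip in data_dir that has no matching directory yet.

    Hypothetical helper: the target directory is named after the zip stem,
    which is an assumption about the layout, not confirmed behavior.
    """
    extracted = []
    for archive in sorted(data_dir.glob("*.zip")):
        target = data_dir / archive.stem
        if target.is_dir():
            continue  # already extracted on a previous startup
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

Calling `extract_new_archives(Path("data"))` once at startup would return the list of directories it created, and an empty list on later runs when nothing new was dropped into the folder.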
+
+ #### 📦 Current Data Sources
+ - **Stack Overflow 2023**: `kaggle_so_2023.zip` (20MB compressed → 151MB extracted)
+   - Contains `survey_results_public.csv` with 89,000+ developer responses
+   - Includes `survey_results_schema.csv` with column definitions
+   - Pre-configured with 8 technology analysis categories
+
+ #### ➕ Adding New Data Sources (Open-Ended Design)
+ Perfect for data analysts working with multiple survey datasets:
+
+ 1. **Prepare Your Data**:
+    ```
+    your_survey_data/
+    ├── main_survey_responses.csv    # Main data (any CSV name works)
+    ├── schema_definitions.csv       # Optional (detected by "schema" in name)
+    └── documentation.txt            # Additional files (ignored)
+    ```
+
+ 2. **Create Zip Archive**:
+    ```bash
+    zip -r your_survey_2024.zip your_survey_data/
+    ```
+
+ 3. **Deploy to Application**:
+    ```bash
+    cp your_survey_2024.zip /path/to/project/data/
+    # Application auto-detects and configures on next startup
+    ```
+
+ 4. **Automatic Configuration**:
+    - Main data file detected (largest CSV or one with "survey"/"results" in name)
+    - Schema file detected (contains "schema" in filename)
+    - Technology columns identified (contain "language", "database", "platform", etc.)
+    - New data source registered and available in dashboard
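
The detection rules in step 4 can be sketched as three small functions. All names here (`pick_schema_file`, `pick_main_file`, `technology_columns`, `MAIN_HINTS`, `TECH_HINTS`) are hypothetical. The README does not say whether the name hint or the file size takes precedence, so this sketch assumes name hints win and size breaks ties. It also assumes schema files are excluded first, since a name like `survey_results_schema.csv` would otherwise match the "survey"/"results" hints too.

```python
from __future__ import annotations

from pathlib import Path

MAIN_HINTS = ("survey", "results")
# Only the three keywords named in the README; its "etc." is left open.
TECH_HINTS = ("language", "database", "platform")


def pick_schema_file(csv_files: list[Path]) -> Path | None:
    """Return the first CSV whose filename contains "schema", if any."""
    for path in csv_files:
        if "schema" in path.name.lower():
            return path
    return None


def pick_main_file(csv_files: list[Path]) -> Path | None:
    """Prefer CSVs with a name hint, then take the largest (assumed order)."""
    candidates = [p for p in csv_files if "schema" not in p.name.lower()]
    if not candidates:
        return None
    named = [p for p in candidates if any(h in p.name.lower() for h in MAIN_HINTS)]
    pool = named or candidates
    return max(pool, key=lambda p: p.stat().st_size)


def technology_columns(header: list[str]) -> list[str]:
    """Keep column names containing one of the technology keywords."""
    return [c for c in header if any(h in c.lower() for h in TECH_HINTS)]
```

With a header such as `["Age", "LanguageHaveWorkedWith", "DatabaseWantToWorkWith"]`, `technology_columns` keeps the last two entries. Matching on substrings is deliberately loose, so a real implementation would likely also check that the column's values are semicolon-separated lists.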
+
+ #### Data Format Requirements
+ - **Primary Format**: CSV files with semicolon-separated technology lists
+ - **Column Detection**: Automatic detection of technology-related columns
+ - **Schema Support**: Optional schema files for column descriptions
+ - **Size Limit**: Zip files should stay under GitHub's 100MB per-file limit
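
To make the "semicolon-separated technology lists" format concrete, here is a small sketch of how such a column might be tallied. The function name `count_technologies` is hypothetical, the sample rows are invented for illustration, and treating the literal string `NA` as a missing value is an assumption about the data rather than something the README states.

```python
import csv
import io
from collections import Counter


def count_technologies(rows, column: str) -> Counter:
    """Count each technology in a semicolon-separated column."""
    counts = Counter()
    for row in rows:
        cell = (row.get(column) or "").strip()
        if not cell or cell == "NA":  # assumed marker for a missing answer
            continue
        counts.update(item.strip() for item in cell.split(";") if item.strip())
    return counts


# Invented sample data in the expected shape.
sample = io.StringIO(
    "ResponseId,LanguageHaveWorkedWith\n"
    "1,Python;SQL\n"
    "2,Python;Go\n"
    "3,NA\n"
)
counts = count_technologies(csv.DictReader(sample), "LanguageHaveWorkedWith")
print(counts.most_common(1))  # [('Python', 2)]
```

Because the function only needs an iterable of dict-like rows, the same code works with `csv.DictReader` over a real file opened from an extracted data source.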
+
+ #### Example Multi-Source Setup
```
+ data/
+ ├── kaggle_so_2023.zip              # Stack Overflow 2023
+ ├── kaggle_so_2023/                 # Auto-extracted
+ ├── github_dev_survey_2024.zip      # Your GitHub survey
+ ├── github_dev_survey_2024/         # Auto-extracted
+ ├── company_internal_survey.zip     # Internal survey
+ ├── company_internal_survey/        # Auto-extracted
+ └── .gitignore                      # Excludes CSV files, includes zips
+ ```
+
+ Each data source becomes automatically available in the dashboard with detected technology categories!

**Data Contents:**
- `survey_results_public.csv` - Main survey responses (151MB)