% Data cleaning and processing % Phil Grunewald % Last updated: 30 May 2025
Revision history:
v1 21 May 25: Initial version - prior to separation of cleaning and processing
v2 30 May 25: Separation of cleaning and processing
v3 17 Jun 25: New survey deposition (3671 CLIENT DATA - FINAL DATA SET (20250616) UCL.csv -> raw/2025_06_survey.csv)
The Oxford EDOL team will conduct data checks, cleaning, processing and analysis on survey, smart meter and other data collected by EDOL.
Each of us use different tools for different purposes, but we aim to maintain consistent and compatible data formats and methods.
This document seeks to explain some of the processes being applied to the survey data and describes the resulting data files.
The following steps and principles apply:
Original raw data is kept in unchanged form in the raw/
folder. The default file format is CSV. Exports from other formats are converted to csv where possible. Make sure to use UTF-8 encoding when saving CSV files.
Data cleaning steps change the raw data in instances where confidence is high that not doing so would lead to incorrect results.
Cleaned files are saved in the clean/
folder.
The processing steps use cleaned data and serve ease of analysis.
Processing does not change the cleaned data, but rather adds new columns that are derived from the cleaned data. It is saved in the processed/
folder.
All anlysis should be performed on cleaned or processed data, never on raw data. Analysis scripts and results are kept separate from the above.
The log of data cleaning and processing steps is SurveyDataCleaning Log.xlsx
The first deposition is "3671 CLIENT DATA - DATA SET 1 (20250430).xlsx". (password protected)
This file is reduced to IDs (PUPRN) in "survey-2025-04-25_wPUPRN.csv"
Columns: 284 columns with the question code
Rows: 1944 rows (one per respondent)
Original data: raw/data_dictionary.xlsx
.
csv version: raw/dictionary.csv
.
Coloumn | Description |
---|---|
code | Accent question number |
question | Question or Option wording |
response | Coded values (int -9..n) |
meaning | Response wording |
Script clean/survey.py
Input: raw/survey.csv
Output: clean/survey.csv
Steps:
- age responses > 9 are recoded to 1 (i.e. 1 person in that age group who happend to enter their age instead)
- recode Q2 'non applicable' from 5 to -5 (to differentiate with 1-4 being 1-4 people in that age group)
Script clean/dictionary.py
Input: raw/dictionary.csv
Output: clean/dictionary.csv
Steps:
- save the XLSX file as csv (because anything Microsoft sucks)
- replace all trailing commas and CR in vim
- replace all "9a" with "9" (vim: %s/9a/9/)
- replace all "-1a" with "-1"
- forwardfill all the empty cells from above
- remove duplicated codes in the question column
Script processed/survey.py
Input: clean/survey.csv
Output: processed/survey.csv
The survey asks my age group (R for rows) and gender (C for columns) in the household.
New columns are generated as:
Q1 => sum of all rows and columns (total number of people in the household)
Q1R1 => Q1R1C1 + Q1R1C2 # 0-15 (male + female)
Q1R2 => Q1R2C1 + Q1R2C2 # 16-24
Q1R3 => Q1R3C1 + Q1R3C2 # 25-44
Q1R4 => Q1R4C1 + Q1R4C2 # 45-64
Q1R5 => Q1R5C1 + Q1R5C2 # 65-74
Q1R6 => Q1R6C1 + Q1R6C2 # 75-84
Q1R7 => Q1R7C1 + Q1R7C2 # 85+
Q2_Emp sum of R1 and R2 # Number of emplpoyed people in household
Q2_Emp_Ed sum of R1, R2 and R5 # Number of employed / in education
Q2_NonEmp sum of R3, R4, R5 and R6 # Totabl not in paid employment
Q13_30: Microwave
Q13_31: Air fryer
Q13_32: Slow cooker
Q13_33: Smart plug
Q13_34: Smart thermostat
Q20_30: Microwave
Q20_31: Air fryer
Q20_32: Slow cooker
Q20_33: Smart plug
Q20_34: Smart thermostat
Script clean/dictionary.py
Input: clean/dictionary.csv
Output: processed/dictionary.csv
Question with multiple options carry the full question under each option. Simplified by having the question as an additional row. Were opions are QxRiCj, the question is coded as Qx (and columns response
and meaning
are NaN).
Add new rows for each added column in the survey, where column meaning
starts with "Derived ..."