Introduction

% Data cleaning and processing % Phil Grunewald % Last updated: 30 May 2025

Revision history:

v1 21 May 25: Initial version - prior to separation of cleaning and processing

v2 30 May 25: Separation of cleaning and processing

v3 17 Jun 25: New survey deposition (3671 CLIENT DATA - FINAL DATA SET (20250616) UCL.csv -> raw/2025_06_survey.csv)

Introduction

The Oxford EDOL team will conduct data checks, cleaning, processing and analysis on survey, smart meter and other data collected by EDOL.

Each of us uses different tools for different purposes, but we aim to maintain consistent and compatible data formats and methods.

This document seeks to explain some of the processes being applied to the survey data and describes the resulting data files.

The following steps and principles apply:

Raw data

Original raw data is kept unchanged in the raw/ folder. The default file format is CSV. Exports from other formats are converted to CSV where possible. Save CSV files with UTF-8 encoding.
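As an illustration of the UTF-8 requirement, the following is a minimal sketch using only the Python standard library. The function names save_csv_utf8 and load_csv_utf8 are hypothetical and are not taken from the project scripts.

```python
import csv
from pathlib import Path


def save_csv_utf8(rows, fieldnames, path):
    """Write rows (a list of dicts) to a CSV file with explicit UTF-8 encoding."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings on every platform
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_csv_utf8(path):
    """Read a UTF-8 CSV file back into a list of dicts (all values are strings)."""
    with Path(path).open(encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

Stating the encoding explicitly avoids depending on the platform default, which differs between operating systems.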

Cleaning

Cleaning changes values taken from the raw data only where confidence is high that leaving them unchanged would lead to incorrect results.

Cleaned files are saved in the clean/ folder.

Processing

The processing steps use cleaned data and are intended to make analysis easier.

Processing does not change the cleaned data, but adds new columns derived from it. The result is saved in the processed/ folder.

Analysis

All analysis should be performed on cleaned or processed data, never on raw data. Analysis scripts and results are kept separate from the above.

The log of data cleaning and processing steps is kept in SurveyDataCleaning Log.xlsx.

Raw data

Accent survey

The first deposition is "3671 CLIENT DATA - DATA SET 1 (20250430).xlsx" (password protected).

This file is reduced to IDs (PUPRN) in "survey-2025-04-25_wPUPRN.csv"

Columns: 284 columns with the question code

Rows: 1944 rows (one per respondent)

Accent data dictionary

Original data: raw/data_dictionary.xlsx.

csv version: raw/dictionary.csv.

Column	Description
code	Accent question number
question	Question or Option wording
response	Coded values (int -9..n)
meaning	Response wording

Cleaning

Survey

Script clean/survey.py

Input: raw/survey.csv

Output: clean/survey.csv

Steps:

age responses > 9 are recoded to 1 (i.e. one person in that age group, where the respondent happened to enter their age instead of a count)
recode Q2 'non applicable' from 5 to -5 (to distinguish it from the values 1-4, which mean 1-4 people in that age group)
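The two recoding steps above can be sketched as follows. This is not the content of clean/survey.py. It assumes that each respondent is a dict mapping column names to integers (or None), that the age group columns start with "Q1R" and that the Q2 columns start with "Q2R"; both prefixes and the function name clean_row are assumptions made for illustration.

```python
def clean_row(row, age_prefix="Q1R", q2_prefix="Q2R"):
    """Return a cleaned copy of one survey row (dict of column -> int or None)."""
    cleaned = dict(row)
    for column, value in row.items():
        if value is None:
            continue
        if column.startswith(age_prefix) and value > 9:
            # An age was entered instead of a head count: treat as one person.
            cleaned[column] = 1
        elif column.startswith(q2_prefix) and value == 5:
            # Move 'non applicable' out of the range of valid head counts.
            cleaned[column] = -5
    return cleaned
```

The "Q2R" prefix is used rather than "Q2" so that columns such as Q20_30 are left untouched.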

Dictionary

Script clean/dictionary.py

Input: raw/dictionary.csv

Output: clean/dictionary.csv

Steps:

save the XLSX file as csv (because anything Microsoft sucks)
replace all trailing commas and CR in vim
replace all "9a" with "9" (vim: %s/9a/9/)
replace all "-1a" with "-1"
forward-fill all empty cells with the value from the cell above
remove duplicated codes in the question column
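Some of the steps above were done by hand in vim. As a rough equivalent, the sketch below applies the "9a" and "-1a" replacements and the forward fill to rows read from the CSV as dicts of strings. Applying the replacement to the response column only is an assumption, the function names are hypothetical, and the removal of duplicated codes is not covered.

```python
def fix_response_codes(rows):
    """Replace the codes '9a' and '-1a' in the response column with '9' and '-1'."""
    fixed = []
    for row in rows:
        new_row = dict(row)
        new_row["response"] = row["response"].replace("9a", "9").replace("-1a", "-1")
        fixed.append(new_row)
    return fixed


def forward_fill(rows, columns):
    """Fill empty cells in the given columns with the last non-empty value above."""
    last_seen = {}
    filled = []
    for row in rows:
        new_row = dict(row)
        for column in columns:
            if new_row.get(column, "") == "":
                new_row[column] = last_seen.get(column, "")
            else:
                last_seen[column] = new_row[column]
        filled.append(new_row)
    return filled
```

A call such as forward_fill(fix_response_codes(rows), ["code", "question"]) would apply both steps in sequence.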

Processing

Survey

Script processed/survey.py

Input: clean/survey.csv

Output: processed/survey.csv

Household size

The survey asks for the number of people in the household by age group (R for rows) and gender (C for columns).

New columns are generated as:

Q1 => sum of all rows and columns (total number of people in the household)

Q1R1 => Q1R1C1 + Q1R1C2 # 0-15 (male + female)

Q1R2 => Q1R2C1 + Q1R2C2 # 16-24

Q1R3 => Q1R3C1 + Q1R3C2 # 25-44

Q1R4 => Q1R4C1 + Q1R4C2 # 45-64

Q1R5 => Q1R5C1 + Q1R5C2 # 65-74

Q1R6 => Q1R6C1 + Q1R6C2 # 75-84

Q1R7 => Q1R7C1 + Q1R7C2 # 85+
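The derived columns above can be sketched as below, for one respondent held as a dict of column name to integer. Treating missing values and negative codes as zero people is an assumption of this sketch, not something confirmed by processed/survey.py, and the function names are hypothetical.

```python
AGE_ROWS = range(1, 8)   # R1..R7, the seven age groups listed above
GENDER_COLUMNS = (1, 2)  # C1 and C2


def head_count(value):
    """Treat missing values and negative codes as zero people (an assumption)."""
    return value if value is not None and value > 0 else 0


def add_household_size(row):
    """Return a copy of the row with Q1R1..Q1R7 and the household total Q1 added."""
    derived = dict(row)
    total = 0
    for r in AGE_ROWS:
        group = sum(head_count(row.get(f"Q1R{r}C{c}")) for c in GENDER_COLUMNS)
        derived[f"Q1R{r}"] = group
        total += group
    derived["Q1"] = total
    return derived
```

The original Q1RiCj columns are kept, in line with the principle that processing only adds columns.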

Employment status

Q2_Emp => sum of R1 and R2 # Number of employed people in household

Q2_Emp_Ed => sum of R1, R2 and R5 # Number of employed / in education

Q2_NonEmp => sum of R3, R4, R5 and R6 # Total not in paid employment
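A minimal sketch of these three sums follows. It assumes the Q2 columns are named Q2R1 to Q2R6 and that missing values and negative codes (such as the recoded -5) count as zero; neither assumption is confirmed by the source, and the function name is hypothetical.

```python
EMPLOYMENT_GROUPS = {
    "Q2_Emp": (1, 2),           # employed
    "Q2_Emp_Ed": (1, 2, 5),     # employed or in education
    "Q2_NonEmp": (3, 4, 5, 6),  # not in paid employment
}


def add_employment_counts(row):
    """Return a copy of the row with the three derived Q2 columns added."""
    derived = dict(row)
    for name, q2_rows in EMPLOYMENT_GROUPS.items():
        # 'or 0' maps None to 0; max(..., 0) ignores negative codes.
        derived[name] = sum(max(row.get(f"Q2R{r}") or 0, 0) for r in q2_rows)
    return derived
```

Keeping the row groupings in one dictionary makes it easy to check them against the definitions above.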

Coding plain text responses

Q13_30: Microwave

Q13_31: Air fryer

Q13_32: Slow cooker

Q13_33: Smart plug

Q13_34: Smart thermostat

Q20_30: Microwave

Q20_31: Air fryer

Q20_32: Slow cooker

Q20_33: Smart plug

Q20_34: Smart thermostat
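The source does not say how the plain text responses are matched to these codes. One possible approach is simple keyword matching, sketched below. The matching rule, the 0/1 flag values and the function name code_free_text are all assumptions; only the code numbers and appliance names come from the list above.

```python
APPLIANCE_CODES = {
    30: "microwave",
    31: "air fryer",
    32: "slow cooker",
    33: "smart plug",
    34: "smart thermostat",
}


def code_free_text(text, question="Q13"):
    """Return 0/1 flags per appliance column for one free-text answer."""
    lowered = text.lower()
    return {
        f"{question}_{code}": int(keyword in lowered)
        for code, keyword in APPLIANCE_CODES.items()
    }
```

Passing question="Q20" produces the Q20_30 to Q20_34 columns with the same rule. Keyword matching misses misspellings, so manual review of unmatched answers would still be needed.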

Dictionary

Script clean/dictionary.py

Input: clean/dictionary.csv

Output: processed/dictionary.csv

Questions with multiple options carry the full question under each option. This is simplified by adding the question as an additional row. Where options are coded QxRiCj, the question is coded as Qx (and the columns response and meaning are NaN).

A new row is added for each derived column in the survey; in these rows the meaning column starts with "Derived ...".
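The additional question rows described above could be generated along these lines. The sketch works on dictionary rows with the columns code, question, response and meaning, uses None in place of NaN, and leaves the option rows unchanged. The regular expression (which also accepts codes without a column part, such as QxRi) and the function name are assumptions.

```python
import re

# Matches option codes such as Q1R3C2; the column part is treated as optional.
OPTION_PATTERN = re.compile(r"^(Q\d+)R\d+(C\d+)?$")


def add_question_rows(rows):
    """Insert one row per multi-option question ahead of its first option row."""
    result = []
    seen = set()
    for row in rows:
        match = OPTION_PATTERN.match(row["code"])
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            result.append({
                "code": match.group(1),
                "question": row["question"],
                "response": None,  # stands in for NaN
                "meaning": None,
            })
        result.append(row)
    return result
```

Rows for derived survey columns could be appended in the same way, with a meaning that starts with "Derived".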

smartEnergyResearchLab/plus_survey

Folders and files

Latest commit

History

Repository files navigation

Introduction

Raw data

Cleaning

Processing

Analysis

Raw data

Accent survey

Accent data dictionary

Cleaning

Survey

Dictionary

Processing

Survey

Household size

Employment status

Coding plain text responses

Dictionary

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages