Skip to content

smartEnergyResearchLab/plus_survey

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

% Data cleaning and processing % Phil Grunewald % Last updated: 30 May 2025

Revision history:

v1 21 May 25: Initial version - prior to separation of cleaning and processing

v2 30 May 25: Separation of cleaning and processing

v3 17 Jun 25: New survey deposition (3671 CLIENT DATA - FINAL DATA SET (20250616) UCL.csv -> raw/2025_06_survey.csv)

Introduction

The Oxford EDOL team will conduct data checks, cleaning, processing and analysis on survey, smart meter and other data collected by EDOL.

Each of us use different tools for different purposes, but we aim to maintain consistent and compatible data formats and methods.

This document seeks to explain some of the processes being applied to the survey data and describes the resulting data files.

The following steps and principles apply:

Raw data

Original raw data is kept in unchanged form in the raw/ folder. The default file format is CSV. Exports from other formats are converted to csv where possible. Make sure to use UTF-8 encoding when saving CSV files.

Cleaning

Data cleaning steps change the raw data in instances where confidence is high that not doing so would lead to incorrect results.

Cleaned files are saved in the clean/ folder.

Processing

The processing steps use cleaned data and serve ease of analysis.

Processing does not change the cleaned data, but rather adds new columns that are derived from the cleaned data. It is saved in the processed/ folder.

Analysis

All anlysis should be performed on cleaned or processed data, never on raw data. Analysis scripts and results are kept separate from the above.

The log of data cleaning and processing steps is SurveyDataCleaning Log.xlsx

Raw data

Accent survey

The first deposition is "3671 CLIENT DATA - DATA SET 1 (20250430).xlsx". (password protected)

This file is reduced to IDs (PUPRN) in "survey-2025-04-25_wPUPRN.csv"

Columns: 284 columns with the question code

Rows: 1944 rows (one per respondent)

Accent data dictionary

Original data: raw/data_dictionary.xlsx.

csv version: raw/dictionary.csv.

Coloumn Description
code Accent question number
question Question or Option wording
response Coded values (int -9..n)
meaning Response wording

Cleaning

Survey

Script clean/survey.py

Input: raw/survey.csv

Output: clean/survey.csv

Steps:

  • age responses > 9 are recoded to 1 (i.e. 1 person in that age group who happend to enter their age instead)
  • recode Q2 'non applicable' from 5 to -5 (to differentiate with 1-4 being 1-4 people in that age group)

Dictionary

Script clean/dictionary.py

Input: raw/dictionary.csv

Output: clean/dictionary.csv

Steps:

  • save the XLSX file as csv (because anything Microsoft sucks)
  • replace all trailing commas and CR in vim
  • replace all "9a" with "9" (vim: %s/9a/9/)
  • replace all "-1a" with "-1"
  • forwardfill all the empty cells from above
  • remove duplicated codes in the question column

Processing

Survey

Script processed/survey.py

Input: clean/survey.csv

Output: processed/survey.csv

Household size

The survey asks my age group (R for rows) and gender (C for columns) in the household.

New columns are generated as:

Q1 => sum of all rows and columns (total number of people in the household)

Q1R1 => Q1R1C1 + Q1R1C2 # 0-15 (male + female)

Q1R2 => Q1R2C1 + Q1R2C2 # 16-24

Q1R3 => Q1R3C1 + Q1R3C2 # 25-44

Q1R4 => Q1R4C1 + Q1R4C2 # 45-64

Q1R5 => Q1R5C1 + Q1R5C2 # 65-74

Q1R6 => Q1R6C1 + Q1R6C2 # 75-84

Q1R7 => Q1R7C1 + Q1R7C2 # 85+

Employment status

Q2_Emp sum of R1 and R2 # Number of emplpoyed people in household

Q2_Emp_Ed sum of R1, R2 and R5 # Number of employed / in education

Q2_NonEmp sum of R3, R4, R5 and R6 # Totabl not in paid employment

Coding plain text responses

Q13_30: Microwave

Q13_31: Air fryer

Q13_32: Slow cooker

Q13_33: Smart plug

Q13_34: Smart thermostat

Q20_30: Microwave

Q20_31: Air fryer

Q20_32: Slow cooker

Q20_33: Smart plug

Q20_34: Smart thermostat

Dictionary

Script clean/dictionary.py

Input: clean/dictionary.csv

Output: processed/dictionary.csv

Question with multiple options carry the full question under each option. Simplified by having the question as an additional row. Were opions are QxRiCj, the question is coded as Qx (and columns response and meaning are NaN).

Add new rows for each added column in the survey, where column meaning starts with "Derived ..."

About

Scripts for cleaning, processing and analysing survey responses

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages