Don't put fields with null values in `sequence_entries_preprocessed_data.processed_data` #2793

corneliusroemer · 2024-09-15T18:28:13Z

It seems we store a whole lot of nothing (i.e., fields with null values) in the `sequence_entries_preprocessed_data.processed_data` column.

We should switch to treating missing fields as null and stop saving null-valued fields in the table. This would roughly halve the size of that table.
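
A minimal sketch of the write-side idea, assuming the processed data is a JSON object with a `metadata` map like the example row in this issue. The helper name `strip_null_metadata` is hypothetical and not taken from the Loculus backend; the sketch only shows dropping null-valued metadata keys before serialising and comparing the serialised sizes.

```python
import json


def strip_null_metadata(processed_data: dict) -> dict:
    """Return a copy of processed_data whose metadata omits null-valued fields."""
    sparse = dict(processed_data)  # shallow copy; other top-level keys are kept as-is
    sparse["metadata"] = {
        name: value
        for name, value in processed_data.get("metadata", {}).items()
        if value is not None
    }
    return sparse


# A few fields taken from the example row in this issue.
row = {
    "metadata": {
        "length": 17,
        "hostAge": None,
        "lineage": None,
        "cellLine": None,
        "displayName": "LOC_000CV63.1",
        "ncbiSourceDb": "GenBank",
        "geoLocCountry": None,
    },
    "nucleotideInsertions": {"main": []},
}

sparse_row = strip_null_metadata(row)
full_size = len(json.dumps(row))
sparse_size = len(json.dumps(sparse_row))
print(sorted(sparse_row["metadata"]))
print(f"{full_size} bytes -> {sparse_size} bytes")
```

The input row is left unmodified, so the same object can still be returned to clients with its nulls intact.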

Example processed data row:

> LOC_000CV63 | 1 | 1 | {"metadata": {"length": 17, "authors": "J.A. Mcswiggen, L. Blatt", "hostAge": null, "lineage": null, "cellLine": null, "hostRole": null, "cultureId": null, "isLabHost": null, "totalSnps": null, "geoLocCity": null, "geoLocSite": null, "hostAgeBin": null, "hostGender": null, "sampleType": null, "stopCodons": null, "bodyProduct": null, "displayName": "LOC_000CV63.1", "foodProduct": null, "frameShifts": null, "hostDisease": null, "hostTaxonId": null, "ampliconSize": null, "completeness": null, "geoLocAdmin1": null, "geoLocAdmin2": null, "insdcVersion": 1, "ncbiSourceDb": "GenBank", "exposureEvent": null, "geoLocCountry": null, "ncbiVirusName": "West Nile virus", "passageMethod": null, "passageNumber": null, "travelHistory": null, "anatomicalPart": null, "geoLocLatitude": null, "hostNameCommon": null, "ncbiUpdateDate": "2009-05-23", "ncbiVirusTaxId": 11082, "sequencingDate": null, "versionComment": null, "dehostingMethod": null, "depthOfCoverage": null, "exposureDetails": null, "exposureSetting": null, "geoLocLongitude": null, "hostHealthState": null, "ncbiReleaseDate": "2009-05-23", "sraRunAccession": null, "totalStopCodons": null, "collectionDevice": null, "collectionMethod": null, "signsAndSymptoms": null, "totalDeletedNucs": null, "totalFrameShifts": null, "totalUnknownNucs": null, "breadthOfCoverage": null, "environmentalSite": null, "hostHealthOutcome": null, "hostOriginCountry": null, "purposeOfSampling": null, "totalInsertedNucs": null, "anatomicalMaterial": null, "authorAffiliations": null, "biosampleAccession": null, "hostNameScientific": null, "insdcAccessionBase": "HA068283", "insdcAccessionFull": "HA068283.1", "sampleReceivedDate": null, "sequencingProtocol": null, "specimenProcessing": null, "totalAmbiguousNucs": null, "bioprojectAccession": null, "presamplingActivity": null, "purposeOfSequencing": null, "sequencingAssayType": null, "ncbiSubmitterCountry": null, "qualityControlIssues": null, "sampleCollectionDate": null, 
"sequencingInstrument": null, "environmentalMaterial": null, "foodProductProperties": null, "hostVaccinationStatus": null, "qualityControlDetails": null, "sequencedByContactName": null, "ampliconPcrPrimerScheme": null, "sequencedByContactEmail": null, "sequencedByOrganization": null, "diagnosticTargetGeneName": null, "diagnosticTargetPresence": null, "previousInfectionDisease": null, "qualityControlMethodName": null, "referenceGenomeAccession": null, "diagnosticMeasurementUnit": null, "previousInfectionOrganism": null, "specimenCollectorSampleId": null, "specimenProcessingDetails": null, "diagnosticMeasurementValue": null, "diagnosticMeasurementMethod": null, "qualityControlDetermination": null, "qualityControlMethodVersion": null, "experimentalSpecimenRoleType": null, "consensusSequenceSoftwareName": null, "rawSequenceDataProcessingMethod": null, "consensusSequenceSoftwareVersion": null}, "aminoAcidInsertions": {}, "nucleotideInsertions": {"main": []}, "alignedAminoAcidSequences": {"2K": null, "NS1": null, "NS3": null, "NS5": null, "env": null, "prM": null, "NS2A": null, "NS2B": null, "NS4A": null, "NS4B": null, "capsid": null}, "alignedNucleotideSequences": {"main": null}, "unalignedNucleotideSequences": {"main": {"compressedSequence": "KLUv/SARPQAAAAEAL5ghEA=="}}} | [{"source": [{"name": "main", "type": "NucleotideSequence"}], "message": "Nucleotide sequence failed to align"}] | [] | HAS_ERRORS | 2024-09-15 16:34:20.523262 | 2024-09-15 16:34:20.703557


fengelniederhammer · 2024-09-16T06:40:12Z

On the other hand, we have a lot of metadata fields that contain only null values. These are the fields that have only null values on Pathoplexus, listed per organism:

CCHF

Fields with only null values:
dataUseTermsRestrictedUntil
ampliconPcrPrimerScheme
ampliconSize
anatomicalMaterial
anatomicalPart
bodyProduct
breadthOfCoverage
cellLine
collectionDevice
collectionMethod
consensusSequenceSoftwareName
consensusSequenceSoftwareVersion
cultureId
dehostingMethod
depthOfCoverage
diagnosticMeasurementMethod
diagnosticMeasurementUnit
diagnosticMeasurementValue
diagnosticTargetGeneName
diagnosticTargetPresence
environmentalMaterial
environmentalSite
experimentalSpecimenRoleType
exposureDetails
exposureEvent
exposureSetting
foodProduct
foodProductProperties
geoLocCity
geoLocSite
hostAge
hostAgeBin
hostDisease
hostGender
hostHealthOutcome
hostHealthState
hostNameCommon
hostOriginCountry
hostRole
hostVaccinationStatus
ncbiSubmitterCountry
ncbiUpdateDate_L
ncbiUpdateDate_M
ncbiUpdateDate_S
passageMethod
passageNumber
presamplingActivity
previousInfectionDisease
previousInfectionOrganism
purposeOfSampling
purposeOfSequencing
qualityControlDetails
qualityControlDetermination
qualityControlIssues
qualityControlMethodName
qualityControlMethodVersion
rawSequenceDataProcessingMethod
referenceGenomeAccession
sampleReceivedDate
sampleType
sequencedByContactEmail
sequencedByContactName
sequencedByOrganization
sequencingAssayType
sequencingDate
sequencingInstrument
sequencingProtocol
signsAndSymptoms
specimenProcessing
specimenProcessingDetails
stopCodons
totalStopCodons
travelHistory
versionComment

Ebola Sudan

Fields with only null values:
dataUseTermsRestrictedUntil
ampliconPcrPrimerScheme
ampliconSize
anatomicalMaterial
anatomicalPart
bodyProduct
breadthOfCoverage
cellLine
collectionDevice
collectionMethod
consensusSequenceSoftwareName
consensusSequenceSoftwareVersion
cultureId
dehostingMethod
depthOfCoverage
diagnosticMeasurementMethod
diagnosticMeasurementUnit
diagnosticMeasurementValue
diagnosticTargetGeneName
diagnosticTargetPresence
environmentalMaterial
environmentalSite
experimentalSpecimenRoleType
exposureDetails
exposureEvent
exposureSetting
foodProduct
foodProductProperties
geoLocCity
geoLocSite
hostAge
hostAgeBin
hostDisease
hostGender
hostHealthOutcome
hostHealthState
hostNameCommon
hostOriginCountry
hostRole
hostVaccinationStatus
ncbiSubmitterCountry
passageMethod
passageNumber
presamplingActivity
previousInfectionDisease
previousInfectionOrganism
purposeOfSampling
purposeOfSequencing
qualityControlDetails
qualityControlDetermination
qualityControlIssues
qualityControlMethodName
qualityControlMethodVersion
rawSequenceDataProcessingMethod
referenceGenomeAccession
sampleReceivedDate
sampleType
sequencedByContactEmail
sequencedByContactName
sequencedByOrganization
sequencingAssayType
sequencingDate
sequencingInstrument
sequencingProtocol
signsAndSymptoms
specimenProcessing
specimenProcessingDetails
travelHistory
versionComment

Ebola Zaire

Fields with only null values:
dataUseTermsRestrictedUntil
ampliconPcrPrimerScheme
ampliconSize
anatomicalMaterial
anatomicalPart
bodyProduct
breadthOfCoverage
cellLine
collectionDevice
collectionMethod
consensusSequenceSoftwareName
consensusSequenceSoftwareVersion
cultureId
dehostingMethod
depthOfCoverage
diagnosticMeasurementMethod
diagnosticMeasurementUnit
diagnosticMeasurementValue
diagnosticTargetGeneName
diagnosticTargetPresence
environmentalMaterial
environmentalSite
experimentalSpecimenRoleType
exposureDetails
exposureEvent
exposureSetting
foodProduct
foodProductProperties
geoLocCity
geoLocSite
hostAge
hostAgeBin
hostDisease
hostGender
hostHealthOutcome
hostHealthState
hostNameCommon
hostOriginCountry
hostRole
hostVaccinationStatus
ncbiSubmitterCountry
passageMethod
passageNumber
presamplingActivity
previousInfectionDisease
previousInfectionOrganism
purposeOfSampling
purposeOfSequencing
qualityControlDetails
qualityControlDetermination
qualityControlIssues
qualityControlMethodName
qualityControlMethodVersion
rawSequenceDataProcessingMethod
referenceGenomeAccession
sampleReceivedDate
sampleType
sequencedByContactEmail
sequencedByContactName
sequencedByOrganization
sequencingAssayType
sequencingDate
sequencingInstrument
sequencingProtocol
signsAndSymptoms
specimenProcessing
specimenProcessingDetails
travelHistory
versionComment

West Nile

Fields with only null values:
dataUseTermsRestrictedUntil
ampliconPcrPrimerScheme
ampliconSize
anatomicalMaterial
anatomicalPart
bodyProduct
breadthOfCoverage
cellLine
collectionDevice
collectionMethod
consensusSequenceSoftwareName
consensusSequenceSoftwareVersion
cultureId
dehostingMethod
depthOfCoverage
diagnosticMeasurementMethod
diagnosticMeasurementUnit
diagnosticMeasurementValue
diagnosticTargetGeneName
diagnosticTargetPresence
environmentalMaterial
environmentalSite
experimentalSpecimenRoleType
exposureDetails
exposureEvent
exposureSetting
foodProduct
foodProductProperties
geoLocCity
geoLocSite
hostAge
hostAgeBin
hostDisease
hostGender
hostHealthOutcome
hostHealthState
hostNameCommon
hostOriginCountry
hostRole
hostVaccinationStatus
ncbiSubmitterCountry
passageMethod
passageNumber
presamplingActivity
previousInfectionDisease
previousInfectionOrganism
purposeOfSampling
purposeOfSequencing
qualityControlDetails
qualityControlDetermination
qualityControlIssues
qualityControlMethodName
qualityControlMethodVersion
rawSequenceDataProcessingMethod
referenceGenomeAccession
sampleReceivedDate
sampleType
sequencedByContactEmail
sequencedByContactName
sequencedByOrganization
sequencingAssayType
sequencingDate
sequencingInstrument
sequencingProtocol
signsAndSymptoms
specimenProcessing
specimenProcessingDetails
travelHistory
versionComment

Maybe we could get rid of some of these fields?

fengelniederhammer · 2024-09-16T06:40:36Z

Script to scan for null-only LAPIS fields:

#!/usr/bin/env bash

set -euo pipefail

lapis_url="https://lapis.pathoplexus.org/west-nile/"

# Get the list of metadata fields from /sample/databaseConfig
fields=$(curl -s -k -X GET "${lapis_url}sample/databaseConfig" | jq -r '.schema.metadata[].name')

echo "Fields with only null values:"

# For every field, call /sample/aggregated with {"fields": [field]}
for field in $fields; do
    response=$(curl -s -k -X POST "${lapis_url}sample/aggregated" -H "Content-Type: application/json" -d "{\"fields\": [\"$field\"]}")

    # If response.data has exactly one entry and its value for the field is JSON null, print the field.
    # --arg passes the field name safely; comparing inside jq avoids confusing the string "null" with JSON null.
    if echo "$response" | jq -e --arg f "$field" '(.data | length) == 1 and .data[0][$f] == null' > /dev/null; then
        echo "$field"
    fi
done
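
The per-field check in the script can also be written as a pure function, which makes it testable without network access. This is a sketch in Python, assuming the aggregated response has the shape the script relies on (a `data` list with one object per distinct value); the `count` key and the sample responses are hand-written assumptions, not real LAPIS output.

```python
def is_null_only(response: dict, field: str) -> bool:
    """True if the aggregated response has a single group whose value for `field` is null."""
    data = response.get("data", [])
    return len(data) == 1 and data[0].get(field) is None


# Hand-written responses in the assumed shape, not real LAPIS output.
only_null = {"data": [{"hostAge": None, "count": 5}]}
mixed = {"data": [{"hostAge": None, "count": 3}, {"hostAge": "42", "count": 2}]}
single_value = {"data": [{"hostAge": "42", "count": 5}]}

print(is_null_only(only_null, "hostAge"))     # True
print(is_null_only(mixed, "hostAge"))         # False
print(is_null_only(single_value, "hostAge"))  # False
```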

chaoran-chen · 2024-09-16T06:58:45Z

@fengelniederhammer, we introduced the fields intentionally, hoping that they will be populated more in the future, as they all carry relevant information.

@corneliusroemer, sounds good, especially internally. When sending out the data (e.g., in get-released-data), would you leave the fields empty or populate with nulls?

fengelniederhammer · 2024-09-16T07:19:33Z

SILO requires that all configured metadata fields are present in the imported NDJSON. Also, I have a slight preference for including all configured fields in the released data (even if a value is just null).
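
If nulls were omitted from storage, the full record could still be rebuilt on the way out. A minimal sketch, assuming the list of configured field names is available when the NDJSON is produced; `expand_metadata` and the field list here are hypothetical names for illustration.

```python
import json


def expand_metadata(stored: dict, configured_fields: list[str]) -> dict:
    """Return metadata with every configured field present, using None for absent ones."""
    return {name: stored.get(name) for name in configured_fields}


# Hypothetical configured field list; a real instance has many more fields.
configured_fields = ["displayName", "hostAge", "geoLocCountry", "ncbiSourceDb"]

# Sparse metadata as it might be stored once nulls are no longer written.
stored_metadata = {"displayName": "LOC_000CV63.1", "ncbiSourceDb": "GenBank"}

ndjson_line = json.dumps({"metadata": expand_metadata(stored_metadata, configured_fields)})
print(ndjson_line)
```

Every line produced this way carries the same keys in the configured order, which is what a consumer that expects all fields to be present would need.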

chaoran-chen · 2024-09-16T07:33:11Z

I also have the same slight preference! As NDJSON doesn't include schema information, always seeing all fields is useful so that you can easily see which fields exist (although the type is still not clear before seeing the first value).

corneliusroemer · 2024-09-16T12:02:22Z

Right, let me be more precise: my proposal is to change only the internal representation in the database, not what is sent to clients. Whether we send these null fields in get-released-data matters little for transfer, since they compress very well with zstd, and clients can discard them immediately.
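
The compression point can be checked on synthetic data. The sketch below uses `zlib` from the Python standard library as a stand-in for zstd, so the absolute numbers will differ from real transfers; the field names and record count are made up for illustration.

```python
import json
import zlib

# Hypothetical field names: 5 populated fields and 100 that are always null.
populated = [f"populated{i}" for i in range(5)]
always_null = [f"unused{i}" for i in range(100)]


def ndjson(include_nulls: bool, n_records: int = 1000) -> bytes:
    """Build synthetic NDJSON with or without explicit null fields."""
    lines = []
    for record_id in range(n_records):
        metadata = {name: f"value-{record_id}-{name}" for name in populated}
        if include_nulls:
            metadata.update({name: None for name in always_null})
        lines.append(json.dumps({"metadata": metadata}))
    return "\n".join(lines).encode()


with_nulls = ndjson(include_nulls=True)
without_nulls = ndjson(include_nulls=False)

# Extra bytes caused by the explicit nulls, before and after compression.
raw_overhead = len(with_nulls) - len(without_nulls)
compressed_overhead = len(zlib.compress(with_nulls)) - len(zlib.compress(without_nulls))
print(f"raw overhead of nulls: {raw_overhead} bytes")
print(f"compressed overhead of nulls: {compressed_overhead} bytes")
```

Because the same run of null fields repeats in every record, the compressor encodes it as back-references, so the extra bytes mostly disappear in the compressed stream while remaining in full in an uncompressed database column.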

The only change I'm suggesting is to not store nulls when the backend writes preprocessed data to the `sequence_entries_preprocessed_data` table.

corneliusroemer · 2024-09-18T12:10:28Z

This also has the benefit that we can add new fields without requiring a new preprocessing pipeline version. So we should do it sooner rather than later.
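
One way to picture this: if absent fields count as null, output from an older pipeline version stays valid after a field is added to the configuration. The validator below is a hypothetical sketch, not the backend's actual validation; rejecting unknown fields is an assumption made for the example.

```python
def validate_metadata(metadata: dict, configured_fields: set[str]) -> list[str]:
    """Return error messages; absent fields are fine, unknown fields are not."""
    unknown = sorted(set(metadata) - configured_fields)
    return [f"Unknown field: {name}" for name in unknown]


# Output of a pipeline version that predates a newly configured field.
pipeline_output = {"displayName": "LOC_000CV63.1", "ncbiSourceDb": "GenBank"}

old_config = {"displayName", "ncbiSourceDb", "hostAge"}
new_config = old_config | {"newlyAddedField"}  # hypothetical new field

print(validate_metadata(pipeline_output, old_config))   # []
print(validate_metadata(pipeline_output, new_config))   # []
print(validate_metadata({"typoField": 1}, new_config))  # ['Unknown field: typoField']
```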

corneliusroemer added the `backend` (related to the loculus backend component) and `performance` labels Sep 16, 2024

corneliusroemer changed the title ~~Don't put fields with null values in processed data output~~ Don't put fields with null values in sequence_entires_preprocessed_data.processed_data Sep 16, 2024

corneliusroemer changed the title ~~Don't put fields with null values in sequence_entires_preprocessed_data.processed_data~~ Don't put fields with null values in sequence_entries_preprocessed_data.processed_data Sep 16, 2024

anna-parker mentioned this issue Sep 18, 2024

[don't merge - this breaks production as is] feat(docs): Update metadata; include latitude, longitude allow download of all fields and descriptions in a csv file pathoplexus/pathoplexus#139

Draft
