Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't put fields with null values in sequence_entries_preprocessed_data.processed_data #2793

Open
corneliusroemer opened this issue Sep 15, 2024 · 7 comments
Labels
backend related to the loculus backend component performance

Comments

@corneliusroemer
Copy link
Contributor

corneliusroemer commented Sep 15, 2024

It seems we store a whole lot of nothing (=fields with null values) in the sequence_entries_preprocessed_data.processed_data column

We should switch to treating missing fields as having null and not saving them in the table. This would pretty much halve the size of that table.

Example processed data row:

> LOC_000CV63 | 1 | 1 | {"metadata": {"length": 17, "authors": "J.A. Mcswiggen, L. Blatt", "hostAge": null, "lineage": null, "cellLine": null, "hostRole": null, "cultureId": null, "isLabHost": null, "totalSnps": null, "geoLocCity": null, "geoLocSite": null, "hostAgeBin": null, "hostGender": null, "sampleType": null, "stopCodons": null, "bodyProduct": null, "displayName": "LOC_000CV63.1", "foodProduct": null, "frameShifts": null, "hostDisease": null, "hostTaxonId": null, "ampliconSize": null, "completeness": null, "geoLocAdmin1": null, "geoLocAdmin2": null, "insdcVersion": 1, "ncbiSourceDb": "GenBank", "exposureEvent": null, "geoLocCountry": null, "ncbiVirusName": "West Nile virus", "passageMethod": null, "passageNumber": null, "travelHistory": null, "anatomicalPart": null, "geoLocLatitude": null, "hostNameCommon": null, "ncbiUpdateDate": "2009-05-23", "ncbiVirusTaxId": 11082, "sequencingDate": null, "versionComment": null, "dehostingMethod": null, "depthOfCoverage": null, "exposureDetails": null, "exposureSetting": null, "geoLocLongitude": null, "hostHealthState": null, "ncbiReleaseDate": "2009-05-23", "sraRunAccession": null, "totalStopCodons": null, "collectionDevice": null, "collectionMethod": null, "signsAndSymptoms": null, "totalDeletedNucs": null, "totalFrameShifts": null, "totalUnknownNucs": null, "breadthOfCoverage": null, "environmentalSite": null, "hostHealthOutcome": null, "hostOriginCountry": null, "purposeOfSampling": null, "totalInsertedNucs": null, "anatomicalMaterial": null, "authorAffiliations": null, "biosampleAccession": null, "hostNameScientific": null, "insdcAccessionBase": "HA068283", "insdcAccessionFull": "HA068283.1", "sampleReceivedDate": null, "sequencingProtocol": null, "specimenProcessing": null, "totalAmbiguousNucs": null, "bioprojectAccession": null, "presamplingActivity": null, "purposeOfSequencing": null, "sequencingAssayType": null, "ncbiSubmitterCountry": null, "qualityControlIssues": null, "sampleCollectionDate": null, "sequencingInstrument": null, "environmentalMaterial": null, "foodProductProperties": null, "hostVaccinationStatus": null, "qualityControlDetails": null, "sequencedByContactName": null, "ampliconPcrPrimerScheme": null, "sequencedByContactEmail": null, "sequencedByOrganization": null, "diagnosticTargetGeneName": null, "diagnosticTargetPresence": null, "previousInfectionDisease": null, "qualityControlMethodName": null, "referenceGenomeAccession": null, "diagnosticMeasurementUnit": null, "previousInfectionOrganism": null, "specimenCollectorSampleId": null, "specimenProcessingDetails": null, "diagnosticMeasurementValue": null, "diagnosticMeasurementMethod": null, "qualityControlDetermination": null, "qualityControlMethodVersion": null, "experimentalSpecimenRoleType": null, "consensusSequenceSoftwareName": null, "rawSequenceDataProcessingMethod": null, "consensusSequenceSoftwareVersion": null}, "aminoAcidInsertions": {}, "nucleotideInsertions": {"main": []}, "alignedAminoAcidSequences": {"2K": null, "NS1": null, "NS3": null, "NS5": null, "env": null, "prM": null, "NS2A": null, "NS2B": null, "NS4A": null, "NS4B": null, "capsid": null}, "alignedNucleotideSequences": {"main": null}, "unalignedNucleotideSequences": {"main": {"compressedSequence": "KLUv/SARPQAAAAEAL5ghEA=="}}} | [{"source": [{"name": "main", "type": "NucleotideSequence"}], "message": "Nucleotide sequence failed to align"}] | [] | HAS_ERRORS | 2024-09-15 16:34:20.523262 | 2024-09-15 16:34:20.703557

@corneliusroemer corneliusroemer added backend related to the loculus backend component performance labels Sep 16, 2024
@fengelniederhammer
Copy link
Contributor

fengelniederhammer commented Sep 16, 2024

On the other hand, we have a lot of metadata fields that only have null values. Those are all fields that have only null values on Pathoplexus:

CCHF

Fields with only null values:
dataUseTermsRestrictedUntil
ampliconPcrPrimerScheme
ampliconSize
anatomicalMaterial
anatomicalPart
bodyProduct
breadthOfCoverage
cellLine
collectionDevice
collectionMethod
consensusSequenceSoftwareName
consensusSequenceSoftwareVersion
cultureId
dehostingMethod
depthOfCoverage
diagnosticMeasurementMethod
diagnosticMeasurementUnit
diagnosticMeasurementValue
diagnosticTargetGeneName
diagnosticTargetPresence
environmentalMaterial
environmentalSite
experimentalSpecimenRoleType
exposureDetails
exposureEvent
exposureSetting
foodProduct
foodProductProperties
geoLocCity
geoLocSite
hostAge
hostAgeBin
hostDisease
hostGender
hostHealthOutcome
hostHealthState
hostNameCommon
hostOriginCountry
hostRole
hostVaccinationStatus
ncbiSubmitterCountry
ncbiUpdateDate_L
ncbiUpdateDate_M
ncbiUpdateDate_S
passageMethod
passageNumber
presamplingActivity
previousInfectionDisease
previousInfectionOrganism
purposeOfSampling
purposeOfSequencing
qualityControlDetails
qualityControlDetermination
qualityControlIssues
qualityControlMethodName
qualityControlMethodVersion
rawSequenceDataProcessingMethod
referenceGenomeAccession
sampleReceivedDate
sampleType
sequencedByContactEmail
sequencedByContactName
sequencedByOrganization
sequencingAssayType
sequencingDate
sequencingInstrument
sequencingProtocol
signsAndSymptoms
specimenProcessing
specimenProcessingDetails
stopCodons
totalStopCodons
travelHistory
versionComment

Ebola Sudan

Fields with only null values:
dataUseTermsRestrictedUntil
ampliconPcrPrimerScheme
ampliconSize
anatomicalMaterial
anatomicalPart
bodyProduct
breadthOfCoverage
cellLine
collectionDevice
collectionMethod
consensusSequenceSoftwareName
consensusSequenceSoftwareVersion
cultureId
dehostingMethod
depthOfCoverage
diagnosticMeasurementMethod
diagnosticMeasurementUnit
diagnosticMeasurementValue
diagnosticTargetGeneName
diagnosticTargetPresence
environmentalMaterial
environmentalSite
experimentalSpecimenRoleType
exposureDetails
exposureEvent
exposureSetting
foodProduct
foodProductProperties
geoLocCity
geoLocSite
hostAge
hostAgeBin
hostDisease
hostGender
hostHealthOutcome
hostHealthState
hostNameCommon
hostOriginCountry
hostRole
hostVaccinationStatus
ncbiSubmitterCountry
passageMethod
passageNumber
presamplingActivity
previousInfectionDisease
previousInfectionOrganism
purposeOfSampling
purposeOfSequencing
qualityControlDetails
qualityControlDetermination
qualityControlIssues
qualityControlMethodName
qualityControlMethodVersion
rawSequenceDataProcessingMethod
referenceGenomeAccession
sampleReceivedDate
sampleType
sequencedByContactEmail
sequencedByContactName
sequencedByOrganization
sequencingAssayType
sequencingDate
sequencingInstrument
sequencingProtocol
signsAndSymptoms
specimenProcessing
specimenProcessingDetails
travelHistory
versionComment

Ebola Zaire

Fields with only null values:
dataUseTermsRestrictedUntil
ampliconPcrPrimerScheme
ampliconSize
anatomicalMaterial
anatomicalPart
bodyProduct
breadthOfCoverage
cellLine
collectionDevice
collectionMethod
consensusSequenceSoftwareName
consensusSequenceSoftwareVersion
cultureId
dehostingMethod
depthOfCoverage
diagnosticMeasurementMethod
diagnosticMeasurementUnit
diagnosticMeasurementValue
diagnosticTargetGeneName
diagnosticTargetPresence
environmentalMaterial
environmentalSite
experimentalSpecimenRoleType
exposureDetails
exposureEvent
exposureSetting
foodProduct
foodProductProperties
geoLocCity
geoLocSite
hostAge
hostAgeBin
hostDisease
hostGender
hostHealthOutcome
hostHealthState
hostNameCommon
hostOriginCountry
hostRole
hostVaccinationStatus
ncbiSubmitterCountry
passageMethod
passageNumber
presamplingActivity
previousInfectionDisease
previousInfectionOrganism
purposeOfSampling
purposeOfSequencing
qualityControlDetails
qualityControlDetermination
qualityControlIssues
qualityControlMethodName
qualityControlMethodVersion
rawSequenceDataProcessingMethod
referenceGenomeAccession
sampleReceivedDate
sampleType
sequencedByContactEmail
sequencedByContactName
sequencedByOrganization
sequencingAssayType
sequencingDate
sequencingInstrument
sequencingProtocol
signsAndSymptoms
specimenProcessing
specimenProcessingDetails
travelHistory
versionComment

West Nile

Fields with only null values:
dataUseTermsRestrictedUntil
ampliconPcrPrimerScheme
ampliconSize
anatomicalMaterial
anatomicalPart
bodyProduct
breadthOfCoverage
cellLine
collectionDevice
collectionMethod
consensusSequenceSoftwareName
consensusSequenceSoftwareVersion
cultureId
dehostingMethod
depthOfCoverage
diagnosticMeasurementMethod
diagnosticMeasurementUnit
diagnosticMeasurementValue
diagnosticTargetGeneName
diagnosticTargetPresence
environmentalMaterial
environmentalSite
experimentalSpecimenRoleType
exposureDetails
exposureEvent
exposureSetting
foodProduct
foodProductProperties
geoLocCity
geoLocSite
hostAge
hostAgeBin
hostDisease
hostGender
hostHealthOutcome
hostHealthState
hostNameCommon
hostOriginCountry
hostRole
hostVaccinationStatus
ncbiSubmitterCountry
passageMethod
passageNumber
presamplingActivity
previousInfectionDisease
previousInfectionOrganism
purposeOfSampling
purposeOfSequencing
qualityControlDetails
qualityControlDetermination
qualityControlIssues
qualityControlMethodName
qualityControlMethodVersion
rawSequenceDataProcessingMethod
referenceGenomeAccession
sampleReceivedDate
sampleType
sequencedByContactEmail
sequencedByContactName
sequencedByOrganization
sequencingAssayType
sequencingDate
sequencingInstrument
sequencingProtocol
signsAndSymptoms
specimenProcessing
specimenProcessingDetails
travelHistory
versionComment

Maybe we could get rid of some of there fields?

@fengelniederhammer
Copy link
Contributor

Script to scan for null-only LAPIS fields:

#!/usr/bin/bash

set -euo pipefail

lapis_url="https://lapis.pathoplexus.org/west-nile/"

# The the list of fields from /sample/databaseConfig
fields=$(curl -s -k -X GET "${lapis_url}sample/databaseConfig" | jq -r '.schema.metadata[].name')

echo "Fields with only null values:"

# For every field, call /sample/aggregated with {"fields": [field]}
for field in $fields; do
    response=$(curl -s -k -X POST "${lapis_url}sample/aggregated" -H "Content-Type: application/json" -d "{\"fields\": [\"$field\"]}")

    # If response.data has length 1 and response.data[0].[field] is null, print the field
    if [[ $(echo "$response" | jq -r '.data | length') -eq 1 ]]; then
        if [[ $(echo "$response" | jq -r ".data[0].$field") == "null" ]]; then
            echo "$field"
        fi
    fi
done

@chaoran-chen
Copy link
Member

@fengelniederhammer, we introduced the fields intentionally with the goal / hope that they will be populated more in the future as these are all relevant information.

@corneliusroemer, sounds good, especially internally. When sending out the data (e.g., in get-released-data), would you leave the fields empty or populate with nulls?

@fengelniederhammer
Copy link
Contributor

SILO requires that all configured metadata fields are present in the imported NDJSON. Also I have a slight preference that all configured data should be contained in the released data (even if it's just null).

@chaoran-chen
Copy link
Member

I also have the same slight preference! As NDJSON doesn't include schema information, always seeing all fields is useful so that you can easily see which fields exist (although the type is still not clear before seeing the first value).

@corneliusroemer
Copy link
Contributor Author

Right, let me be more precise: My proposal is to just change the internal representation in the database, not what is sent to clients. It doesn't matter if we send these null fields in get-released-data as they compress so well with zstd that it doesn't matter for transfer, and clients can throw them out immediately as well.

The only change I'm suggesting is to

  • Not store nulls when backend writes preprocessed data to sequence_entries_preprocessed_data table

@corneliusroemer corneliusroemer changed the title Don't put fields with null values in processed data output Don't put fields with null values in sequence_entires_preprocessed_data.processed_data Sep 16, 2024
@corneliusroemer corneliusroemer changed the title Don't put fields with null values in sequence_entires_preprocessed_data.processed_data Don't put fields with null values in sequence_entries_preprocessed_data.processed_data Sep 16, 2024
@corneliusroemer
Copy link
Contributor Author

This also has the benefit that we can add new fields without requiring new preprocessing pipeline version. So we should do it sooner rather than later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend related to the loculus backend component performance
Projects
None yet
Development

No branches or pull requests

3 participants