-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't put fields with null values in sequence_entries_preprocessed_data.processed_data
#2793
Comments
On the other hand, we have a lot of metadata fields that only have CCHFFields with only null values: Ebola SudanFields with only null values: Ebola ZaireFields with only null values: West NileFields with only null values: Maybe we could get rid of some of there fields? |
Script to scan for #!/usr/bin/bash
set -euo pipefail
lapis_url="https://lapis.pathoplexus.org/west-nile/"
# The the list of fields from /sample/databaseConfig
fields=$(curl -s -k -X GET "${lapis_url}sample/databaseConfig" | jq -r '.schema.metadata[].name')
echo "Fields with only null values:"
# For every field, call /sample/aggregated with {"fields": [field]}
for field in $fields; do
response=$(curl -s -k -X POST "${lapis_url}sample/aggregated" -H "Content-Type: application/json" -d "{\"fields\": [\"$field\"]}")
# If response.data has length 1 and response.data[0].[field] is null, print the field
if [[ $(echo "$response" | jq -r '.data | length') -eq 1 ]]; then
if [[ $(echo "$response" | jq -r ".data[0].$field") == "null" ]]; then
echo "$field"
fi
fi
done |
@fengelniederhammer, we introduced the fields intentionally with the goal / hope that they will be populated more in the future as these are all relevant information. @corneliusroemer, sounds good, especially internally. When sending out the data (e.g., in get-released-data), would you leave the fields empty or populate with nulls? |
SILO requires that all configured metadata fields are present in the imported NDJSON. Also I have a slight preference that all configured data should be contained in the released data (even if it's just null). |
I also have the same slight preference! As NDJSON doesn't include schema information, always seeing all fields is useful so that you can easily see which fields exist (although the type is still not clear before seeing the first value). |
Right, let me be more precise: My proposal is to just change the internal representation in the database, not what is sent to clients. It doesn't matter if we send these null fields in The only change I'm suggesting is to
|
sequence_entires_preprocessed_data.processed_data
sequence_entires_preprocessed_data.processed_data
sequence_entries_preprocessed_data.processed_data
This also has the benefit that we can add new fields without requiring new preprocessing pipeline version. So we should do it sooner rather than later. |
It seems we store a whole lot of nothing (=fields with null values) in the
sequence_entries_preprocessed_data.processed_data
columnWe should switch to treating missing fields as having null and not saving them in the table. This would pretty much halve the size of that table.
Example processed data row:
> LOC_000CV63 | 1 | 1 | {"metadata": {"length": 17, "authors": "J.A. Mcswiggen, L. Blatt", "hostAge": null, "lineage": null, "cellLine": null, "hostRole": null, "cultureId": null, "isLabHost": null, "totalSnps": null, "geoLocCity": null, "geoLocSite": null, "hostAgeBin": null, "hostGender": null, "sampleType": null, "stopCodons": null, "bodyProduct": null, "displayName": "LOC_000CV63.1", "foodProduct": null, "frameShifts": null, "hostDisease": null, "hostTaxonId": null, "ampliconSize": null, "completeness": null, "geoLocAdmin1": null, "geoLocAdmin2": null, "insdcVersion": 1, "ncbiSourceDb": "GenBank", "exposureEvent": null, "geoLocCountry": null, "ncbiVirusName": "West Nile virus", "passageMethod": null, "passageNumber": null, "travelHistory": null, "anatomicalPart": null, "geoLocLatitude": null, "hostNameCommon": null, "ncbiUpdateDate": "2009-05-23", "ncbiVirusTaxId": 11082, "sequencingDate": null, "versionComment": null, "dehostingMethod": null, "depthOfCoverage": null, "exposureDetails": null, "exposureSetting": null, "geoLocLongitude": null, "hostHealthState": null, "ncbiReleaseDate": "2009-05-23", "sraRunAccession": null, "totalStopCodons": null, "collectionDevice": null, "collectionMethod": null, "signsAndSymptoms": null, "totalDeletedNucs": null, "totalFrameShifts": null, "totalUnknownNucs": null, "breadthOfCoverage": null, "environmentalSite": null, "hostHealthOutcome": null, "hostOriginCountry": null, "purposeOfSampling": null, "totalInsertedNucs": null, "anatomicalMaterial": null, "authorAffiliations": null, "biosampleAccession": null, "hostNameScientific": null, "insdcAccessionBase": "HA068283", "insdcAccessionFull": "HA068283.1", "sampleReceivedDate": null, "sequencingProtocol": null, "specimenProcessing": null, "totalAmbiguousNucs": null, "bioprojectAccession": null, "presamplingActivity": null, "purposeOfSequencing": null, "sequencingAssayType": null, "ncbiSubmitterCountry": null, "qualityControlIssues": null, "sampleCollectionDate": null, "sequencingInstrument": null, "environmentalMaterial": null, "foodProductProperties": null, "hostVaccinationStatus": null, "qualityControlDetails": null, "sequencedByContactName": null, "ampliconPcrPrimerScheme": null, "sequencedByContactEmail": null, "sequencedByOrganization": null, "diagnosticTargetGeneName": null, "diagnosticTargetPresence": null, "previousInfectionDisease": null, "qualityControlMethodName": null, "referenceGenomeAccession": null, "diagnosticMeasurementUnit": null, "previousInfectionOrganism": null, "specimenCollectorSampleId": null, "specimenProcessingDetails": null, "diagnosticMeasurementValue": null, "diagnosticMeasurementMethod": null, "qualityControlDetermination": null, "qualityControlMethodVersion": null, "experimentalSpecimenRoleType": null, "consensusSequenceSoftwareName": null, "rawSequenceDataProcessingMethod": null, "consensusSequenceSoftwareVersion": null}, "aminoAcidInsertions": {}, "nucleotideInsertions": {"main": []}, "alignedAminoAcidSequences": {"2K": null, "NS1": null, "NS3": null, "NS5": null, "env": null, "prM": null, "NS2A": null, "NS2B": null, "NS4A": null, "NS4B": null, "capsid": null}, "alignedNucleotideSequences": {"main": null}, "unalignedNucleotideSequences": {"main": {"compressedSequence": "KLUv/SARPQAAAAEAL5ghEA=="}}} | [{"source": [{"name": "main", "type": "NucleotideSequence"}], "message": "Nucleotide sequence failed to align"}] | [] | HAS_ERRORS | 2024-09-15 16:34:20.523262 | 2024-09-15 16:34:20.703557
The text was updated successfully, but these errors were encountered: