
Spike generate popular tasks using BigQuery #3761

Draft · wants to merge 6 commits into main
Conversation

@beccapearce (Contributor) commented Sep 6, 2024

Trello card: https://trello.com/c/AGRVqS4S

This spike sets up a service that connects to BigQuery.

To run this locally, you will need to set up some config following the steps in the dev docs.

It collects the data from BigQuery and outputs it on the benefits and business pages.

There is an initial implementation of caching and backup caching. This needs to be amended so it will run locally.

The links are currently displayed as slugs; this will need to be changed to show titles.

Still need to add a commit for the Sentry logging.

[Before and after screenshots of the browse pages]

The local env file is used only for local testing; this is to ensure secrets
are not accidentally pushed.
- Add the BigQuery gem
- Create a BigQuery service so the app can talk to BigQuery (a hedged
sketch of such a service appears after this commit list)

NB in order to use this you must add the credentials to
config/local_env.yml. There are instructions on how to do this in the
[dev docs](https://docs.publishing.service.gov.uk/repos/content-data-api/google_analytics_setup.html)
- Added a basic SQL query to retrieve some initial test data

NB this query only currently works for a select time period and only on
the benefits and business pages (shouldn't be too hard to change)
- Updated the view to render the results fetched from BigQuery
- Simple unordered list displaying the search term as a link
- Just making sure that the data can be passed from BigQuery to the view

NB Still need to add the title into the link rendering
- Cache the expensive process of retrieving popular tasks from BigQuery
- Save this cache with a cache key that has the date and the browse page
name.

Sneaky change added in here that I'll move to a different commit:
- Change the BigQuery data retrieval to only collect data for one browse
page at a time.
Improve data availability by having a backup cache if the latest is not
available

- Added a backup cache mechanism to ensure data is available even if the
latest cache is expired or unavailable.
- Popular tasks data is now stored in both a latest cache (24-hour
expiration) and a backup cache (7-day expiration).
- Fallback to the backup cache when the latest cache is missing, ensuring
users always see data even if fresh data retrieval fails.
- Updated methods to handle cache fallback logic, improving robustness
and reducing the likelihood of empty data responses (see the cache
sketch below).
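
For context, here is a minimal sketch of what the BigQuery service might look like, assuming the google-cloud-bigquery gem. The class name `FetchData`, the env var names, and the method layout are illustrative assumptions, not code from this branch; the diff only shows a `@fetch_data.query(query).all` call.

```ruby
require "google/cloud/bigquery"

# Hypothetical wrapper so the app can talk to BigQuery. Only the
# `query(...).all` shape is taken from the diff; the rest is assumed.
class FetchData
  def initialize
    @bigquery = Google::Cloud::Bigquery.new(
      project_id: ENV["BIGQUERY_PROJECT_ID"],  # assumed config key
      credentials: ENV["BIGQUERY_CREDENTIALS"] # assumed config key
    )
  end

  # Runs a SQL query and returns a Google::Cloud::Bigquery::Data object,
  # which supports .all to iterate over every page of results.
  def query(sql)
    @bigquery.query(sql)
  end
end
```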
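
Similarly, a hedged sketch of the latest-plus-backup caching the commits describe, assuming Rails.cache; the key format (date plus browse page name) follows the commit message, but the method names are invented.

```ruby
class PopularTasks
  CACHE_EXPIRATION = 24.hours
  BACKUP_CACHE_EXPIRATION = 7.days

  # Hypothetical: serve from the latest cache when possible; on a fresh
  # fetch, write both caches; fall back to the backup on failure.
  def popular_tasks_for(browse_page)
    latest_key = "popular_tasks/latest/#{Date.current}/#{browse_page}"
    backup_key = "popular_tasks/backup/#{browse_page}"

    cached = Rails.cache.read(latest_key)
    return cached if cached

    begin
      data = fetch_from_bigquery(browse_page) # assumed private method
      Rails.cache.write(latest_key, data, expires_in: CACHE_EXPIRATION)
      # Write the backup at the same time as the fresh fetch, with a longer TTL.
      Rails.cache.write(backup_key, data, expires_in: BACKUP_CACHE_EXPIRATION)
      data
    rescue StandardError
      # Fall back to the backup if fresh retrieval fails.
      Rails.cache.read(backup_key)
    end
  end
end
```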
WHERE Rank <6
SQL

data = @fetch_data.query(query).all
Contributor:
How many results does this query fetch?

beccapearce (Contributor, Author):

This is the top 5; we can increase or lower it by changing the rank threshold here:
`WHERE Rank < 6`
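
For context, the query presumably computes the Rank column with a window function and keeps the top five rows; a hypothetical reconstruction (the table and column names here are invented):

```ruby
# Hypothetical reconstruction; the real table and column names differ.
query = <<~SQL
  SELECT search_term, page_location
  FROM (
    SELECT search_term, page_location,
           RANK() OVER (ORDER BY event_count DESC) AS Rank
    FROM `my-project.analytics.page_events`
  )
  WHERE Rank < 6 -- top 5; raise or lower the threshold to change the count
SQL
```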

@@ -1,5 +1,6 @@
class PopularTasks
CACHE_EXPIRATION = 24.hours # Set the cache expiration time
BACKUP_CACHE_EXPIRATION = 7.days # Backup cache can have a longer expiration
Contributor:
I think we need to think a bit more about how this would work.

If the BigQuery data was unavailable for more than 7 days, then what happens?

I can think of other ways to do it, but this feels like a problem that must have been solved many times before, i.e. only expire the cache if fresh data is available to fill it (a hedged sketch of that approach follows).
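
One hedged sketch of that approach, assuming Rails.cache and invented method names: store the entry with no expiry, and only overwrite it when a fresh fetch succeeds, so stale data survives an outage of any length.

```ruby
# Sketch: a never-expiring cache entry, refreshed only on a successful fetch.
def refresh_popular_tasks(browse_page)
  data = fetch_from_bigquery(browse_page) # assumed to raise on failure
  Rails.cache.write("popular_tasks/#{browse_page}", data) # no expires_in
rescue StandardError
  # Leave the existing cached value in place; it stays valid until the
  # next successful refresh, however long the outage lasts.
end

def popular_tasks_for(browse_page)
  Rails.cache.read("popular_tasks/#{browse_page}")
end
```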

Contributor:
I've caught up now... the cache will expire regardless of whether or not the API responds, so I understand the need for a backup. And I like the idea of writing to the backup at the same time as you fetch the fresh data.

@hannako (Contributor) commented Sep 23, 2024

Some things to consider for next steps (which I've started thinking about here):

  • Error handling: what happens if the API returns an error?
  • Some rake tasks so we can manually expire/refresh the cache if needed (a hedged sketch follows)
  • Where are we going to add logging, and what kind?
  • Is memcached the best option?

Also, the codebase has moved on quite a bit since this spike was started (in terms of the logic in the browse helper), hence starting the fresh spike 2 branch.
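
A hedged sketch of what such rake tasks might look like; the task names and the cache key format are assumptions carried over from the cache sketch above.

```ruby
# lib/tasks/popular_tasks.rake (hypothetical)
namespace :popular_tasks do
  desc "Manually expire the cached popular tasks for a browse page"
  task :expire, [:browse_page] => :environment do |_t, args|
    Rails.cache.delete("popular_tasks/latest/#{Date.current}/#{args[:browse_page]}")
  end

  desc "Manually refresh the cached popular tasks for a browse page"
  task :refresh, [:browse_page] => :environment do |_t, args|
    PopularTasks.new.popular_tasks_for(args[:browse_page])
  end
end
```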
