
Spike generate popular tasks using BigQuery #3761

Draft · wants to merge 6 commits into main
Conversation

@beccapearce (Contributor) commented Sep 6, 2024

Trello card: https://trello.com/c/AGRVqS4S

This spike sets up a service that connects to BigQuery.

To run this locally, you will need to set up some config following the steps in the dev docs.

It collects the data from BigQuery and outputs it on the benefits and business pages.

There is an initial implementation of caching and backup caching. This needs to be amended so it will run locally.

The links are currently displayed as slugs; this will need to be changed to show titles.

Still need to add a commit for the Sentry logging.

[Before and after screenshots of the browse pages]

The local env file is used only for local testing; this is to ensure secrets
are not accidentally pushed.
- Add the BigQuery gem
- Create a BigQuery service so the app can talk to BigQuery (a hedged
sketch of such a service appears after this commit list)

NB in order to use this you must add the credentials to
config/local_env.yml. There are instructions on how to do this in the
[dev docs](https://docs.publishing.service.gov.uk/repos/content-data-api/google_analytics_setup.html)
- Added a basic SQL query to retrieve some initial test data

NB this query only currently works for a select time period and only on
the benefits and business pages (shouldn't be too hard to change)
- Updated the view to render the results fetched from BigQuery
- Simple unordered list displaying the search term as a link
- Just making sure that the data can be passed from BigQuery to the view

NB Still need to add the title into the link rendering
- Cache the expensive process of retrieving popular tasks from BigQuery
- Save this cache with a cache key that has the date and the browse page
name.

Sneaky change added in here that I'll move to a different commit:
- Change the BigQuery data retrieval to only collect data for one browse
page at a time.
Improve data availability by having a backup cache if the latest is not
available

- Added a backup cache mechanism to ensure data is available even if the
latest cache is expired or unavailable.
- Popular tasks data is now stored in both a latest cache (24-hour
expiration) and a backup cache (7-day expiration).
- Fallback to the backup cache when the latest cache is missing, ensuring
users always see data even if fresh data retrieval fails.
- Updated methods to handle cache fallback logic, improving robustness
and reducing the likelihood of empty data responses (see the cache
sketch below).
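
For context, here is a minimal sketch of what the BigQuery service might look like, assuming the google-cloud-bigquery gem. The class name `FetchData`, the env var names, and the method layout are illustrative assumptions, not code from this branch; the diff only shows a `@fetch_data.query(query).all` call.

```ruby
require "google/cloud/bigquery"

# Hypothetical wrapper so the app can talk to BigQuery. Only the
# `query(...).all` shape is taken from the diff; the rest is assumed.
class FetchData
  def initialize
    @bigquery = Google::Cloud::Bigquery.new(
      project_id: ENV["BIGQUERY_PROJECT_ID"],  # assumed config key
      credentials: ENV["BIGQUERY_CREDENTIALS"] # assumed config key
    )
  end

  # Runs a SQL query and returns a Google::Cloud::Bigquery::Data object,
  # which supports .all to iterate over every page of results.
  def query(sql)
    @bigquery.query(sql)
  end
end
```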
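
Similarly, a hedged sketch of the latest-plus-backup caching the commits describe, assuming Rails.cache; the key format (date plus browse page name) follows the commit message, but the method names are invented.

```ruby
class PopularTasks
  CACHE_EXPIRATION = 24.hours
  BACKUP_CACHE_EXPIRATION = 7.days

  # Hypothetical: serve from the latest cache when possible; on a fresh
  # fetch, write both caches; fall back to the backup on failure.
  def popular_tasks_for(browse_page)
    latest_key = "popular_tasks/latest/#{Date.current}/#{browse_page}"
    backup_key = "popular_tasks/backup/#{browse_page}"

    cached = Rails.cache.read(latest_key)
    return cached if cached

    begin
      data = fetch_from_bigquery(browse_page) # assumed private method
      Rails.cache.write(latest_key, data, expires_in: CACHE_EXPIRATION)
      # Write the backup at the same time as the fresh fetch, with a longer TTL.
      Rails.cache.write(backup_key, data, expires_in: BACKUP_CACHE_EXPIRATION)
      data
    rescue StandardError
      # Fall back to the backup if fresh retrieval fails.
      Rails.cache.read(backup_key)
    end
  end
end
```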
WHERE Rank <6
SQL

data = @fetch_data.query(query).all
Contributor:
How many results does this query fetch?

beccapearce (Contributor, Author):

This is the top 5; we can increase or lower it by changing the rank threshold here:
`WHERE Rank < 6`
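
For context, the query presumably computes the Rank column with a window function and keeps the top five rows; a hypothetical reconstruction (the table and column names here are invented):

```ruby
# Hypothetical reconstruction; the real table and column names differ.
query = <<~SQL
  SELECT search_term, page_location
  FROM (
    SELECT search_term, page_location,
           RANK() OVER (ORDER BY event_count DESC) AS Rank
    FROM `my-project.analytics.page_events`
  )
  WHERE Rank < 6 -- top 5; raise or lower the threshold to change the count
SQL
```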

@@ -1,5 +1,6 @@
class PopularTasks
CACHE_EXPIRATION = 24.hours # Set the cache expiration time
BACKUP_CACHE_EXPIRATION = 7.days # Backup cache can have a longer expiration
Contributor:
I think we need to think a bit more about how this would work.

If the BigQuery data was unavailable for more than 7 days, then what happens?

I can think of other ways to do it, but this feels like a problem that must have been solved many times before, i.e. only expire the cache if fresh data is available to fill it (a hedged sketch of that approach follows).
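
One hedged sketch of that approach, assuming Rails.cache and invented method names: store the entry with no expiry, and only overwrite it when a fresh fetch succeeds, so stale data survives an outage of any length.

```ruby
# Sketch: a never-expiring cache entry, refreshed only on a successful fetch.
def refresh_popular_tasks(browse_page)
  data = fetch_from_bigquery(browse_page) # assumed to raise on failure
  Rails.cache.write("popular_tasks/#{browse_page}", data) # no expires_in
rescue StandardError
  # Leave the existing cached value in place; it stays valid until the
  # next successful refresh, however long the outage lasts.
end

def popular_tasks_for(browse_page)
  Rails.cache.read("popular_tasks/#{browse_page}")
end
```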

Contributor:
I've caught up now... the cache will expire regardless of whether or not the API responds, so I understand the need for a backup. And I like the idea of writing to the backup at the same time as you fetch the fresh data.

@hannako (Contributor) commented Sep 23, 2024

Some things to consider for next steps (which I've started thinking about here):

  • Error handling: what happens if the API returns an error?
  • Some rake tasks so we can manually expire/refresh the cache if needed (a hedged sketch follows)
  • Where are we going to add logging, and what kind?
  • Is memcached the best option?

Also, the codebase has moved on quite a bit since this spike was started (in terms of the logic in the browse helper), hence starting the fresh spike 2 branch.
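
A hedged sketch of what such rake tasks might look like; the task names and the cache key format are assumptions carried over from the cache sketch above.

```ruby
# lib/tasks/popular_tasks.rake (hypothetical)
namespace :popular_tasks do
  desc "Manually expire the cached popular tasks for a browse page"
  task :expire, [:browse_page] => :environment do |_t, args|
    Rails.cache.delete("popular_tasks/latest/#{Date.current}/#{args[:browse_page]}")
  end

  desc "Manually refresh the cached popular tasks for a browse page"
  task :refresh, [:browse_page] => :environment do |_t, args|
    PopularTasks.new.popular_tasks_for(args[:browse_page])
  end
end
```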
