Cache consistency problem after partial cache clean-up with Apache httpd as upstream #248
Wow, this is a phenomenally detailed bug report! Thanks, that's super useful. What you're doing is basically what I would have done if I ever got around to adding size limits for the on-disk cache. So I wouldn't necessarily categorize this as something you shouldn't be doing. You just seem to have found an interesting bug in the current implementation. I'll have to play with this a bit more to trace through things and better understand exactly what is going on. Realistically, it's probably not something I'll get to super soon, but I'll take a look when I can.
We're seeing something similar here and I think our 30 day S3 cleanup is to blame. Check this out:
produces
Which creates 2 files:
Now we simulate an S3 delete on the original after 30 days, since this copy is never updated:
and request our resized copy again:
Uh oh - 0 bytes back. Matching logs from this request:
Now our original is a bit off
and the cache is poisoned. Moral of the story for now: don't clean up. I'm not clear on how to fix this yet.
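One way to preserve the invariant this cleanup violated is to evict derived copies together with their original. A minimal sketch, assuming the cache can be treated as a mapping from request URL to cached bytes and that the transformation variants are known; both are illustrative assumptions, not imageproxy's actual structure:

```python
def evict_with_variants(cache, original_url, variant_fragments):
    # Delete the original entry together with every known resized
    # variant (e.g. fragments like "100x0", "200x0"), so the cache
    # never holds an orphaned copy whose original is gone -- the
    # situation that leads to the poisoning described here.
    urls = [original_url] + [original_url + "#" + f for f in variant_fragments]
    for url in urls:
        cache.pop(url, None)  # tolerate variants that were never cached
```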
@mdkent Looks like Windows-Azure-Blob doesn't add the `Last-Modified` header to 304 responses either, and therefore confuses some part of imageproxy's caching logic - the same way Apache httpd did in our tests. Regarding work-arounds: as long as all resized copies are deleted together with every deleted original image, the cache should be safe from poisoning. Instead of not cleaning up at all, our current strategy is to completely delete everything as soon as the storage size exceeds a certain threshold.
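The threshold strategy described above can be sketched as follows. This is an illustrative stand-in, not imageproxy code; the size accounting and wipe-and-recreate approach are assumptions:

```python
import os
import shutil

def cache_size_bytes(cache_dir):
    # Total size of all files under the cache directory.
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def wipe_if_too_large(cache_dir, threshold_bytes):
    # Delete the whole cache at once instead of pruning individual
    # entries, so originals and their resized copies always disappear
    # together and no orphaned copy can poison the cache.
    if cache_size_bytes(cache_dir) > threshold_bytes:
        shutil.rmtree(cache_dir)
        os.makedirs(cache_dir)
        return True
    return False
```

The price is a cold cache after every wipe, but it sidesteps the consistency problem entirely.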
The combination of using Apache httpd as upstream and meddling with imageproxy's file system cache causes unexpected 304s to be returned where 200s would be necessary, breaking our image delivery.
I've created a gist containing a docker-compose file and shell script to reproduce the bug.
Scenario
Setup
We're using imageproxy with a file system cache in front of an Apache httpd delivering the unscaled image. We cache based on `Last-Modified`, not `ETag`.

Example image and cache (file system cache at `./cache`):

Original image: `https://example.org/example.jpg`, `./cache/3d/fc/3dfc20cf01c0d8c6011b343185c7367d` (`Last-Modified: Thu, 1 Jan 1970 00:00:00 GMT`)
Resized image: `https://example.org/example.jpg#100x0`, `./cache/42/de/42de33c0c5956a6c15069f6944050390`
Resized image: `https://example.org/example.jpg#200x0`, `./cache/38/f3/38f3ec98fd5c26b5828b85557e1c7346`
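The cache paths above follow a `./cache/xx/yy/<hash>` pattern. Here is a hypothetical reconstruction of that layout, assuming the key is an MD5 hex digest of the full request URL (including the `#WxH` fragment) sharded by its first two byte pairs; the real derivation inside imageproxy may differ:

```python
import hashlib
import os

def cache_path(cache_dir, url):
    # Hypothetical key scheme modeled on the observed layout, e.g.
    # ./cache/3d/fc/3dfc20cf01c0d8c6011b343185c7367d -- the first two
    # hex byte pairs of the digest become nested shard directories.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest[0:2], digest[2:4], digest)
```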
To prevent imageproxy's file system cache from growing without bounds, we use a cron job to delete cache files based on their last modification date (not necessarily an optimal strategy). This can delete the cache entry of the original image while retaining cache entries of scaled images that were created at a later point in time. (If you say that's something you shouldn't be doing, we can stop here and maybe add a note to the documentation somewhere...)
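A minimal sketch of such mtime-based pruning (illustrative only; the actual cron job is not shown in this report). It makes the hazard visible: each entry's mtime reflects when that entry was last written, so the original can expire while a later-created resized copy survives:

```python
import os
import time

def prune_older_than(cache_dir, max_age_seconds):
    # Delete cache files whose last modification time is older than
    # the cutoff. Entries are pruned independently of one another,
    # which is exactly what can strand a resized copy without its
    # original.
    cutoff = time.time() - max_age_seconds
    removed = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```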
So potentially, after a clean-up the example cache could look like this:
deleted: `https://example.org/example.jpg`, `./cache/3d/fc/3dfc20cf01c0d8c6011b343185c7367d`
deleted: `https://example.org/example.jpg#100x0`, `./cache/42/de/42de33c0c5956a6c15069f6944050390`
remaining: `https://example.org/example.jpg#200x0`, `./cache/38/f3/38f3ec98fd5c26b5828b85557e1c7346`

(because it was requested at a later point in time, and therefore the file in the cache was modified at a later point in time too)

The request
Now we perform a fresh request for the scaled image `https://example.org/example.jpg#200x0`. Enough time has passed since the last request for this same image + transformation that imageproxy needs to check whether upstream has an image with a newer `Last-Modified` header than the cached `Thu, 1 Jan 1970 00:00:00 GMT`. In this scenario, the upstream image is not modified, so Apache httpd responds with a `304 Not Modified`. Apache httpd does not include a `Last-Modified` header in this case; see this issue for current work regarding this behaviour.

Expected behaviour
Two plausible behaviours:

1. Imageproxy could verify that the upstream `Last-Modified` header hasn't changed and, if so, avoid scaling the original, as the cached entry of the scaled image would still be valid.
2. Imageproxy could perform a `GET` request for `https://example.org/example.jpg` with `If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT` and, when receiving a `304 Not Modified`, avoid re-caching the original image and simply return the cached entry of the scaled image.

Observed behaviour
Imageproxy chooses the second path, but with an unexpected twist: it performs a `GET` request for `https://example.org/example.jpg` with `If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT` and, on receiving a `304 Not Modified`, (unexpectedly) stores this response as the cache entry for the original image(!). So the file `./cache/3d/fc/3dfc20cf01c0d8c6011b343185c7367d` (for `https://example.org/example.jpg`) now contains the `304 Not Modified` response rather than any image data. It then (correctly) returns the cached entry of the scaled image.

As imageproxy attempts to transform the original image (despite having a cached entry of the scaled image), the logs contain the following error:
error transforming image https://example.org/example.jpg#200x0: image: unknown format
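The poisoning can be modeled with a toy cache; this is not imageproxy's actual code, just a minimal sketch contrasting the observed store-anything behaviour with a status-checked store:

```python
class ToyCache:
    def __init__(self):
        self.entries = {}

    def store_buggy(self, url, status, body):
        # Models the observed behaviour: the response body is stored
        # regardless of status, so a 304's empty body overwrites the
        # cached image bytes and poisons the entry.
        self.entries[url] = body

    def store_checked(self, url, status, body):
        # A safer store: only a full 200 response may replace the
        # cached body; a 304 leaves the stored bytes untouched.
        if status == 200:
            self.entries[url] = body
```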
From this point on, the cache is "poisoned": any request for a new transformation of this image causes the poisoned cache entry for the original image to be returned, i.e. the client receives a `304 Not Modified` even though its `GET` request contained no caching headers.

Work-around
If Apache httpd added a `Last-Modified` header to its `304 Not Modified` response, everything would work as expected: no broken cache entry for the original image would be created. This is why the problem is not visible with nginx as upstream, which does add the `Last-Modified` header. We've verified this by placing mitmproxy between imageproxy and Apache httpd / nginx and adding / removing the headers in the responses. The gist mentioned above contains the scenario where we modify the nginx response to provoke the bug.

Another obvious work-around is to not mess with the cache directory at all. (This is what we've done in our setup: if the cache directory gets too large, we simply delete everything at once.) Maybe messing with the cache directory should be considered undefined behaviour... unfortunately, my Go knowledge was too limited to understand which components / precise code paths are involved, so I didn't quite know what to check.
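For reference, HTTP caching semantics (RFC 7234, "Freshening Stored Responses upon Validation") say a cache receiving a `304` keeps the stored body and only folds the `304`'s headers into the stored ones. Under that rule, a `304` that omits `Last-Modified` is harmless, because the previously cached value simply survives. A minimal sketch of that header merge (an illustration of the RFC behaviour, not imageproxy's code):

```python
def refresh_headers(cached_headers, headers_304):
    # Keep the stored entry's headers and overlay whatever the 304
    # provides. A Last-Modified missing from the 304 (as with Apache
    # httpd here) leaves the cached value in place.
    merged = dict(cached_headers)
    merged.update(headers_304)
    return merged
```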