GC process may be skipping deletion of some unreferenced files. #1227
Comments
So far I have found two common "types" of files that are not being removed. There may be a few other files that don't fit these criteria. More checking is required to see whether there are recent occurrences for which tserver and master log files are available for the time of file creation, which would narrow the scope.
This isn't a bug. This is the intended design. The original garbage collector used to crawl all of HDFS, look for referenced files, and delete everything unreferenced. This was unsafe in the case of failure (some files were too aggressively deleted prior to a reference being added), and a big burden on the NameNode. The new garbage collector tries to only delete things that have been explicitly identified as a candidate for deletion, and are provably safe to delete. It errs on the side of leaving things behind, rather than deleting them. Of course, this means that system administrators need to watch their clouds, especially in the case of failures, for anything left behind unreferenced, but that is an intentional trade-off.
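To make the trade-off concrete, here is a minimal sketch of the two strategies as plain set operations. It is not the collector's actual code: the function names and file paths are hypothetical, and the real collector does considerably more checking before it deletes anything.

```python
# Illustrative only: hypothetical names and paths, not the real collector's code.

def crawl_and_sweep(all_files, referenced):
    """Old style: delete every file on disk that has no reference."""
    return set(all_files) - set(referenced)


def candidate_sweep(candidates, referenced):
    """New style: consider only files explicitly flagged as candidates."""
    return set(candidates) - set(referenced)


all_files = {"t1/A1.rf", "t1/A2.rf", "t1/A3.rf", "t1/A4.rf"}
referenced = {"t1/A1.rf"}   # still in use
candidates = {"t1/A2.rf"}   # explicitly flagged for deletion
# t1/A3.rf: orphaned by a failure and never flagged as a candidate
# t1/A4.rf: just written, its reference has not been recorded yet

old_deletes = crawl_and_sweep(all_files, referenced)
new_deletes = candidate_sweep(candidates, referenced)
left_behind = old_deletes - new_deletes
```

In this toy data the crawl would delete `t1/A4.rf` before its reference is added, which is the unsafe case described above. The candidate-based sweep avoids that, at the cost of leaving both `t1/A3.rf` and `t1/A4.rf` untouched; `t1/A3.rf` is the kind of file an administrator would have to find by other means.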
Understood - I didn't realize this was intended. I'm working on tooling that would help identify these "unreferenced" candidates. If there is interest, we can discuss the best way to make the utility available to the wider community.
Removed the bug label and made this an enhancement instead. It seems that the best course of action is to provide a mechanism for these files to be discovered and reported to administrators.
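One possible shape for such a report, as a rough sketch only: it assumes the file listing and the set of referenced paths have already been gathered by some other means (for example, a recursive listing of the tables directory and a scan of the metadata table). `FileInfo`, `report_unreferenced`, and the `min_age` parameter are hypothetical names, not part of any existing tool. It only reports and never deletes, and it skips recently modified files because a new file may not have its reference recorded yet.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical record for one file found in the listing."""
    path: str
    modified: datetime


def report_unreferenced(listing, referenced_paths, min_age, now=None):
    """Return sorted paths that are unreferenced and at least min_age old.

    Report only: nothing is deleted. Files newer than the cutoff are
    skipped, since their references may simply not be written yet.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - min_age
    referenced = set(referenced_paths)
    return sorted(
        f.path
        for f in listing
        if f.path not in referenced and f.modified <= cutoff
    )


# Example with made-up data and a fixed "now" so the result is repeatable.
now = datetime(2019, 6, 1, tzinfo=timezone.utc)
listing = [
    FileInfo("t1/A1.rf", now - timedelta(days=90)),   # referenced
    FileInfo("t1/A3.rf", now - timedelta(days=60)),   # old and unreferenced
    FileInfo("t1/A4.rf", now - timedelta(minutes=5)), # too new to judge
]
suspects = report_unreferenced(
    listing, {"t1/A1.rf"}, min_age=timedelta(days=7), now=now
)
```

The age threshold is a design choice rather than a requirement: it trades slower detection for fewer false reports during normal operation.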
No activity in over 3 years, so closing this out. It can be reopened if it is still an issue or if work on it is planned.
Also related: a similar issue, #1006.
I have found "old" files on a system that are not referenced in the metadata table and are not being removed by the default GC process.