GC process may be skipping deletion of some unreferenced files. #1227

Closed
EdColeman opened this issue Jun 20, 2019 · 6 comments
Labels
enhancement This issue describes a new feature, improvement, or optimization.

Comments

@EdColeman
Contributor

EdColeman commented Jun 20, 2019

I have found "old" files on a system that are not referenced in the metadata table and are not being removed by the default GC process.

@EdColeman EdColeman added the bug label Jun 20, 2019
@EdColeman
Contributor Author

EdColeman commented Jun 20, 2019

So far I have found two common "types" of files that are not being removed.
a) Files in a bulk import directory - these may be from a failed ingest, but the failure would have had to occur after the files were moved / renamed, since they have the expected directory structure and file naming convention. Because of the age of the files, I need to find more recent occurrences where information about the import is still available in the logs, to help determine what may have happened during the ingest.
b) Files that seem to be from a failed / interrupted compaction - these are the .tmp_rf files that are used during the compaction process. There were no running compactions, and running a compaction over the range of the file completed. That did not appear to be enough to trigger the file being gc'ed (but I may have been impatient with the check and need to recheck whether the file has since been removed). However, the file "should" have been removed long before I took any action.

There may be a few other files that don't fit these criteria. More checking is required to see if there are recent occurrences for which tserver and master log files are still available for the time of file creation, which would help narrow the scope.
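
The two "types" described above could be sorted out with something like the following. This is only a rough sketch, not part of any existing tool: the function name `classify_leftover`, the `b-` directory prefix for bulk import directories, and the example paths are assumptions made for illustration, and the `.tmp_rf` suffix is taken as written in this comment. Adjust both markers to match what is actually seen on the system.

```python
from pathlib import PurePosixPath


def classify_leftover(path, bulk_dir_prefix="b-", tmp_suffix=".tmp_rf"):
    """Label an unreferenced file path with one of the types seen so far.

    bulk_dir_prefix and tmp_suffix are assumptions for illustration;
    verify them against the real directory layout before relying on this.
    """
    p = PurePosixPath(path)
    # Type (b): temporary file left by a failed / interrupted compaction.
    if p.name.endswith(tmp_suffix):
        return "interrupted-compaction"
    # Type (a): file sitting in a bulk import directory.
    if p.parent.name.startswith(bulk_dir_prefix):
        return "bulk-import"
    # Anything else needs a closer look by hand.
    return "other"


# Hypothetical paths, for illustration only.
examples = [
    "/accumulo/tables/2/b-0001/I0001.rf",
    "/accumulo/tables/2/t-0001/C0002.tmp_rf",
    "/accumulo/tables/2/t-0001/F0003.rf",
]
for example in examples:
    print(example, classify_leftover(example))
```

Grouping the leftovers this way should make it easier to match each group against the relevant tserver or master logs.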

@ctubbsii
Member

ctubbsii commented Jun 20, 2019

This isn't a bug. This is the intended design. The original garbage collector used to crawl all of HDFS, look for referenced files, and delete everything unreferenced. This was unsafe in the case of failure (some files were too aggressively deleted prior to a reference being added), and a big burden on the name node.

The new garbage collector tries to only delete things that have been explicitly identified as a candidate for deletion, and are provably safe to delete. It errs on the side of leaving things behind, rather than deleting them. Of course, this means that system administrators need to watch their clouds, especially in the case of failures, for anything left behind unreferenced, but that is an intentional trade-off.
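
To make the trade-off concrete, here is a simplified sketch of the two strategies using plain sets of file paths. It is not the actual garbage collector code; the function names and the example paths are hypothetical, and the real process involves more checks than a set difference.

```python
def crawl_style_deletions(all_files, referenced):
    """Old approach: delete everything on disk that is not referenced."""
    return set(all_files) - set(referenced)


def candidate_style_deletions(candidates, referenced):
    """New approach: delete only explicit candidates that are not referenced."""
    return set(candidates) - set(referenced)


# Hypothetical state, for illustration only.
all_files = {"/t/a.rf", "/t/old.rf", "/t/new.rf", "/t/orphan.rf"}
referenced = {"/t/a.rf"}      # files the metadata currently points to
candidates = {"/t/old.rf"}    # files explicitly marked for deletion

# "/t/new.rf" was just written and its reference has not been added yet.
# The crawl deletes it along with the true garbage, which is the unsafe case.
print(sorted(crawl_style_deletions(all_files, referenced)))

# The candidate approach removes only "/t/old.rf". Both "/t/new.rf" and
# "/t/orphan.rf" stay on disk, so the orphan has to be found by an admin.
print(sorted(candidate_style_deletions(candidates, referenced)))
```

In this sketch the file that was never marked as a candidate is simply never considered, which is the "erring on the side of leaving things behind" behavior described above.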

@EdColeman
Contributor Author

Understood. I didn't realize this was intended. I'm working on tooling that would help identify these "unreferenced" candidates. We can discuss the best mechanism for making the utility available to the wider community, if that is desired.

@EdColeman EdColeman added the enhancement label and removed the bug label Aug 16, 2019
@EdColeman
Contributor Author

Removed the bug label and made this an enhancement instead. It seems that the best course of action will be to provide a mechanism for these files to be discovered and reported to administrators.
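
As a starting point for discussion, a reporting mechanism might look roughly like the sketch below. Everything here is an assumption for illustration: the function name `report_unreferenced`, the input shape (a listing of path and modification time pairs gathered by whatever means, plus the set of paths referenced in the metadata table), and the idea of a minimum age so that files still being written are not flagged. It only reports and never deletes anything.

```python
def report_unreferenced(files, referenced, now, min_age_seconds):
    """Return (path, age_seconds) for old files that nothing references.

    files:            iterable of (path, mtime_epoch_seconds) pairs
    referenced:       set of paths currently referenced in the metadata
    now:              current time in epoch seconds
    min_age_seconds:  ignore files younger than this, since they may
                      belong to an operation that is still in flight
    """
    report = []
    for path, mtime in files:
        age = now - mtime
        if path not in referenced and age >= min_age_seconds:
            report.append((path, age))
    # Oldest files first, since those are the least likely to be in use.
    return sorted(report, key=lambda item: item[1], reverse=True)


# Hypothetical listing, for illustration only.
listing = [("/t/a.rf", 100), ("/t/b.rf", 900), ("/t/c.rf", 50)]
for path, age in report_unreferenced(listing, {"/t/a.rf"}, now=1000, min_age_seconds=500):
    print(f"{path} unreferenced for at least {age}s")
```

Leaving the decision to delete with the administrator keeps this in line with the conservative design of the garbage collector.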

@cshannon
Contributor

cshannon commented Dec 3, 2022

No activity in over 3 years, so closing this out. It can be reopened if it is still an issue or if work on it is planned.

@cshannon cshannon closed this as not planned Dec 3, 2022
@EdColeman
Contributor Author

Also related, as a similar issue: #1006

3 participants