HDFS trash is just like the Recycle Bin in Windows operating systems. Its purpose is to prevent you from unintentionally deleting something. You can enable this feature by setting this property:

fs.trash.interval

to a number greater than 0 (the value is in minutes) in core-site.xml. After the trash feature is enabled, when you remove something from HDFS by using the rm command, files or directories are not wiped out immediately; instead, they are moved to a trash directory (/user/${username}/.Trash, for example).
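To see what a given value means in practice, a small shell sketch like the following may help. The function name describe_trash_interval is made up for this illustration. The hdfs getconf call reads the configuration visible to the local client, which is an assumption about where you run it; the NameNode may be configured differently.

```shell
# Interpret a value of fs.trash.interval (minutes).
# describe_trash_interval is a hypothetical helper, not part of Hadoop.
describe_trash_interval() {
    minutes=$1
    if [ "$minutes" -gt 0 ]; then
        echo "trash enabled: deletion interval is $minutes minutes"
    else
        echo "trash disabled: rm deletes immediately"
    fi
}

describe_trash_interval 360
describe_trash_interval 0

# Query the locally visible configuration only when the hdfs client exists.
if command -v hdfs >/dev/null 2>&1; then
    if value=$(hdfs getconf -confKey fs.trash.interval 2>/dev/null); then
        describe_trash_interval "$value"
    fi
fi
```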

hadoop fs -rm -r /tmp/5gb

15/09/01 20:34:48 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://hdpnn/tmp/5gb' to trash at: hdfs://hdpnn/user/ambari-qa/.Trash/Current

In the preceding output:

  • Deletion interval specifies how long (in minutes) a checkpoint is kept before it expires and is deleted. It is the value of fs.trash.interval. The NameNode runs a thread that periodically removes expired checkpoints from the file system.
  • Emptier interval specifies how long (in minutes) the NameNode waits between runs of the thread that manages checkpoints. On each run, the NameNode deletes checkpoints that are older than fs.trash.interval and creates a new checkpoint from /user/${username}/.Trash/Current. This frequency is determined by the value of fs.trash.checkpoint.interval, which must not be greater than the deletion interval. This ensures that within an emptier window there is at least one checkpoint in the trash.
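Note that the sample output reports an emptier interval of 0 minutes. As far as the default trash policy goes, a checkpoint interval of 0 (or one larger than the deletion interval) is treated as "use the deletion interval". The sketch below only models that fallback; effective_emptier_interval is a hypothetical name, and the rule is my reading of the default policy rather than something shown in the output above.

```shell
# Model of the fallback rule for the emptier interval (values in minutes).
# effective_emptier_interval is a hypothetical helper, not a Hadoop command.
effective_emptier_interval() {
    deletion=$1     # fs.trash.interval
    checkpoint=$2   # fs.trash.checkpoint.interval
    if [ "$checkpoint" -eq 0 ] || [ "$checkpoint" -gt "$deletion" ]; then
        # Unset (0) or too large: fall back to the deletion interval.
        echo "$deletion"
    else
        echo "$checkpoint"
    fi
}

effective_emptier_interval 360 0     # prints 360
effective_emptier_interval 360 60    # prints 60
effective_emptier_interval 360 720   # prints 360
```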

For example, set

fs.trash.interval = 360 (deletion interval = 6 hours)
fs.trash.checkpoint.interval = 60 (emptier interval = 1 hour)

This causes the NameNode to create a new checkpoint every hour and to delete checkpoints that have existed longer than 6 hours.
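In core-site.xml, these two settings are written as property elements. The following shell sketch just prints them so that they can be pasted inside the configuration element; print_trash_properties is a hypothetical helper. I would expect the NameNode to need a restart before new values take effect, but check that for your distribution.

```shell
# Print core-site.xml property elements for the two trash settings.
# Both arguments are in minutes. Hypothetical helper, not part of Hadoop.
print_trash_properties() {
    cat <<EOF
<property>
  <name>fs.trash.interval</name>
  <value>$1</value>
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>$2</value>
</property>
EOF
}

# Deletion interval of 6 hours, emptier interval of 1 hour.
print_trash_properties 360 60
```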

What is a checkpoint?

A checkpoint is merely a directory under the user trash that stores all files or directories that were deleted before the checkpoint was created. If you want to take a look at a checkpoint, you can find it at /user/${username}/.Trash/{timestamp_of_checkpoint_creation}.
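Until the next checkpoint is taken, a deleted file sits under Current with its original path appended, so restoring it is an ordinary move. The sketch below computes that location for the earlier example. trash_location is a hypothetical helper, and it assumes the default layout under the user's home directory (no encryption zones).

```shell
# Where a deleted path lands before the next checkpoint is taken.
# Assumes the default layout /user/<username>/.Trash/Current/<original path>.
trash_location() {
    user=$1
    path=$2   # absolute HDFS path that was removed
    echo "/user/$user/.Trash/Current$path"
}

trash_location ambari-qa /tmp/5gb   # prints /user/ambari-qa/.Trash/Current/tmp/5gb

# To inspect and restore on a real cluster (not executed here):
#   hadoop fs -ls /user/ambari-qa/.Trash
#   hadoop fs -mv /user/ambari-qa/.Trash/Current/tmp/5gb /tmp/5gb
```

Once a checkpoint has been created, look under the timestamped checkpoint directory instead of Current.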

What if I want to empty the trash?

The first thing that comes to mind might be “Just delete the entire trash directory; that would remove everything.” True, that is always an option. But there is a better one. HDFS provides a command line utility to do that:

hadoop fs -expunge

This command causes the NameNode to permanently delete files from the trash that are older than the threshold, instead of waiting for the next emptier window. It immediately removes expired checkpoints from the file system.

When should I enable the trash? And what needs my attention?

For a production environment, it is recommended that you enable trash to guard against accidental removal operations. Enabling trash provides a chance to recover data from operational or user errors. But it is also important to set appropriate values for fs.trash.interval and fs.trash.checkpoint.interval to make trash work the way you expect. For example, if you need to frequently upload and delete files from HDFS, you probably want to set fs.trash.interval to a smaller value; otherwise, the checkpoints would take up too much space.

Keep in mind that when trash is enabled and you remove some files, the available HDFS capacity does not increase, because the files are not truly deleted. HDFS does not reclaim the space until the files are removed from the trash, which occurs only after their checkpoints expire. Sometimes you might want to bypass the trash when deleting files; in this case, you can run the rm command with the -skipTrash option. For example:

hadoop fs -rm -skipTrash /path/to/permanently/delete

This bypasses the trash and removes the files immediately from the file system.

More…

HDFS trash is defined as a pluggable interface. You can define a new policy if the default one (discussed here) doesn’t fit your requirements.

5 comments on "HDFS Trash"

  1. Hi Weiwei,

    Thanks for the article. I have a question though: Is the HDFS trash container set to a specific size? I mean, is there a setting for how much space it can use, or does it depend only on the Filesystem Trash Interval?

    Thanks,
    J.

    • WEIWEI YANG October 07, 2016

      Hello Daninho

      No, there isn’t a place to configure the space. Trash is just a user directory on HDFS. If you want to limit the size of a trash directory, you could use HDFS Quotas.

  2. What does the trash consist of: metadata or actual data?

    • Because you are deleting something from the file system, which holds actual data, what gets moved to the trash is that same data.
