Multiple Snapshots on a directory


We have a cluster in Cloudera. We are using HDFS snapshots for backups. Recently we have seen that the space used in HDFS has been growing significantly, and we suspect this is because of the snapshots we use for backups.

  1. When we try to see the size of a directory we see the following:

    hdfs dfs -du -s -h path

    12.4 T 76.8 T path

  2. When we look at the size of the same directory with the -x option, we see something totally different:

    hdfs dfs -du -h -s -x path

    12.4 T 37.2 T path

We also tried to look at the size of the snapshots on this directory. The sizes are the following:

hdfs dfs -du -s -h <path>/.snapshot
9.1 T  63.6 T <path>/.snapshot/snap-new
10.9 T  68.0 T <path>/.snapshot/snap-old
12.4 T  37.2 T <path>/.snapshot/snap-of-today

My question here is: if I delete all these snapshots (snap-new, snap-old, snap-of-today), will we start to see the size shown in number 2?

If not, what do I have to do in order to start seeing the size shown in number 2?

Thanks in advance!

CodePudding user response:

If you delete your snapshots you will use less disk space.

Just as a reminder of why snapshots become larger over time:

The implementation of HDFS Snapshots is efficient:

  • Snapshot creation is instantaneous: the cost is O(1), excluding the inode lookup time.
  • Additional memory is used only when modifications are made relative to a snapshot: memory usage is O(M), where M is the number of modified files/directories.
  • Blocks in datanodes are not copied: the snapshot files record the block list and the file size. There is no data copying.
  • Snapshots do not adversely affect regular HDFS operations: modifications are recorded in reverse chronological order so that the current data can be accessed directly. The snapshot data is computed by subtracting the modifications from the current data.

CodePudding user response:

There is no way in HDFS to see how much space a particular snapshot is using. In general, the oldest snapshot will use the most space, but it depends on when you delete and reload data. Any data covered by a snapshot is not removed from disk when you delete it from the live filesystem. From your du outputs:

12.4 T 76.8 T path

Notice how 3 x 12.4 = 37.2, which is the live data at the default replication factor of 3. So all your snapshots together are using 76.8 - 37.2 = 39.6 T beyond what is in the live filesystem, which uses 37.2 T. The -x switch excludes snapshot data, which is why it also shows 37.2 T for the live filesystem.
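To spell out the arithmetic (assuming the default replication factor of 3, which the numbers here are consistent with):

```shell
# Live data: 12.4 T logical x replication factor 3 = raw usage on disk.
awk 'BEGIN { printf "raw live usage: %.1f T\n", 12.4 * 3 }'

# Snapshot-only overhead: raw usage including snapshots minus raw live usage.
awk 'BEGIN { printf "snapshot overhead: %.1f T\n", 76.8 - 37.2 }'
```

The first number matches the second column of the -x output, and the 39.6 T difference is what the snapshots alone are pinning on disk.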

Running du on the snapshot directories just tells you how much space the files in each snapshot used at the time the snapshot was taken. Some of that space is shared between snapshots, and even with the live filesystem, so you cannot tell which snapshot is using the most space.
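While HDFS cannot attribute space to a single snapshot, hdfs snapshotDiff can at least list what changed between two snapshots, which helps you see where the churn (and therefore the retained data) comes from. A sketch using the snapshot names from the question; the directory is a placeholder, and the commands are printed with echo so you can review them first:

```shell
# Placeholder for your snapshottable directory from the question.
SNAP_DIR="/path"

# snapshotDiff lists entries created (+), deleted (-), modified (M)
# and renamed (R) between two snapshots of the same directory.
# Remove the leading "echo" to run this against your cluster.
echo hdfs snapshotDiff "$SNAP_DIR" snap-old snap-new

# "." compares a snapshot against the current live filesystem state.
echo hdfs snapshotDiff "$SNAP_DIR" snap-of-today .
```

A long list of deletions between two snapshots is a hint that the older snapshot is retaining a lot of data that no longer exists in the live filesystem.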

If you delete snapshots, starting from the oldest, the space usage should reduce.
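A sketch of that cleanup, oldest snapshot first. The directory is a placeholder, and the commands are printed with echo rather than executed so you can review them before running anything destructive:

```shell
# Placeholder for your snapshottable directory from the question.
SNAP_DIR="/path"

# Delete oldest first; remove the leading "echo" to actually delete.
for snap in snap-old snap-new snap-of-today; do
  echo hdfs dfs -deleteSnapshot "$SNAP_DIR" "$snap"
done
```

Space is only reclaimed once no remaining snapshot references the deleted blocks, so you may need to remove several snapshots before du shows a drop.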

If your cluster has workloads which frequently delete and re-create a lot of data, snapshots will greatly increase your space requirements on the cluster.
