Fastest way to check thousands of gzip files

Time:08-19

I wrote this snippet to loop through a folder, check each gz file for corruption, and fix any invalid file by gzipping it again. It works fine for a handful of files, but with thousands of files it takes far too long.

Is there a more efficient way to do this?

fix_corrupt_files()
{
  dir=$1
  
  for f in $dir/*.gz
  do
  
  if gzip -t $f;
    then :
  else
    log "$(basename $f) is corrupt"
    base="$(basename $f .gz)"
    log "fixing file"
    mv $f $dir/$base
    gzip $dir/$base
    log "file fixed"
  fi
  
  done
}

CodePudding user response:

This should give you a bit of a speedup:

fix_corrupt_files()
{
  dir="$1"
  
  for f in "$dir"/*.gz
  do 
  {
    if gzip -t "$f";
      then :
    else
      log "$(basename "$f") is corrupt"
      base="$(basename "$f" .gz)"
      log "fixing file"
      mv "$f" "$dir/$base"
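      gzip "$dir/$base"  # no extra & here: the enclosing { ... } & already runs in the background, and the final wait would not cover a nested background job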
      log "file fixed"
    fi
  } &
  done
  wait # wait for all background processes to terminate
}

Note that I'm assuming the gzip commands are your slow parts.

All I really did here was run your if statement in the background (with {...}&), so the checks for all the files run in parallel instead of one after another. There is a wait at the end of the function, so it won't return until all the background sub-processes complete. That may or may not fit your use case. Also be aware that the log calls from different files will interleave and may show up out of order. Again, it really depends on your use case whether that matters.

Also note that I added double quotes wherever they should be. It looks like you're confident that there are no spaces in your file names, but it was giving me anxiety.
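To show why the quotes matter, here is a tiny sketch. The helper count_args and the file name are made up for illustration; it only counts how many arguments a command would receive, and it assumes the default IFS.

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

f="my file.gz"    # made-up file name containing a space

count_args $f     # unquoted: split into "my" and "file.gz", prints 2
count_args "$f"   # quoted: passed as a single argument, prints 1
```

With the unquoted form, gzip -t $f would be asked to test two files that don't exist.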

edit: Also also note that this may bring your machine to its knees, since it starts one background job per file all at once. I'm not familiar enough with gzip to know how resource intensive it is. I also don't know how big your archives are. If that becomes a problem, you could add a loop counter that calls wait every X number of iterations.
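A rough sketch of that loop-counter idea, in case it helps. The function name fix_corrupt_files_batched, the max_jobs parameter and its default of 8 are my own inventions, and the log function here is only a stand-in for whatever yours does. I haven't tuned or benchmarked it.

```shell
# Stand-in for the asker's log function; drop this if you already have one.
log() { echo "$@"; }

# Hypothetical variant: at most max_jobs checks run at the same time.
fix_corrupt_files_batched()
{
  dir="$1"
  max_jobs="${2:-8}"   # arbitrary default, adjust for your machine
  count=0

  for f in "$dir"/*.gz
  do
    [ -e "$f" ] || continue   # the glob matched nothing
    {
      if ! gzip -t "$f"
      then
        log "$(basename "$f") is corrupt"
        base="$(basename "$f" .gz)"
        mv "$f" "$dir/$base"
        gzip "$dir/$base"
        log "$(basename "$f") fixed"
      fi
    } &
    count=$((count + 1))
    if [ $((count % max_jobs)) -eq 0 ]
    then
      wait   # let the current batch finish before starting more
    fi
  done
  wait   # wait for the last, possibly partial, batch
}
```

You would call it as fix_corrupt_files_batched /path/to/dir 16. Each batch only moves on once its slowest file is done, so this is simple rather than optimal.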
