Optimize zgrep using awk

Time:10-09

I have a set of files (in /c/Users/Roy/DataReceived) that I want to grep for some information, storing the results as txt files (in /c/Users/Roy/Documents/Result).

As an example: imagine I have 20 files with different information about cities, and I want to grep the information for the cities listed in a txt file. The matches for each city are then stored in a separate txt file named after that city (NewYork.txt, Rome.txt, etc).

The following code is working:

#!/bin/bash

declare INPUT_DIRECTORY=/c/Users/Roy/DataReceived
declare OUTPUT_DIRECTORY=/c/Users/Roy/Documents/Result

while read -r city; do
  echo "$city"
  zgrep -Hwi "$city" "${INPUT_DIRECTORY}/"*.vcf.gz > "${OUTPUT_DIRECTORY}/${city}.txt"
done < list_of_cities.txt

However, a full run takes around a week, because every file is decompressed again for each city in the list. Is there a way to unzip the files just once, using awk for example? I expect that would speed the process up considerably.

Also, is there any other way of optimizing the process?

CodePudding user response:

I suspect what you really need is something like the following, assuming the zipped file contains CSV with the city in the 3rd field:

zcat "${INPUT_DIRECTORY}/"*.vcf.gz |
sort -t',' -k3,3 |
awk -F',' -v outDir="$OUTPUT_DIRECTORY" '
    $3 != prev {
        close(out)
        out = outDir "/" $3 ".txt"
        prev = $3
    }
    { print > out }
'

If the file isn't CSV then change each ',' separator to whatever separator it really is, and if the city isn't in the 3rd field then change each 3 to whatever field number it really is.
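To make that substitution concrete, here is a minimal sketch run against made-up data: two tiny tab-separated archives with the city in the 2nd field. The temporary directory, file names, and contents are all hypothetical. It uses gzip -dc, which does the same job as zcat, and it records the current city in prev, which the $2 != prev comparison relies on.

```shell
#!/bin/bash
# Hypothetical demo data: tab-separated, gzip-compressed, city in field 2.
work=$(mktemp -d)
mkdir -p "$work/in" "$work/out"
printf 'a1\tRome\t10\na2\tNewYork\t20\n' | gzip > "$work/in/one.vcf.gz"
printf 'b1\tNewYork\t30\nb2\tRome\t40\n' | gzip > "$work/in/two.vcf.gz"

# Decompress every archive once, group the lines by city with sort,
# then write one output file per city.
gzip -dc "$work/in/"*.vcf.gz |
sort -t$'\t' -k2,2 |
awk -F'\t' -v outDir="$work/out" '
    $2 != prev {
        close(out)                    # only one output file open at a time
        out = outDir "/" $2 ".txt"
        prev = $2                     # remember the city just switched to
    }
    { print > out }
'

ls "$work/out"
```

Because the input is sorted on the city field, each output file is opened exactly once, so awk never needs more than one open file no matter how many cities there are.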

If you really do need to reduce the output to a specific list of cities then:

zcat "${INPUT_DIRECTORY}/"*.vcf.gz |
sort -t',' -k3,3 |
awk -F',' -v outDir="$OUTPUT_DIRECTORY" '
    NR == FNR {
        cities[$0]
        next
    }
    !($3 in cities) {
        next
    }
    $3 != prev {
        close(out)
        out = outDir "/" $3 ".txt"
        prev = $3
    }
    { print > out }
' list_of_cities.txt -
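A small self-contained run of the filtered variant, again on invented data (the temporary directory, the three cities, and the CSV layout are assumptions for illustration only). The awk script reads the city list as its first input and the sorted stream as its second input via "-".

```shell
#!/bin/bash
# Hypothetical demo data: CSV with the city in the 3rd field.
demo=$(mktemp -d)
mkdir -p "$demo/in" "$demo/out"
printf '1,x,Rome\n2,y,Paris\n' | gzip > "$demo/in/one.vcf.gz"
printf '3,z,NewYork\n4,w,Rome\n' | gzip > "$demo/in/two.vcf.gz"
printf 'Rome\nNewYork\n' > "$demo/list_of_cities.txt"

gzip -dc "$demo/in/"*.vcf.gz |
sort -t',' -k3,3 |
awk -F',' -v outDir="$demo/out" '
    NR == FNR { cities[$0]; next }    # first input: remember wanted cities
    !($3 in cities) { next }          # stdin: skip lines for other cities
    $3 != prev {
        close(out)
        out = outDir "/" $3 ".txt"
        prev = $3
    }
    { print > out }
' "$demo/list_of_cities.txt" -

ls "$demo/out"
```

Two things to keep in mind: the lookup is an exact, case-sensitive comparison of the whole field, unlike zgrep -wi, and the NR == FNR test assumes the city list is not empty (with an empty list, the condition would also hold for the piped data).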

CodePudding user response:

I'll give you the same reply as for your previous question:

zgrep -Hwif list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk -F ':' '{print > ( "/c/Users/Roy/Documents/Result/" tolower($NF) ".txt")}'
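
If the city can appear anywhere in the line and the output should still look like what zgrep -Hwi produces, something along these lines might work. It is only a sketch under assumptions: the demo files, their contents, and the temporary directory are made up, city names are assumed to be plain ASCII words without punctuation, and the whole-word test is emulated with index() rather than taken from grep. Each archive is decompressed exactly once.

```shell
#!/bin/bash
# Hypothetical demo data (file names, contents, and city list are made up).
tmp=$(mktemp -d)
mkdir -p "$tmp/in" "$tmp/out"
printf 'x1 rome 10\nx2 Romeo 20\nx3 NEWYORK 30\n' | gzip > "$tmp/in/one.vcf.gz"
printf 'y1 NewYork;40\ny2 Paris 50\n' | gzip > "$tmp/in/two.vcf.gz"
printf 'Rome\nNewYork\n' > "$tmp/list_of_cities.txt"

for f in "$tmp/in/"*.vcf.gz; do
    # One decompression per archive; "-" makes awk read the stream.
    gzip -dc "$f" |
    awk -v src="$f" -v outDir="$tmp/out" '
        NR == FNR {                        # first input: the city list
            if ($0 != "") cities[tolower($0)] = $0
            next
        }
        {
            # Lower-case the line and turn every non-word character into a
            # space, so " city " can only match a whole word.
            line = tolower($0)
            gsub(/[^a-z0-9_]/, " ", line)
            line = " " line " "
            for (c in cities) {
                if (index(line, " " c " ")) {
                    out = outDir "/" cities[c] ".txt"
                    print src ":" $0 >> out    # file-name prefix, as with -H
                    close(out)                 # keep the open-file count low
                }
            }
        }
    ' "$tmp/list_of_cities.txt" -
done

ls "$tmp/out"
```

The inner loop checks every city against every line, so for a long city list it may be worth comparing its run time with the zgrep -f approach before committing to it. Output files are opened in append mode, so the output directory should be empty before a run.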