How can I grep from a list in a file where each line includes a space? Working with genus and species

Time:02-16

I have a file organisms.txt with one organism (genus and species) per line.

Escherichia coli
Staphylococcus aureus
Prevotella sp. 855
Saprospirales
Candidatus Accumulibacter phosphatis

I want to use grep to search through another file for each organism and write the matches to an output file named after the organism. My file large_file.txt looks like this:

Parcubacteria bacterium    0    87    2762014
Saprospirales    837    78    1936988
Escherichia coli    857    95    562
Bacteroides ihuae    12    100    1852362
Candidatus Escherichia coli O12H3    988    95    888
Dialister invisus    30    86    218538
Fake Escherichia bacterium    112    99    110
Escherichia coli 07798    1094    99   1005566
Escherichia coli    14    87    562
Saprospirales bacterium    87    98.6    4587674
Saprospirales sp.    12588    99    1936988

I am using this while loop.

while IFS= read -r line
do
out="${line}_hits.txt"
grep "${line}" large_file.txt
> "$out"
done < "organisms.txt"

I have manually verified that organisms from my list are found in large_file.txt. The loop creates all the output files, but they are all empty. I would expect, for example, the output file Escherichia coli_hits.txt to look like this:

Escherichia coli    857    95    562
Candidatus Escherichia coli O12H3    988    95    888
Escherichia coli 07798    1094    99   1005566
Escherichia coli    14    87    562

And I would expect the output file Saprospirales_hits.txt to look like this:

Saprospirales    837    78    1936988
Saprospirales bacterium    87    98.6    4587674
Saprospirales sp.    12588    99    1936988

I would also expect a file named Staphylococcus aureus_hits.txt to have been created and to be empty, along with similar files for all other lines in organisms.txt that were not found in large_file.txt.

What do I need to change to get my desired results?

CodePudding user response:

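In your loop, the redirection to "$out" sits on its own line, so it is a separate command. It creates (or truncates) the file and writes nothing to it, while grep's output goes to the terminal:

grep "$line" large_file.txt
> "$out" # This truncates the file

Putting the redirection on the same line sends grep's output to the file:

grep "$line" large_file.txt > "$out"

With >, the file is truncated every time it is opened, so $out contains only the most recent result of grep for that name. That is enough when each organism appears once in organisms.txt. If a name can repeat, or results should accumulate across runs, append instead:

grep "$line" large_file.txt >> "$out"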

This still opens and closes a filehandle for each iteration, but because the output filename depends on the line being read, you can't move the redirection to outside the loop.
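
As a rough illustration, the loop with the redirection on the same line as grep could look like the sketch below. The scratch directory and the sample data (a few rows copied from the question) are there only so the snippet runs on its own. The -F option, which matches the name as a literal string so that the dot in a name like "Prevotella sp. 855" is not treated as a regular-expression wildcard, is an addition that the question did not ask for. The sketch uses > because every name in the sample list is unique; >> would behave the same way here.

```shell
# Scratch directory with small sample inputs (illustrative only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'Escherichia coli' 'Staphylococcus aureus' 'Saprospirales' > organisms.txt
printf '%s\n' \
  'Saprospirales    837    78    1936988' \
  'Escherichia coli    857    95    562' \
  'Bacteroides ihuae    12    100    1852362' \
  'Escherichia coli 07798    1094    99   1005566' \
  'Saprospirales sp.    12588    99    1936988' > large_file.txt

while IFS= read -r line
do
    out="${line}_hits.txt"
    # Redirection on the same line as grep, so the matches land in the file.
    # -F: literal match; --: protects against names that start with "-".
    # grep exits with status 1 when nothing matches; "|| true" keeps the
    # loop going if the script is run under "set -e".
    grep -F -- "$line" large_file.txt > "$out" || true
done < organisms.txt
```

With this sample data, Staphylococcus aureus_hits.txt is created but stays empty, which matches what the question expects for organisms that are not found.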

CodePudding user response:

Given the symptoms you describe, I'd guess your organisms.txt has DOS line endings, so line in your script always ends in \r. Escherichia coli\r, for example, is never present in large_file.txt, so grep matches nothing. See why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it.
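
If that guess is right, one possible workaround is to strip a trailing carriage return from each line before using it. The sketch below assumes nothing beyond a POSIX shell plus mktemp; the sample organisms.txt is deliberately written with CRLF endings, and both sample files exist only so the snippet runs by itself.

```shell
# Scratch directory with small sample inputs (illustrative only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# organisms.txt deliberately written with DOS (CRLF) line endings.
printf '%s\r\n' 'Escherichia coli' 'Saprospirales' > organisms.txt
printf '%s\n' \
  'Escherichia coli    857    95    562' \
  'Saprospirales bacterium    87    98.6    4587674' > large_file.txt

# A literal carriage return; command substitution strips only newlines.
cr=$(printf '\r')

while IFS= read -r line
do
    # Remove one trailing carriage return, if present.
    line=${line%"$cr"}
    out="${line}_hits.txt"
    grep -- "$line" large_file.txt > "$out" || true
done < organisms.txt
```

To check whether the file really has DOS line endings, od -c organisms.txt will show \r \n pairs at the end of each line. Converting the file once, for example with tr -d '\r' < organisms.txt > organisms_unix.txt (the output name is just a placeholder), would avoid the per-line stripping.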
