I have a file organisms.txt with one organism (genus and species) per line:
Escherichia coli
Staphylococcus aureus
Prevotella sp. 855
Saprospirales
Candidatus Accumulibacter phosphatis
I want to use grep to search through another file for each organism and write the matches to an output file named after the organism. My file large_file.txt looks like this:
Parcubacteria bacterium 0 87 2762014
Saprospirales 837 78 1936988
Escherichia coli 857 95 562
Bacteroides ihuae 12 100 1852362
Candidatus Escherichia coli O12H3 988 95 888
Dialister invisus 30 86 218538
Fake Escherichia bacterium 112 99 110
Escherichia coli 07798 1094 99 1005566
Escherichia coli 14 87 562
Saprospirales bacterium 87 98.6 4587674
Saprospirales sp. 12588 99 1936988
I am using this while loop.
while IFS= read -r line
do
out="${line}_hits.txt"
grep "${line}" large_file.txt
> "$out"
done < "organisms.txt"
I have checked manually that the organisms in my list are present in large_file.txt, and they definitely are. The loop creates all the output files, but they are all empty. I would expect, for example, the output file Escherichia coli_hits.txt to look like this:
Escherichia coli 857 95 562
Candidatus Escherichia coli O12H3 988 95 888
Escherichia coli 07798 1094 99 1005566
Escherichia coli 14 87 562
And I would expect the output file Saprospirales_hits.txt to look like this:
Saprospirales 837 78 1936988
Saprospirales bacterium 87 98.6 4587674
Saprospirales sp. 12588 99 1936988
I would also expect a file named Staphylococcus aureus_hits.txt to be created as an empty file, along with similar empty files for all other lines in organisms.txt that were not found in large_file.txt.
What do I need to change to get my desired results?
CodePudding user response:
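Because the redirection to "$out" sits on its own line, bash runs it as a separate command: grep prints its matches to the terminal, and the bare redirection then just creates (or truncates) the file in every loop iteration, leaving it empty:
grep "$line" large_file.txt
> "$out" # This truncates the file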
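Putting the redirection on the same line makes grep write into the file:
grep "$line" large_file.txt > "$out"
but > still truncates $out each time it runs, so if a name occurs more than once in organisms.txt the file contains only the most recent result of grep. To keep earlier results, append instead:
grep "$line" large_file.txt >> "$out"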
This still opens and closes a filehandle for each iteration, but because the output filename depends on the line being read, you can't move the redirection to outside the loop.
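To make this concrete, here is a minimal sketch of the loop with the redirection on the same line as grep. The temporary directory and the shortened sample data are assumptions made so the example is self-contained. The -F option is also an addition that was not in the question: it makes grep treat each name as a fixed string, so the "." in a name like "Prevotella sp. 855" is not read as a regex wildcard. The sketch uses > because each name occurs once in this sample list; >> would be the choice if names can repeat.

```shell
# Hypothetical sandbox so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened sample data taken from the question.
printf '%s\n' 'Escherichia coli' 'Staphylococcus aureus' 'Saprospirales' > organisms.txt
printf '%s\n' \
  'Parcubacteria bacterium 0 87 2762014' \
  'Saprospirales 837 78 1936988' \
  'Escherichia coli 857 95 562' \
  'Candidatus Escherichia coli O12H3 988 95 888' \
  'Fake Escherichia bacterium 112 99 110' \
  'Saprospirales bacterium 87 98.6 4587674' > large_file.txt

while IFS= read -r line
do
    out="${line}_hits.txt"
    # Redirection on the same line as grep, so the matches go into the file.
    # The shell creates "$out" even when grep finds nothing, which gives the
    # empty file the question expects for names without matches.
    grep -F -- "$line" large_file.txt > "$out"
done < organisms.txt
```

With this sample data, Escherichia coli_hits.txt and Saprospirales_hits.txt each receive two lines, and Staphylococcus aureus_hits.txt is created empty.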
CodePudding user response:
Given the symptoms you describe, I'd guess your organisms.txt has DOS line endings, so line in your script always ends in \r, and Escherichia coli\r, for example, is never present in large_file.txt. See "Why does my tool output overwrite itself and how do I fix it?".
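If you want to test that guess, the following sketch counts the lines in organisms.txt that contain a carriage return and then strips a trailing one inside the loop. The temporary directory, the CRLF sample file, and the variable names cr and cr_lines are assumptions for illustration only, and -F is an optional addition that makes grep treat the name as a fixed string.

```shell
# Hypothetical sandbox so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Assumed sample: organisms.txt saved with DOS (CRLF) line endings.
printf 'Escherichia coli\r\nSaprospirales\r\n' > organisms.txt
printf '%s\n' 'Saprospirales 837 78 1936988' 'Escherichia coli 857 95 562' > large_file.txt

# A literal carriage return, built portably.
cr=$(printf '\r')

# Count lines containing a carriage return; a non-zero count suggests CRLF endings.
cr_lines=$(grep -c "$cr" organisms.txt)
echo "lines with CR: $cr_lines"

while IFS= read -r line
do
    line=${line%"$cr"}    # drop one trailing carriage return, if present
    out="${line}_hits.txt"
    grep -F -- "$line" large_file.txt > "$out"
done < organisms.txt
```

Converting the file once, for example with dos2unix organisms.txt or tr -d '\r', is an alternative to stripping the character on every iteration.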