I have a file of IDs called IDs_list.txt that I want to use to extract information from a second file, which has hundreds of IDs, many of which are not in IDs_list.txt.
I've tried combinations of if and grep but my results keep coming up empty.
Here is an example of what I'm trying to do and what I've done.
cat IDs_list.txt | head -n 4
24
43
56
69
cat sample1.txt | head -n 4
NODE_1_length_148512_cov_24.5066,gi|573017271|gb|CP006568.1|,148512,4513140,8,7289,86.545,0.0,13461,24,madeup species 1
NODE_2_length_122550_cov_25.719,gi|84778498|dbj|AP008232.1|,122550,4171146,13,12690,93.693,0.0,23435,244,madeup species 2
NODE_3_length_103385_cov_25.9802,gi|84778498|dbj|AP008232.1|,103385,4171146,6,4243,88.782,0.0,7836,43,madeup species 3
NODE_4_length_101672_cov_25.6536,gi|84778498|dbj|AP008232.1|,101672,4171146,7,4139,86.799,0.0,7644,955,long name here
The IDs are in the 10th column.
I need to pull out all lines whose ID is in IDs_list.txt.
So my output should be:
NODE_1_length_148512_cov_24.5066,gi|573017271|gb|CP006568.1|,148512,4513140,8,7289,86.545,0.0,13461,24,madeup species 1
NODE_3_length_103385_cov_25.9802,gi|84778498|dbj|AP008232.1|,103385,4171146,6,4243,88.782,0.0,7836,43,madeup species 3
I've tried:
for file in sample?.txt; do awk 'FNR==NR{arr[$0];next} ($10 in arr)' IDs_list.txt $file; done
Nothing comes out. I took this example from another Stack Overflow question.
for i in $(cat IDs_list.txt); do awk -F"," '$10 == $i' sample1.txt; done
But this prints the same output many times, because I am iterating over IDs_list.txt line by line, so it is not what I want. I may get the first output line hundreds of times because IDs_list.txt has hundreds of IDs.
Then I tried grep, but that didn't work either. My syntax must be off.
for file in sample?.txt; do for i in $(cat IDs_list.txt); do grep -w '$i' $file; done; done
Nothing is output here. My logic is that, for each sample file, I want to grep the lines that contain an ID found in IDs_list.txt. However, I would rather match on the 10th column specifically, because the IDs can sometimes show up in other columns that are not actually IDs.
Is there an elegant way of doing this in a for loop with grep, awk, or both?
Answer:
You may use this awk command:
awk -F, 'NR==FNR {ids[$1]; next} $10 in ids' IDs_list.txt sample1.txt
NODE_1_length_148512_cov_24.5066,gi|573017271|gb|CP006568.1|,148512,4513140,8,7289,86.545,0.0,13461,24,madeup species 1
NODE_3_length_103385_cov_25.9802,gi|84778498|dbj|AP008232.1|,103385,4171146,6,4243,88.782,0.0,7836,43,madeup species 3
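As for why the attempts in the question print nothing: the first awk command has no -F, so awk splits on whitespace and $10 is empty for the lines shown. In the other two, $i sits inside single quotes, so the shell never expands it. Below is a minimal sketch of the same awk approach applied to every sample file. The filter_by_id function name and the .filtered.txt output names are made up for illustration, and the sketch assumes one ID per line with no trailing whitespace or Windows line endings.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the question):
# print the rows of a comma-separated file whose 10th column appears
# in the ID list, which holds one ID per line.
filter_by_id() {
    ids_file=$1
    data_file=$2
    awk -F, '
        NR == FNR { ids[$1]; next }   # first file: remember each ID
        $10 in ids                    # second file: print rows whose 10th field is a known ID
    ' "$ids_file" "$data_file"
}

# Run the filter on every sample file, writing one output file per input.
# The output naming scheme is an assumption; adjust as needed.
for file in sample?.txt; do
    [ -e "$file" ] || continue        # skip if the glob matched nothing
    filter_by_id IDs_list.txt "$file" > "${file%.txt}.filtered.txt"
done
```

Because the comparison is an exact match on the 10th field, an ID of 24 does not match 244, and a 24 appearing in some other column is ignored. The ID list is read once per sample file rather than once per ID, so no inner loop over the IDs is needed.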