I have a list of ids in one file that I want to use to grep their information from a second file. My output only shows the information for the last id, and I can't figure out how to tweak my code so that it prints the info for every id, not just the last one.
My command:
for i in $(cat my_ids.txt);
do
for name in $i;
do
class=$(grep -A 25 $name id_info.txt | grep -E "tf_class");
family=$(grep -A 25 $name id_info.txt | grep -E "tf_family");
echo -e "$name\n\class\n\family";
done
done
I only get the last id's information lines that I need. I need it to show up for each ID and I don't know how else to tweak this. I also tried removing the second for loop but it was giving the exact same output.
Sample input from my_ids.txt:
MA0052.4
MA0602.1
MA0497.1
MA0786.1
MA0515.1
Sample input from id_info.txt:
AC MA0052.4
XX
ID MEF2A
XX
DE MA0052.4 MEF2A ; From JASPAR
PO A C G T
01 5075.0 2119.0 3651.0 5317.0
02 4033.0 1960.0 4493.0 5676.0
03 1984.0 10919.0 1007.0 2252.0
04 627.0 2974.0 236.0 12325.0
05 12437.0 1013.0 1066.0 1646.0
06 13132.0 253.0 610.0 2167.0
07 14680.0 141.0 506.0 835.0
08 14453.0 231.0 241.0 1237.0
09 14956.0 173.0 202.0 831.0
10 441.0 349.0 215.0 15157.0
11 15582.0 50.0 422.0 108.0
12 2566.0 1060.0 11104.0 1432.0
13 7709.0 4039.0 1605.0 2809.0
14 6171.0 3523.0 1810.0 4658.0
15 5254.0 3812.0 2479.0 4617.0
XX
CC tax_group:vertebrates
CC tf_family:Regulators of differentiation
CC tf_class:MADS box factors
CC pubmed_ids:25217591
CC uniprot_ids:Q02078
CC data_type:ChIP-seq
AC MA0602.1
XX
ID Arid5a
XX
DE MA0602.1 Arid5a ; From JASPAR
PO A C G T
01 18.0 43.0 23.0 17.0
02 16.0 32.0 3.0 48.0
03 85.0 3.0 7.0 5.0
04 96.0 0.0 1.0 2.0
05 6.0 0.0 1.0 93.0
06 93.0 1.0 1.0 6.0
07 2.0 1.0 1.0 96.0
08 4.0 9.0 4.0 83.0
09 23.0 3.0 52.0 22.0
10 34.0 35.0 18.0 12.0
11 29.0 13.0 27.0 31.0
12 57.0 8.0 19.0 16.0
13 29.0 18.0 26.0 27.0
14 34.0 23.0 15.0 27.0
XX
CC tax_group:vertebrates
CC tf_family:ARID-related
CC tf_class:ARID
CC pubmed_ids:25215497
CC uniprot_ids:Q3U108
CC data_type:PBM
XX
AC MA0497.1
XX
ID MEF2C
XX
DE MA0497.1 MEF2C ; From JASPAR
PO A C G T
01 705.0 321.0 676.0 507.0
02 733.0 151.0 573.0 752.0
03 431.0 196.0 822.0 760.0
04 382.0 1412.0 78.0 337.0
05 0.0 985.0 0.0 1224.0
06 1616.0 256.0 74.0 263.0
07 1706.0 32.0 241.0 230.0
08 2107.0 0.0 87.0 15.0
09 2131.0 0.0 2.0 76.0
10 2135.0 0.0 4.0 70.0
11 56.0 62.0 0.0 2091.0
12 2177.0 0.0 32.0 0.0
13 389.0 120.0 1671.0 29.0
14 975.0 836.0 148.0 250.0
15 1009.0 450.0 126.0 624.0
XX
CC tax_group:vertebrates
CC tf_family:Regulators of differentiation
CC tf_class:MADS box factors
CC pubmed_ids:7559475
CC uniprot_ids:Q06413
CC data_type:ChIP-seq
XX
AC MA0786.1
XX
ID POU3F1
XX
DE MA0786.1 POU3F1 ; From JASPAR
PO A C G T
01 1034.0 126.0 322.0 1437.0
02 505.0 186.0 128.0 2471.0
03 2471.0 7.0 26.0 21.0
04 44.0 53.0 21.0 2471.0
05 37.0 13.0 2471.0 232.0
06 170.0 2471.0 413.0 1119.0
07 1423.0 1.0 21.0 1048.0
08 2471.0 103.0 130.0 284.0
09 2471.0 20.0 25.0 63.0
10 259.0 95.0 128.0 2471.0
11 382.0 302.0 620.0 1167.0
12 1510.0 478.0 452.0 961.0
XX
CC tax_group:vertebrates
CC tf_family:POU domain factors
CC tf_class:Homeo domain factors
CC pubmed_ids:1361172
CC uniprot_ids:Q03052
CC data_type:HT-SELEX
XX
AC MA0515.1
XX
ID Sox6
XX
DE MA0515.1 Sox6 ; From JASPAR
PO A C G T
01 4.0 139.0 50.0 56.0
02 0.0 221.0 0.0 28.0
03 161.0 0.0 0.0 88.0
04 0.0 0.0 0.0 249.0
05 0.0 0.0 0.0 249.0
06 0.0 0.0 249.0 0.0
07 0.0 0.0 0.0 249.0
08 0.0 115.0 5.0 129.0
09 4.0 112.0 0.0 133.0
10 14.0 76.0 31.0 128.0
XX
CC tax_group:vertebrates
CC tf_family:SOX-related factors
CC tf_class:High-mobility group (HMG) domain factors
CC pubmed_ids:21985497
CC uniprot_ids:P40645
CC data_type:ChIP-seq
XX
Example of the output I get when I run this as a bash script:
MA0052.4
MA0602.1
MA0497.1
MA0786.1
MA0515.1 CC tf_class:High-mobility group (HMG) domain factors CC tf_family:SOX-related factors
Desired output:
MA0602.1 CC ARID CC ARID-related
MA0497.1 CC MADS box factors CC Regulators of differentiation
MA0786.1 CC Homeo domain factors CC POU domain factors
MA0515.1 CC tf_class:High-mobility group (HMG) domain factors CC tf_family:SOX-related factors
Another code snippet I tried but the output just gives me id names and nothing more; probably because I am messing up the syntax somehow (ran this in terminal):
while IFS= read -r line; do class=$(grep -A 25 $line id_infoc.txt | grep -E "tf_class"); family=$(grep -A 25 $line id_info.txt | grep -E "tf_family"); echo -e "$line\n\class\n\family"; done < my_ids.txt
CodePudding user response:
Try this script:
#! /usr/bin/env bash
while read -r id; do
name="$id"
class=$( grep -A 25 "$name" id_info.txt | grep -E "tf_class")
family=$(grep -A 25 "$name" id_info.txt | grep -E "tf_family")
echo -e "${name}\n${class}\n${family}"
done <"my_ids.txt"
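A variation on this loop, offered only as a sketch: it uses printf rather than echo -e (so backslashes in the data are never interpreted), passes -F so the "." in an id is matched literally rather than as a regex wildcard, and runs the context grep once per id instead of twice. The function name lookup_ids and its two positional arguments are hypothetical, not something from the question. The -m 1 on the inner greps is an added assumption: it keeps only the first matching line, which helps when a record is shorter than 25 lines and the context spills into the next record.

```shell
#!/bin/sh
# Hypothetical helper: $1 = file with one id per line, $2 = record file.
lookup_ids() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue    # skip blank lines in the id file
        # Fetch the record once. -F treats the pattern as a literal string,
        # so the "." in MA0052.4 is not a regex wildcard.
        record=$(grep -F -A 25 -- "AC $id" "$2")
        # -m 1 stops at the first hit, i.e. the one closest to the AC line.
        class=$(printf '%s\n' "$record" | grep -m 1 'tf_class')
        family=$(printf '%s\n' "$record" | grep -m 1 'tf_family')
        printf '%s\n%s\n%s\n' "$id" "$class" "$family"
    done < "$1"
}
```

It would be called as lookup_ids my_ids.txt id_info.txt. It still scans id_info.txt once per id, so it shares the efficiency concern of any grep-in-a-loop approach.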
CodePudding user response:
Ignoring style, the bug in your code is that you use \family and \class instead of $family and $class. With echo -e, bash treats \c as "produce no further output", so everything after the id is dropped.
Invoking grep multiple times as you do will be a bit inefficient if the file is large and there are many ids to check. A straightforward solution in awk that only needs to read each file once might be:
awk '
function do_print () {
if (name in ids)
printf("%s\n%s\n%s\n",name,class,family)
name=class=family=""
}
# read ids into an array
NR==FNR { ids[$0]; next }
# start of a section
/^AC / { do_print(); name=$2; next }
# other candidate values found
/^CC tf_family:/ { family=$0; next }
/^CC tf_class:/ { class=$0; next }
# maybe print final section
END { do_print() }
' my_ids.txt id_info.txt
To strip the tf_family: and tf_class: prefixes from the output, the regex patterns can be replaced by calls to sub:
sub(/^CC tf_family:/,"CC ") { family=$0; next }
sub(/^CC tf_class:/,"CC ") { class=$0; next }
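For reference, the one-pass approach with the sub variant can be assembled into a single self-contained sketch. The wrapper name extract_tf_info and its two positional arguments are hypothetical, and the name != "" guard is an added assumption so that a blank line in the id file cannot trigger an empty record. Note that the output follows the order of records in id_info.txt, not the order of my_ids.txt.

```shell
#!/bin/sh
# Hypothetical wrapper: $1 = file with one id per line, $2 = record file.
extract_tf_info() {
    awk '
    function do_print() {
        # Print the finished record only if its id was requested.
        if (name != "" && (name in ids))
            printf("%s\n%s\n%s\n", name, class, family)
        name = class = family = ""
    }
    # First file: remember every id.
    NR == FNR { ids[$0]; next }
    # Start of a new record: flush the previous one.
    /^AC / { do_print(); name = $2; next }
    # sub() returns 1 when it replaced something, so it doubles as the pattern.
    sub(/^CC tf_family:/, "CC ") { family = $0; next }
    sub(/^CC tf_class:/,  "CC ") { class  = $0; next }
    # Flush the last record.
    END { do_print() }
    ' "$1" "$2"
}
```

It would be called as extract_tf_info my_ids.txt id_info.txt; each file is read exactly once regardless of how many ids are listed.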