Execute command on each line in a file


I have a list of IDs in one file that I want to use to grep their information from a second file. My output only shows the information for the last ID, and I can't figure out how to tweak my code so that it prints the information for every ID rather than only the last one.

My command:

for i in $(cat my_ids.txt); do
    for name in $i; do
        class=$(grep -A 25 $name id_info.txt | grep -E "tf_class");
        family=$(grep -A 25 $name id_info.txt | grep -E "tf_family");
        echo -e "$name\n\class\n\family";
    done
done

I only get the lines I need for the last ID. I need them for each ID and I don't know how else to tweak this. I also tried removing the second for loop, but it gave exactly the same output.

Sample input from my_ids.txt:


Sample input from id_info.txt:

AC MA0052.4
DE MA0052.4 MEF2A ; From JASPAR
PO  A   C   G   T
01  5075.0  2119.0  3651.0  5317.0
02  4033.0  1960.0  4493.0  5676.0
03  1984.0  10919.0 1007.0  2252.0
04  627.0   2974.0  236.0   12325.0
05  12437.0 1013.0  1066.0  1646.0
06  13132.0 253.0   610.0   2167.0
07  14680.0 141.0   506.0   835.0
08  14453.0 231.0   241.0   1237.0
09  14956.0 173.0   202.0   831.0
10  441.0   349.0   215.0   15157.0
11  15582.0 50.0    422.0   108.0
12  2566.0  1060.0  11104.0 1432.0
13  7709.0  4039.0  1605.0  2809.0
14  6171.0  3523.0  1810.0  4658.0
15  5254.0  3812.0  2479.0  4617.0
CC tax_group:vertebrates
CC tf_family:Regulators of differentiation
CC tf_class:MADS box factors
CC pubmed_ids:25217591
CC uniprot_ids:Q02078
CC data_type:ChIP-seq
AC MA0602.1
ID Arid5a
DE MA0602.1 Arid5a ; From JASPAR
PO  A   C   G   T
01  18.0    43.0    23.0    17.0
02  16.0    32.0    3.0 48.0
03  85.0    3.0 7.0 5.0
04  96.0    0.0 1.0 2.0
05  6.0 0.0 1.0 93.0
06  93.0    1.0 1.0 6.0
07  2.0 1.0 1.0 96.0
08  4.0 9.0 4.0 83.0
09  23.0    3.0 52.0    22.0
10  34.0    35.0    18.0    12.0
11  29.0    13.0    27.0    31.0
12  57.0    8.0 19.0    16.0
13  29.0    18.0    26.0    27.0
14  34.0    23.0    15.0    27.0
CC tax_group:vertebrates
CC tf_family:ARID-related
CC tf_class:ARID
CC pubmed_ids:25215497
CC uniprot_ids:Q3U108
CC data_type:PBM
AC MA0497.1
DE MA0497.1 MEF2C ; From JASPAR
PO  A   C   G   T
01  705.0   321.0   676.0   507.0
02  733.0   151.0   573.0   752.0
03  431.0   196.0   822.0   760.0
04  382.0   1412.0  78.0    337.0
05  0.0 985.0   0.0 1224.0
06  1616.0  256.0   74.0    263.0
07  1706.0  32.0    241.0   230.0
08  2107.0  0.0 87.0    15.0
09  2131.0  0.0 2.0 76.0
10  2135.0  0.0 4.0 70.0
11  56.0    62.0    0.0 2091.0
12  2177.0  0.0 32.0    0.0
13  389.0   120.0   1671.0  29.0
14  975.0   836.0   148.0   250.0
15  1009.0  450.0   126.0   624.0
CC tax_group:vertebrates
CC tf_family:Regulators of differentiation
CC tf_class:MADS box factors
CC pubmed_ids:7559475
CC uniprot_ids:Q06413
CC data_type:ChIP-seq
AC MA0786.1
DE MA0786.1 POU3F1 ; From JASPAR
PO  A   C   G   T
01  1034.0  126.0   322.0   1437.0
02  505.0   186.0   128.0   2471.0
03  2471.0  7.0 26.0    21.0
04  44.0    53.0    21.0    2471.0
05  37.0    13.0    2471.0  232.0
06  170.0   2471.0  413.0   1119.0
07  1423.0  1.0 21.0    1048.0
08  2471.0  103.0   130.0   284.0
09  2471.0  20.0    25.0    63.0
10  259.0   95.0    128.0   2471.0
11  382.0   302.0   620.0   1167.0
12  1510.0  478.0   452.0   961.0
CC tax_group:vertebrates
CC tf_family:POU domain factors
CC tf_class:Homeo domain factors
CC pubmed_ids:1361172
CC uniprot_ids:Q03052
CC data_type:HT-SELEX
AC MA0515.1
ID Sox6
DE MA0515.1 Sox6 ; From JASPAR
PO  A   C   G   T
01  4.0 139.0   50.0    56.0
02  0.0 221.0   0.0 28.0
03  161.0   0.0 0.0 88.0
04  0.0 0.0 0.0 249.0
05  0.0 0.0 0.0 249.0
06  0.0 0.0 249.0   0.0
07  0.0 0.0 0.0 249.0
08  0.0 115.0   5.0 129.0
09  4.0 112.0   0.0 133.0
10  14.0    76.0    31.0    128.0
CC tax_group:vertebrates
CC tf_family:SOX-related factors
CC tf_class:High-mobility group (HMG) domain factors
CC pubmed_ids:21985497
CC uniprot_ids:P40645
CC data_type:ChIP-seq

Example of the output I get when I run this as a bash script:

MA0515.1        CC tf_class:High-mobility group (HMG) domain factors    CC tf_family:SOX-related factors

Desired output:

 MA0602.1    CC ARID    CC ARID-related
 MA0497.1    CC MADS box factors    CC Regulators of differentiation
 MA0786.1    CC Homeo domain factors    CC POU domain factors
 MA0515.1    CC tf_class:High-mobility group (HMG) domain factors    CC tf_family:SOX-related factors

I also tried another snippet, run directly in the terminal, but its output contains only the ID names and nothing more, probably because I am getting the syntax wrong:

while IFS= read -r line; do class=$(grep -A 25 $line id_infoc.txt | grep -E "tf_class"); family=$(grep -A 25 $line id_info.txt | grep -E "tf_family"); echo -e "$line\n\class\n\family"; done < my_ids.txt  

CodePudding user response:

Try this script:

#! /usr/bin/env bash

# read each id into "name", the variable the loop body uses
while IFS= read -r name; do
    class=$(grep -A 25 "$name" id_info.txt | grep -E "tf_class")
    family=$(grep -A 25 "$name" id_info.txt | grep -E "tf_family")
    echo -e "${name}\n${class}\n${family}"
done <"my_ids.txt"

CodePudding user response:

Ignoring style, the bug in your code is that you use \family and \class instead of $family and $class.
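To see why this leaves only the ids in the output: inside double quotes the shell passes \c and \f through unchanged, and echo -e then treats them as escape sequences. In bash, \c means "produce no further output", so everything after the first \n is dropped. The sketch below reproduces the effect with printf '%b', which expands backslash escapes the same way echo -e does in bash and should behave the same in other POSIX shells. The sample values are taken from the question's data.

```shell
name="MA0602.1"
class="CC tf_class:ARID"
family="CC tf_family:ARID-related"

# Buggy: "\class" begins with the escape \c, which stops all further output.
broken=$(printf '%b\n' "$name\n\class\n\family")

# Fixed: expand the variables instead of writing backslash escapes.
fixed=$(printf '%s\n%s\n%s\n' "$name" "$class" "$family")

printf 'broken: [%s]\n' "$broken"
printf 'fixed:\n%s\n' "$fixed"
```

Using printf with %s placeholders also avoids surprises if a value itself happens to contain a backslash.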

Invoking grep multiple times as you do will be a bit inefficient if the file is large and there are many ids to check.

A straightforward solution in awk that only needs to read each file once might be:

awk '
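    function do_print () {
        # output layout (tab-separated id, class, family) is assumed
        if (name in ids)
            printf "%s\t%s\t%s\n", name, class, family
    }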

    # read ids into an array
    NR==FNR { ids[$0]; next }

    # start of a section
    /^AC / { do_print(); name=$2; next }

    # other candidate values found
    /^CC tf_family:/ { family=$0; next }
    /^CC tf_class:/ { class=$0; next }

    # maybe print final section
    END { do_print() }
' my_ids.txt id_info.txt
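A self-contained way to try this approach is sketched below. It writes two small files to a temporary directory and runs an awk program of the same shape on them. Several details are assumptions rather than part of the answer: the records are abbreviated from the question's sample (matrix rows omitted), the id list is invented because the question does not show my_ids.txt, the tab-separated layout is a guess at the desired output, and the reset of class and family at each AC line is an addition so that a record lacking those lines cannot inherit values from the previous one.

```shell
tmp=$(mktemp -d)

# Assumed id list; the question does not show my_ids.txt.
printf '%s\n' MA0602.1 MA0515.1 > "$tmp/my_ids.txt"

# Abbreviated records from the question's sample (matrix rows omitted).
cat > "$tmp/id_info.txt" <<'EOF'
AC MA0052.4
DE MA0052.4 MEF2A ; From JASPAR
CC tf_family:Regulators of differentiation
CC tf_class:MADS box factors
AC MA0602.1
ID Arid5a
DE MA0602.1 Arid5a ; From JASPAR
CC tf_family:ARID-related
CC tf_class:ARID
AC MA0515.1
ID Sox6
CC tf_family:SOX-related factors
CC tf_class:High-mobility group (HMG) domain factors
EOF

result=$(awk '
    # print the finished record if its id was requested
    function do_print() {
        if (name in ids)
            printf "%s\t%s\t%s\n", name, class, family
    }
    NR==FNR { ids[$0]; next }                 # first file: collect ids
    /^AC / { do_print(); name=$2; class=family=""; next }
    /^CC tf_family:/ { family=$0; next }
    /^CC tf_class:/  { class=$0; next }
    END { do_print() }
' "$tmp/my_ids.txt" "$tmp/id_info.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that records come out in the order they appear in id_info.txt, not in the order of my_ids.txt, because the info file is streamed once from top to bottom.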

To strip the tf_family: and tf_class: labels from the output, the regex patterns can be replaced by calls to sub:

    sub(/^CC tf_family:/,"CC ") { family=$0; next }
    sub(/^CC tf_class:/,"CC ") { class=$0; next }
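This works because sub() returns the number of substitutions it made (0 or 1), so it can serve as the pattern of a rule while also rewriting $0 before the action runs. A minimal sketch on three lines from the question's sample, with the "family=" and "class=" prefixes added purely for illustration:

```shell
stripped=$(printf '%s\n' \
    'CC tax_group:vertebrates' \
    'CC tf_family:ARID-related' \
    'CC tf_class:ARID' |
awk '
    # sub() returns 1 when it replaced the prefix, so the action runs
    # only for matching lines, and $0 already holds the shortened text
    sub(/^CC tf_family:/, "CC ") { print "family=" $0; next }
    sub(/^CC tf_class:/,  "CC ") { print "class=" $0; next }
')

printf '%s\n' "$stripped"
```

The tax_group line matches neither call, so it is left untouched and produces no output.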
