a.txt contains 500,000 columns and 2,000 rows. The example below only shows the first 9 columns of this file. The first row is a header.
chromosome SNPID rsid position alleleA alleleB 2409086 3514581 3635346 ...
1 1:55487346_C_G rs12117661 55487346 C G 1 0 0 ...
1 1:55487648_A_G rs11588151 55487648 A G 1 0 0 ...
1 1:55489542_C_T rs34232196 55489542 C T 1 0 0 ...
1 1:55490861_T_C rs4500361 55490861 T C 1 0 0 ...
1 1:55491702_T_C rs4927191 55491702 T C 0.894118 0 0 ...
1 1:55491780_A_G rs200159426 55491780 A G 0.894118 0 0 ...
...
b.txt contains 45,000 lines, one column name per line. I want to extract from a.txt the columns listed in b.txt.
chromosome
SNPID
rsid
position
alleleA
alleleB
2409086
3635346
...
c.txt is my expected outcome. It should be a space-separated table with 45,000 columns and 2,000 rows.
chromosome SNPID rsid position alleleA alleleB 2409086 3635346 ...
1 1:55487346_C_G rs12117661 55487346 C G 1 0 ...
1 1:55487648_A_G rs11588151 55487648 A G 1 0 ...
1 1:55489542_C_T rs34232196 55489542 C T 1 0 ...
1 1:55490861_T_C rs4500361 55490861 T C 1 0 ...
1 1:55491702_T_C rs4927191 55491702 T C 0.894118 0 ...
1 1:55491780_A_G rs200159426 55491780 A G 0.894118 0 ...
...
I tried to use cut to solve this problem, but it fails with "Argument list too long" (since I need to extract 45,000 columns). I know awk may be able to solve this, but I am not familiar with awk and did not find any existing answer about it. Does anybody have a solution?
cut -f 1,$(
head -n1 a.txt |
tr ' ' '\n' |
grep -nf b.txt |
sed ':a;$!N;s/:[^\n]*\n/,/;ta;s/:.*//'
) a.txt > c.txt
-bash: /usr/bin/cut: Argument list too long
Answer:
With awk
Suppose this is filter.awk:
NR == FNR {    # reading the first file (b.txt): remember the wanted column names
    wanted[$1] = 1
    next
}
FNR == 1 {     # header line of a.txt: record each column's name
    for (i=1; i<=NF; i++) {
        header[i] = $i
    }
}
{              # every line of a.txt (including the header): print only the wanted columns
    for (i=1; i<=NF; i++) {
        if (header[i] in wanted) {
            printf "%s ", $i
        }
    }
    print ""
}
Then, given your sample a.txt and
$ cat b.txt
chromosome
rsid
2409086
we get
$ awk -f filter.awk b.txt a.txt
chromosome rsid 2409086
1 rs12117661 1
1 rs11588151 1
1 rs34232196 1
1 rs4500361 1
1 rs4927191 0.894118
1 rs200159426 0.894118
...
This version will be a bit quicker: it doesn't have to iterate over all the columns for each record:
NR == FNR {    # reading the first file (b.txt)
    wanted[$1] = 1
    next
}
FNR == 1 {     # header line of a.txt: record the field numbers to print, in order
    n = 0
    for (i=1; i<=NF; i++) {
        if ($i in wanted) {
            cols_to_print[++n] = i
        }
    }
}
{
    for (i=1; i<=n; i++) printf "%s ", $(cols_to_print[i])
    print ""
}
Answer:
Using any awk:
$ cat tst.awk
NR == FNR {    # first file (b.txt): list of output field names, in the order to print
    out2tag[++numOutFlds] = $1
    next
}
FNR == 1 {     # header line of a.txt: map each column name to its input field number
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        tag2in[$inFldNr] = inFldNr
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        tag = out2tag[outFldNr]
        inFldNr = tag2in[tag]
        printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
$ awk -f tst.awk b.txt a.txt
chromosome SNPID rsid position alleleA alleleB 2409086 3635346
1 1:55487346_C_G rs12117661 55487346 C G 1 0
1 1:55487648_A_G rs11588151 55487648 A G 1 0
1 1:55489542_C_T rs34232196 55489542 C T 1 0
1 1:55490861_T_C rs4500361 55490861 T C 1 0
1 1:55491702_T_C rs4927191 55491702 T C 0.894118 0
1 1:55491780_A_G rs200159426 55491780 A G 0.894118 0
Three things to note with this approach:
- It only loops as many times per input line as the number of fields you want to output, 45,000 in this case (as opposed to looping through all 500,000 input fields and discarding 455,000 of them). This would be a noticeable performance improvement over @GlennJackman's first solution but the same as their 2nd solution.
- The output fields will be printed in the same order as the lines in b.txt, so you don't have to produce output in the same order as the input fields in a.txt. This is the main functional difference between the above and @GlennJackman's 2nd solution; theirs will also be very slightly faster, as mine requires 1 extra hash lookup per field.
- It won't produce an extra blank char at the end of each output line (a common issue with such solutions).
Answer:
I suggest trying GNU datamash, if you are allowed to install and use tools other than awk, though I am not sure how it will deal with such a big file and so many columns. It has a cut operation, named after the cut command, but unlike cut it understands headers and keeps the requested order. As a simple example, let the content of file.txt be
alleleA alleleB alleleC
A C G
T A C
G T A
then
datamash --field-separator=' ' --headers cut "alleleC,alleleA" < file.txt
gives output
cut(alleleC) cut(alleleA)
G A
C T
A G
Observe that the requested order was applied and that cut appears in the output headers, since that was the operation performed. If this is not acceptable, you might elect to remove cut( and ) using e.g. sed (see the sketch below), as long as there are no parentheses in the column names.
(tested in GNU datamash 1.7)
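A minimal, untested sketch of that header cleanup, assuming the cut(name) header format shown above and that no column name contains parentheses:

# remove the cut( ... ) wrappers from the header line only
datamash --field-separator=' ' --headers cut "alleleC,alleleA" < file.txt |
  sed '1s/cut(\([^)]*\))/\1/g'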