Home > Enterprise >  Extract columns from multiple text files with bash or awk or sed?
Extract columns from multiple text files with bash or awk or sed?

Time:11-20

I am trying to extract column1 and column4 from multiple text files.

file1.txt:

#rname  startpos        endpos  numreads        covbases        coverage        meandepth       meanbaseq       meanmapq
CFLAU10s46802|kraken:taxid|33189        1       125     2       105     84      1.68    36.8    24
CFLAU10s46898|kraken:taxid|33189        1       116     32      116     100     23.5862 35.7    19.4
CFLAU10s46988|kraken:taxid|33189        1       105     2       53      50.4762 1.00952 36.9    11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence   1       1102    2       88      7.98548 0.15971 36.4    10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence   1       2133    6       113     5.2977  0.186592        36.6    13

file2.txt:

#rname  startpos        endpos  numreads        covbases        coverage        meandepth       meanbaseq       meanmapq
CFLAU10s46802|kraken:taxid|33189        1       125     5       105     84      1.68    36.8    24
CFLAU10s46898|kraken:taxid|33189        1       116     40      116     100     23.5862 35.7    19.4
CFLAU10s46988|kraken:taxid|33189        1       105     6       53      50.4762 1.00952 36.9    11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence   1       1102    2       88      7.98548 0.15971 36.4    10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence   1       2133    6       113     5.2977  0.186592        36.6    13

output format (save the output as merged.txt in another directory). In the output file: Column1(#nname) will be once because this is same for every file, but there will be multiple column4 (numreads) as many as files and the rename the column4 should be according to each file name. Output file looks like:

#rname  file1_numreads     file2_numreads
CFLAU10s46802|kraken:taxid|33189        2       5
CFLAU10s46898|kraken:taxid|33189       32      40
CFLAU10s46988|kraken:taxid|33189        2       6
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence   2       88
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence   6       113

Your suggestions would be appreciated.

CodePudding user response:

Here is something I put together. awk gurus might have a simpler - shorter version but I am still learning awk.

Create a file script.awk and make it executable. Put in it:

#!/usr/bin/awk -f

BEGIN { FS="\t" }

# process files, ignoring comments
!/^#/ {
    # keep the first column values.
    # Only add a new value if it is not already in the array.
    if (!($1 in firstcolumns)) {
        firstcolumns[$1] = $1
    }
    # extract the 4th column of file1, put it in the array (column 1).1
    if (FILENAME == ARGV[1]) {
        results[$1 ".1"] = $4
    }
    # extract the 4th column of file2, put it in the array (column 1).2
    if (FILENAME == ARGV[2]) {
        results[$1 ".2"] = $4
    }
}

# print the results
END {
    # for each first column value...
    for (key in firstcolumns) {
        # Print the first column, then (column 1).1, then (column 1).2
        print key "\t" results[key ".1"] "\t" results[key ".2"]
    }
}
  • Call it like this: ./script.awk file1.txt file2.txt.
  • Since awk parses the files line per line, I keep the possible values of the first column in an array (firstcolumns).
  • For each line, if the 4th column comes from file1.txt (ARGV[1]) I store it in the results array under (firstcolumn).1.
  • For each line, if the 4th column comes from file2.txt (ARGV[2]) I store it in the results array under (firstcolumn).2.
  • In the END block, loop through the possible firstcolumn values and print the values (firstcolumn).1 and (firstcolumn).2, separated by "\t" for tabs.

Results:

$ ./so.awk file1.txt file2.txt
AUZW01004514.1  C4  C4
CFLAU10s46988|kraken:taxid|33189    2   6
CFLAU10s46802|kraken:taxid|33189    2   5
AUZW01004739.1  C4  C4
CFLAU10s46898|kraken:taxid|33189    32  40

CodePudding user response:

I had some fun writing this bash-3 solution with sort. If not for the printf '%q' (which may not be needed if the filenames don't include tabs or newlines) and the $'\t' (which can be replaced with literal tabs), it would be POSIX compliant.

It makes some assumptions though:

  • That every record is present in each file
  • That the character # comes prior * when sorting
#!/bin/bash
IFS=$'\t'

for fpath in "$@"
do
    fid=$(printf '%q\n' "$fpath")
    printf '%s\t_\t%s\n' "#rname" "${fid}_numreads"
    {
        read # Trash the header!

        while read -d $'\n' -r rname _ _ numreads _ _ _ _ _
        do
            printf '%s\t%s\t%s\n' "*$rname" "$fid" "$numreads"
        done
    } < "$fpath"
done |
sort -t $'\t' |
{
    declare record='#rname'
    while read -d $'\n' -r rname fid numreads
    do
        if [ "$rname" != "${record%%$'\t'*}" ]
        then
            printf '%s\n' "${record#\*}"
            record="$rname"
        fi
        record =$'\t'"$numreads"
    done
}

intermediate result (just after the sort)

#rname   _  file1.txt_numreads
#rname   _  file2.txt_numreads
*AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence  file1.txt   2
*AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence  file2.txt   2
*AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence  file1.txt   6
*AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence  file2.txt   6
*CFLAU10s46802|kraken:taxid|33189   file1.txt   2
*CFLAU10s46802|kraken:taxid|33189   file2.txt   5
*CFLAU10s46898|kraken:taxid|33189   file1.txt   32
*CFLAU10s46898|kraken:taxid|33189   file2.txt   40
*CFLAU10s46988|kraken:taxid|33189   file1.txt   2
*CFLAU10s46988|kraken:taxid|33189   file2.txt   6

final outout

#rname  file1.txt_numreads  file2.txt_numreads
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence   2   2
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence   6   6
CFLAU10s46802|kraken:taxid|33189    2   5
CFLAU10s46898|kraken:taxid|33189    32  40

remark: It is possible to write this script in pure bash-4 (which means without sort) using associative arrays. but more advanced scripting languages like awk perl python ruby would probably do a better job.

  • Related