I am trying to extract column1 and column4 from multiple text files.
file1.txt:
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
CFLAU10s46802|kraken:taxid|33189 1 125 2 105 84 1.68 36.8 24
CFLAU10s46898|kraken:taxid|33189 1 116 32 116 100 23.5862 35.7 19.4
CFLAU10s46988|kraken:taxid|33189 1 105 2 53 50.4762 1.00952 36.9 11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 1 1102 2 88 7.98548 0.15971 36.4 10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 1 2133 6 113 5.2977 0.186592 36.6 13
file2.txt:
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
CFLAU10s46802|kraken:taxid|33189 1 125 5 105 84 1.68 36.8 24
CFLAU10s46898|kraken:taxid|33189 1 116 40 116 100 23.5862 35.7 19.4
CFLAU10s46988|kraken:taxid|33189 1 105 6 53 50.4762 1.00952 36.9 11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 1 1102 2 88 7.98548 0.15971 36.4 10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 1 2133 6 113 5.2977 0.186592 36.6 13
Desired output format (the output should be saved as merged.txt in another directory). In the output file, column 1 (#rname) appears only once because it is the same in every file, but there is one column 4 (numreads) per input file, and each numreads column should be renamed after the file it came from. The output file looks like:
#rname file1_numreads file2_numreads
CFLAU10s46802|kraken:taxid|33189 2 5
CFLAU10s46898|kraken:taxid|33189 32 40
CFLAU10s46988|kraken:taxid|33189 2 6
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 2 88
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 6 113
Your suggestions would be appreciated.
CodePudding user response:
Here is something I put together. awk gurus might have a simpler, shorter version, but I am still learning awk.
Create a file script.awk and make it executable. Put in it:
#!/usr/bin/awk -f
BEGIN { FS="\t" }

# process files, ignoring comments
!/^#/ {
    # keep the first column values.
    # Only add a new value if it is not already in the array.
    if (!($1 in firstcolumns)) {
        firstcolumns[$1] = $1
    }

    # extract the 4th column of file1, put it in the array under (column 1).1
    if (FILENAME == ARGV[1]) {
        results[$1 ".1"] = $4
    }

    # extract the 4th column of file2, put it in the array under (column 1).2
    if (FILENAME == ARGV[2]) {
        results[$1 ".2"] = $4
    }
}

# print the results
END {
    # for each first column value...
    for (key in firstcolumns) {
        # Print the first column, then (column 1).1, then (column 1).2
        print key "\t" results[key ".1"] "\t" results[key ".2"]
    }
}
- Call it like this: ./script.awk file1.txt file2.txt
- Since awk parses the files line by line, I keep the possible values of the first column in an array (firstcolumns).
- For each line, if the 4th column comes from file1.txt (ARGV[1]), I store it in the results array under (firstcolumn).1.
- For each line, if the 4th column comes from file2.txt (ARGV[2]), I store it in the results array under (firstcolumn).2.
- In the END block, loop through the possible firstcolumn values and print the values (firstcolumn).1 and (firstcolumn).2, separated by "\t" for tabs.
Results:
$ ./script.awk file1.txt file2.txt
AUZW01004514.1 C4 C4
CFLAU10s46988|kraken:taxid|33189 2 6
CFLAU10s46802|kraken:taxid|33189 2 5
AUZW01004739.1 C4 C4
CFLAU10s46898|kraken:taxid|33189 32 40
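If you need this to work for any number of input files and also want a header row that names each numreads column after its file (as in the desired output), the same idea can be generalized. The following is only a sketch along those lines, not a tested solution: it assumes tab-separated input, file names ending in .txt, and that each file is passed exactly once on the command line.

#!/usr/bin/awk -f
BEGIN { FS = OFS = "\t" }

# header lines of the input files are skipped
/^#/ { next }

{
    # remember each rname once, in order of first appearance
    if (!($1 in seen)) {
        seen[$1] = 1
        order[++n] = $1
    }
    # detect when awk moves on to the next input file
    if (FILENAME != prev) {
        prev = FILENAME
        nfiles++
        names[nfiles] = FILENAME
    }
    reads[$1, nfiles] = $4
}

END {
    # header: #rname plus one numreads column per input file
    header = "#rname"
    for (f = 1; f <= nfiles; f++) {
        name = names[f]
        sub(/\.txt$/, "", name)          # file1.txt -> file1
        header = header OFS name "_numreads"
    }
    print header

    # one line per rname, one numreads column per file
    for (i = 1; i <= n; i++) {
        line = order[i]
        for (f = 1; f <= nfiles; f++)
            line = line OFS reads[order[i], f]
        print line
    }
}

Called as, for example, ./merge_numreads.awk file1.txt file2.txt > /path/to/other_dir/merged.txt (the script name and output directory here are placeholders), it would write the merged table, header included, to merged.txt.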
CodePudding user response:
I had some fun writing this bash-3 solution with sort. If not for the printf '%q' (which may not be needed if the filenames don't include tabs or newlines) and the $'\t' (which can be replaced with literal tabs), it would be POSIX compliant.
It makes some assumptions though:
- That every record is present in each file
- That the character # comes prior to * when sorting
#!/bin/bash
IFS=$'\t'

for fpath in "$@"
do
    fid=$(printf '%q\n' "$fpath")
    # header line for this file; the leading # makes it sort before the data rows
    printf '%s\t_\t%s\n' "#rname" "${fid}_numreads"
    {
        read # Trash the header!
        while read -d $'\n' -r rname _ _ numreads _ _ _ _ _
        do
            # prefix data rows with * so they sort after the # header lines
            printf '%s\t%s\t%s\n' "*$rname" "$fid" "$numreads"
        done
    } < "$fpath"
done |
sort -t $'\t' |
{
    declare record='#rname'
    while read -d $'\n' -r rname fid numreads
    do
        if [ "$rname" != "${record%%$'\t'*}" ]
        then
            printf '%s\n' "${record#\*}"
            record="$rname"
        fi
        record+=$'\t'"$numreads"
    done
    # don't forget the last accumulated record
    printf '%s\n' "${record#\*}"
}
intermediate result (just after the sort):
#rname _ file1.txt_numreads
#rname _ file2.txt_numreads
*AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence file1.txt 2
*AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence file2.txt 2
*AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence file1.txt 6
*AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence file2.txt 6
*CFLAU10s46802|kraken:taxid|33189 file1.txt 2
*CFLAU10s46802|kraken:taxid|33189 file2.txt 5
*CFLAU10s46898|kraken:taxid|33189 file1.txt 32
*CFLAU10s46898|kraken:taxid|33189 file2.txt 40
*CFLAU10s46988|kraken:taxid|33189 file1.txt 2
*CFLAU10s46988|kraken:taxid|33189 file2.txt 6
final output
#rname file1.txt_numreads file2.txt_numreads
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 2 2
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 6 6
CFLAU10s46802|kraken:taxid|33189 2 5
CFLAU10s46898|kraken:taxid|33189 32 40
CFLAU10s46988|kraken:taxid|33189 2 6
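To put the result where the question asks for it (merged.txt in another directory), redirecting standard output is enough. Assuming the script above is saved as, say, merge_numreads.sh (the script name and target directory below are placeholders):

chmod +x merge_numreads.sh
./merge_numreads.sh file1.txt file2.txt > /path/to/other_dir/merged.txt   # placeholder paths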
remark: It is possible to write this script in pure bash-4 (which means without sort) using associative arrays, but more advanced scripting languages like awk, perl, python or ruby would probably do a better job.
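For what it's worth, here is a minimal, untested sketch of that bash-4 idea. It assumes tab-separated input, file names ending in .txt, and bash 4 or later, and it keeps the rnames in the order they are first seen rather than sorting them:

#!/bin/bash
# bash >= 4 required for associative arrays
# assumes tab-separated input files whose names end in .txt

declare -A reads          # reads["rname<TAB>file"]=numreads
declare -A seen           # rnames already recorded
declare -a order          # rnames in order of first appearance
header='#rname'

for fpath in "$@"
do
    header+=$'\t'"${fpath%.txt}_numreads"    # file1.txt -> file1_numreads
    {
        read -r                              # skip the header line
        while IFS=$'\t' read -r rname _ _ numreads _
        do
            if [[ -z ${seen[$rname]} ]]
            then
                seen["$rname"]=1
                order+=("$rname")
            fi
            key=$rname$'\t'$fpath
            reads["$key"]=$numreads
        done
    } < "$fpath"
done

printf '%s\n' "$header"
for rname in "${order[@]}"
do
    line=$rname
    for fpath in "$@"
    do
        key=$rname$'\t'$fpath
        line+=$'\t'"${reads[$key]}"
    done
    printf '%s\n' "$line"
done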