Home > Net >  Every two columns are merged into one column in
Every two columns are merged into one column in

Time:10-31

I have a text file that looks like this It's 432 rows and more than 1.2 million columns I want to join each ATGC pairwise, starting with the seventh column, to achieve this effect: The two base pairs after splicing are still separated by a space I tried to do this on linux using the awk command, but it was too slow: enter image description here thank you very much!

Use code in any language, hopefully python

CodePudding user response:

awk solution

I have an alternative awk solution that uses global substitutions rather than loops to format the data. I suspect this will be faster than what you tried but do not have your file to try it on.

This is the solution, (replace data.txt with the path to your data file):

awk '{$1 =$1"*"$2"*"$3"*"$4"*"$5"*"$6"*"; $2=$3=$4=$5=$6=""; print $0}' data.txt | awk 'BEGIN{FS="*"} {gsub(/ /,"",$0);gsub(/.{2}/,"& ",$7); print $0}'

Explanation

The first part of this procedure rewrites the first field $1 to a string containing the first 6 fields concatenated, and ending, with intervening asterisks (*) - chosen as a unique character, choose a character that is not elsewhere in your file. fields $2 to $6 inclusive are set to empty strings before the new version of the line $0 is printed to stdout.

At this stage the output has many unwanted spaces because the empty fields $2 to $6 still have their output-field-separators (spaces).

The output is piped into a second awk procedure that delimits the fields by asterisk BEGIN{FS="*"}, giving a total of 7 fields with many unwanted spaces.

Global substitution is used to replace all spaces in the record ($0) with empty strings using gsub(/ /,"",$0).

Because the fields are defined by asterisk, the entire sequence (GATC) data, now devoid of any spaces, is held in a single field, the last field $7.

A second global substitution is used to insert a space after every second character in this sequence: gsub(/.{2}/,"& ",$7).

Finally, printing the entire record (print $0) outputs the data with the unaltered output-field-separator (a space) effectively replacing the asterisks.

example

I tested the procedure with a file structured as follows:

(data.txt file)
one two three four five six A A C C G G C C G G

output:

one two three four five six AA CC GG CC GG 

edit - dealing with excessive field numbers

The OP reported the above solution failed in their case with the message that the maximum number of fields (~32k) had been exceeded.

One work-around is to use gawk, the GNU implementation of awk (inbuilt in many linux distributions or available at: https://www.gnu.org/software/software.html), which can handle more fields than awk.

Alternative Solution using sed and awk

Alternatively, the data could be pre-processed to create a smaller number of fields by inserting a unique character with the stream editor sed. The data in the OP's data that needs reformatting is present in fields 7..n where n is large, and can exceed 32k. By placing a unique character immediately before the start of these fields, awk can split each line (record) at the unique character resulting in only 2 fields to process. In the same way suggested for the earlier solution, the required changes can be made by applying global substitutions to field 2.

The solution becomes simpler than the original as the spaces between the first six fields are untouched:

sed 's/ /*/6' data.txt | awk 'BEGIN{FS="*"} {gsub(/ /,"",$2); gsub(/.{2}/,"& ",$2); print $0}

(change data.txt to the path to your data file)

explanation

sed substitutes the sixth space in each original line (record) with an asterisk (must be unique): sed 's/ /*/6'

The stdout is pipped into awk where a BEGIN block sets the field-separator to an asterisk. Awk now sees two fields for each line, one containing (space-separated) columns 1-6, the other containing the ATCG sequence stream separated by spaces.

Awk can now replace the spaces from the sequence data with empty string by targetting a global substitution to field 2: gsub(/ /,"",$2)

Lastly, field 2 is again processed to insert a space every second character using gsub(/.{2}/,"& ",$2), before the output is printed.

caveat Whether this alternative solution will overcome the field-limit problem depends on how awk interprets $0 (the whole line) internally. My hunch is that it does not view it as a collection of fields but rather views it as a single string. If so, and if strings of that length can be handled, it should work. (An informative experiment). If there is a limit to string length, the procedure could be modified to insert several asterisks across the sequence data, say in 1000 blocks, with each record processed as before.

  • Related