The original data looks like:
ID, kgp11274425_A, kgp11274425_HET, kgp5732633_C, kgp5732633_HET, rs707_G, rs707_HET, kgp75_T, kgp75_HET
1 C T G T C A 0 0
2 C C T G A A G T
3 A A G G C G A A
4 G G C C A A T A
Desired Output:
ID, kgp11274425 kgp5732633 rs707 kgp75
1 CT GT CA 00
2 CC TG AA GT
3 AA GG CG AA
4 GG CC AA TA
I was able to accomplish this using the following Python script:
import pandas as pd

# df is assumed to be the raw data already loaded into a DataFrame
sep = '_'
unique_cols = pd.Index(map(lambda x: x.split(sep, 1)[0], df.columns)).unique()
results = []
columns = []
for col in unique_cols:
    my_cols = [x for x in df.columns if x.startswith(col)]
    results.append(df[my_cols].sum(axis=1).values)   # summing string columns concatenates the two alleles
    columns.append(col)
new_df = pd.DataFrame(results).T
new_df.columns = columns
But this time the data has 522 rows & 5311137 columns (about 5 GB), and Python is not able to read the file. So I need to run the same logic using bash commands. I am new to bash, please help.
CodePudding user response:
Input:
$ cat raw.dat
ID, kgp11274425_A, kgp11274425_HET, kgp5732633_C, kgp5732633_HET, rs707_G, rs707_HET, kgp75_T, kgp75_HET
1 C T G T C A 0 0
2 C C T G A A G T
3 A A G G C G A A
4 G G C C A A T A
One awk idea:
awk '
{ printf $1                                          # print 1st column
  for (i=2;i<=NF;i+=2) {                             # process columns 2 at a time
      if (FNR==1)                                    # 1st row? then ...
         printf " %s", substr($i,1,index($i,"_")-1)  # print ith column sans the "_xxxx" suffix
      else
         printf " %s%s", $i, $(i+1)                  # print the (i)th and (i+1)th columns
  }
  print ""
}
' raw.dat
This generates:
ID, kgp11274425 kgp5732633 rs707 kgp75
1 CT GT CA 00
2 CC TG AA GT
3 AA GG CG AA
4 GG CC AA TA
NOTE: OP's desired output does not appear to be single-space delimited, nor tab delimited, so I've opted for a single space; OP can modify the printf format strings to accomplish the desired formatting.
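For example, a tab-delimited variant only needs the two printf format strings changed; a sketch of that (same logic as above, with the result redirected to a new file — reshaped.dat is just an example name):

awk '
{ printf "%s", $1                                    # 1st column (ID) as-is
  for (i=2;i<=NF;i+=2) {                             # two columns at a time
      if (FNR==1)
         printf "\t%s", substr($i,1,index($i,"_")-1) # header: strip the "_xxxx" suffix
      else
         printf "\t%s%s", $i, $(i+1)                 # data: glue the allele pair
  }
  print ""
}
' raw.dat > reshaped.dat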
CodePudding user response:
The basic approach is to emulate what FPAT does, but in a more portable fashion. This code has been tested and confirmed working on gawk 5.1.1 (including the -ce/-Pe flags), mawk 1.3.4, mawk 1.9.9.6, and macOS 12.3 nawk. It produces:
ID, kgp11274425 kgp5732633 rs707 kgp75
1 CT GT CA 00
2 CC TG AA GT
3 AA GG CG AA
4 GG CC AA TA
[mng]awk 'BEGIN {
     FS = "[_][^_]+[_][^,]+[,]?[ " (substr(\
          OFS = "\t\t", _+=_=_~_)) "]*"
    print $( (getline) < -(NF=NF) )
     __ = (__="[^ \t]+") substr(_=substr(___="&\6\3",_) \
          "[ \t]+", index(_, "[") ) __ "|_"
     FS = "^$"
 } gsub(__,___) gsub(_,"")' datafile.txt
- By using regex, it avoids having to manually cycle through the fields, even two at a time. \6\3 (ACK ETX) is just an ASCII-only SEP choice I prefer over SUBSEP (\034) or the null byte (\0).
- A 2nd advantage is that by having separate and independent logic to emulate FPAT, FS can be set directly to ^$, as the rationale for splitting up individual fields no longer applies ... which, hopefully, could help move things along, since OP did mention having several million columns.
- The caveat of this solution is that it doesn't attempt to have everything line up in a perfect manner, using \t\t (2 horizontal tabs) as a crude solution.
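For readability, here is a de-obfuscated sketch of the same idea; this is a reconstruction, not the author's exact code, and it assumes the blank-separated layout of the sample above:

mawk 'BEGIN {
   SEP = "\6\3"                               # the ACK ETX marker pair
   FS  = "[_][^_]+[_][^,]+[,]?[ \t]*"         # header: each "_A, name_HET," run is a separator
   OFS = "\t\t"
   getline                                    # read the header line
   NF  = NF                                   # force $0 to be rebuilt with OFS
   print                                      # emit the collapsed header
   FS  = "^$"                                 # from here on, never split fields
 }
 {
   gsub(/[^ \t]+[ \t]+[^ \t]+/, "&" SEP)      # mark the end of every 2-token group
   gsub(SEP "[ \t]+", "")                     # drop each marker plus the blanks after it
   print
 }' raw.dat

The first gsub tags every pair of whitespace-separated tokens with the marker; the second deletes each marker together with the blank run after it, which joins the next token onto the preceding one and yields the paired-allele layout shown above.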
Synthetic 5 GB file benchmark: 1 min 14 secs for a 5.12 GB file, with a read-in throughput rate of 70.5 MB/s:
 out9: 2.54GiB 0:01:14 [34.9MiB/s] [34.9MiB/s] [====================================>] 102%
  in0: 5.12GiB 0:01:14 [70.5MiB/s] [70.5MiB/s] [====================================>] 100%

( pvE 0.1 in0 < testcase_dna_002.txt | gawk -be ; )
      73.44s user 2.20s system 101% cpu 1:14.37 total
nice pv -pteba -i 0.25 -r --size='2540m' -cN out9 > /dev/null
      0.24s user 0.73s system 1% cpu 1:14.37 total