The original data looks like:
ID, kgp11274425_A, kgp11274425_HET, kgp5732633_C, kgp5732633_HET, rs707_G, rs707_HET, kgp75_T, kgp75_HET
1 C T G T C A 0 0
2 C C T G A A G T
3 A A G G C G A A
4 G G C C A A T A
Desired Output:
ID, kgp11274425 kgp5732633 rs707 kgp75
1 CT GT CA 00
2 CC TG AA GT
3 AA GG CG AA
4 GG CC AA TA
I was able to accomplish this using the following Python script:
import pandas as pd

# df is assumed to be the raw data already loaded into a DataFrame
sep = '_'
unique_cols = pd.Index(map(lambda x: x.split(sep, 1)[0], df.columns)).unique()
results = []
columns = []
for col in unique_cols:
    my_cols = [x for x in df.columns if x.startswith(col)]
    results.append(df[my_cols].sum(axis=1).values)   # summing string columns concatenates the two alleles
    columns.append(col)
new_df = pd.DataFrame(results).T
new_df.columns = columns
But this time the data has 522 rows & 5311137 columns (about 5 GB), and Python is not able to read the file. So I need to run the same logic using bash commands. I am new to bash, please help.
CodePudding user response:
Input:
$ cat raw.dat
ID, kgp11274425_A, kgp11274425_HET, kgp5732633_C, kgp5732633_HET, rs707_G, rs707_HET, kgp75_T, kgp75_HET
1 C T G T C A 0 0
2 C C T G A A G T
3 A A G G C G A A
4 G G C C A A T A
One awk idea:
awk '
{ printf $1                                          # print 1st column
  for (i=2;i<=NF;i+=2) {                             # process columns 2 at a time
      if (FNR==1)                                    # 1st row? then ...
         printf " %s", substr($i,1,index($i,"_")-1)  # print ith column sans the "_xxxx" suffix
      else
         printf " %s%s", $i, $(i+1)                  # print the (i)th and (i+1)th columns
  }
  print ""
}
' raw.dat
This generates:
ID, kgp11274425 kgp5732633 rs707 kgp75
1 CT GT CA 00
2 CC TG AA GT
3 AA GG CG AA
4 GG CC AA TA
NOTE: OP's desired output does not appear to be single-space delimited, nor tab delimited, so I've opted for a single space; OP can modify the printf format strings to accomplish the desired formatting.
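For example, a tab-delimited variant only needs the two printf format strings changed; a sketch of that (same logic as above, with the result redirected to a new file — reshaped.dat is just an example name):

awk '
{ printf "%s", $1                                    # 1st column (ID) as-is
  for (i=2;i<=NF;i+=2) {                             # two columns at a time
      if (FNR==1)
         printf "\t%s", substr($i,1,index($i,"_")-1) # header: strip the "_xxxx" suffix
      else
         printf "\t%s%s", $i, $(i+1)                 # data: glue the allele pair
  }
  print ""
}
' raw.dat > reshaped.dat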
CodePudding user response:
The basic approach is to emulate what FPAT does, but in a more portable fashion. This code has been tested and confirmed working on gawk 5.1.1 (including the -ce/-Pe flags), mawk 1.3.4, mawk 1.9.9.6, and macOS 12.3 nawk. It produces:
ID, kgp11274425 kgp5732633 rs707 kgp75
1 CT GT CA 00
2 CC TG AA GT
3 AA GG CG AA
4 GG CC AA TA
[mng]awk 'BEGIN {
     FS = "[_][^_]+[_][^,]+[,]?[ " (substr(\
          OFS = "\t\t", _+=_=_~_)) "]*"
    print $( (getline) < -(NF=NF) )
     __ = (__="[^ \t]+") substr(_=substr(___="&\6\3",_) \
          "[ \t]+", index(_, "[") ) __ "|_"
     FS = "^$"
 } gsub(__,___) gsub(_,"")' datafile.txt
- By using regex, it avoids having to manually cycle through the fields, even two at a time. \6\3 (ACK ETX) is just an ASCII-only SEP choice I prefer over SUBSEP (\034) or the null byte (\0).
- A 2nd advantage is that by having separate and independent logic to emulate FPAT, FS can be set directly to ^$, as the rationale for splitting up individual fields no longer applies ... which, hopefully, could help move things along, since OP did mention having several million columns.
- The caveat of this solution is that it doesn't attempt to have everything line up in a perfect manner, using \t\t (2 horizontal tabs) as a crude solution.
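For readability, here is a de-obfuscated sketch of the same idea; this is a reconstruction, not the author's exact code, and it assumes the blank-separated layout of the sample above:

mawk 'BEGIN {
   SEP = "\6\3"                               # the ACK ETX marker pair
   FS  = "[_][^_]+[_][^,]+[,]?[ \t]*"         # header: each "_A, name_HET," run is a separator
   OFS = "\t\t"
   getline                                    # read the header line
   NF  = NF                                   # force $0 to be rebuilt with OFS
   print                                      # emit the collapsed header
   FS  = "^$"                                 # from here on, never split fields
 }
 {
   gsub(/[^ \t]+[ \t]+[^ \t]+/, "&" SEP)      # mark the end of every 2-token group
   gsub(SEP "[ \t]+", "")                     # drop each marker plus the blanks after it
   print
 }' raw.dat

The first gsub tags every pair of whitespace-separated tokens with the marker; the second deletes each marker together with the blank run after it, which joins the next token onto the preceding one and yields the paired-allele layout shown above.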
Synthetic 5 GB file benchmark: 1 min 14 secs for a 5.12 GB file, with a read-in throughput rate of 70.5 MB/s:
 out9: 2.54GiB 0:01:14 [34.9MiB/s] [34.9MiB/s] [====================================>] 102%
  in0: 5.12GiB 0:01:14 [70.5MiB/s] [70.5MiB/s] [====================================>] 100%

( pvE 0.1 in0 < testcase_dna_002.txt | gawk -be ; )
      73.44s user 2.20s system 101% cpu 1:14.37 total
nice pv -pteba -i 0.25 -r --size='2540m' -cN out9 > /dev/null
      0.24s user 0.73s system 1% cpu 1:14.37 total