How to Add Two Columns of DataFrame and Rename it with Prefix Name using bash

Time:04-21

The original Data looks like

ID, kgp11274425_A, kgp11274425_HET, kgp5732633_C, kgp5732633_HET, rs707_G, rs707_HET, kgp75_T, kgp75_HET
1       C                T            G             T              C            A       0          0
2       C                C            T             G              A            A       G          T
3       A                A            G             G              C            G       A          A
4       G                G            C             C              A            A       T          A

Desired Output:

ID, kgp11274425  kgp5732633    rs707     kgp75
1      CT           GT           CA       00
2      CC           TG           AA       GT
3      AA           GG           CG       AA
4      GG           CC           AA       TA

I was able to accomplish this using the following Python script:

import pandas as pd

sep = '_'
unique_cols = pd.Index(map(lambda x: x.split(sep, 1)[0], df.columns)).unique()

results = []
columns = []
for col in unique_cols:
    my_cols = [x for x in df.columns if x.startswith(col)]
    results.append(df[my_cols].sum(axis=1).values)
    columns.append(col)

new_df = pd.DataFrame(results).T
new_df.columns = columns

But this time I have 522 rows & 5,311,137 columns (5 GB) of data, and Python is not able to read the file. So I need to run the same Python logic using bash commands. I am new to bash, please help.
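As a quick sanity check before processing, the column count the OP mentions can be confirmed from bash (a minimal sketch; it assumes the data lives in a whitespace-separated file, here called raw.dat):

```shell
# count the whitespace-separated fields in the header row, then stop reading
awk 'NR==1 { print NF; exit }' raw.dat
```

On the 4-sample file above this prints 9 (the ID column plus eight allele columns); awk streams the file line by line, so this works regardless of the file's size.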

CodePudding user response:

Input:

$ cat raw.dat
ID, kgp11274425_A, kgp11274425_HET, kgp5732633_C, kgp5732633_HET, rs707_G, rs707_HET, kgp75_T, kgp75_HET
1       C                T            G             T              C            A       0          0
2       C                C            T             G              A            A       G          T
3       A                A            G             G              C            G       A          A
4       G                G            C             C              A            A       T          A

One awk idea:

awk '
{ printf $1                                           # print 1st column
  for (i=2;i<=NF;i+=2) {                              # process columns 2 at a time
      if (FNR==1)                                     # 1st row? then ...
         printf " %s", substr($i,1,index($i,"_")-1)   # print ith column sans the '_xxxx' suffix
      else
         printf " %s%s", $i, $(i+1)                   # print the (i)th and (i+1)th columns
  }
  print ""
}
' raw.dat

This generates:

ID, kgp11274425 kgp5732633 rs707 kgp75
1 CT GT CA 00
2 CC TG AA GT
3 AA GG CG AA
4 GG CC AA TA

NOTE: OP's desired output does not appear to be single-space delimited, nor tab delimited, so I've opted for a single space; OP can modify the printf format strings to accomplish the desired formatting
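For instance, a tab-delimited variant (an assumption about the delimiter OP wants) only needs the two format strings changed; this sketch also swaps the bare `printf $1` for `printf "%s", $1` so a stray `%` in the data cannot be misread as a format specifier:

```shell
awk '
{ printf "%s", $1                                     # 1st column, printed literally
  for (i=2;i<=NF;i+=2) {                              # walk the fields two at a time
      if (FNR==1)
         printf "\t%s", substr($i,1,index($i,"_")-1)  # header: name sans the _xxxx suffix
      else
         printf "\t%s%s", $i, $(i+1)                  # data: join the allele pair
  }
  print ""
}
' raw.dat
```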

CodePudding user response:

The basic approach is to kinda emulate what FPAT does, but in a more portable fashion: this code has been tested and confirmed working on gawk 5.1.1 (including flags -ce/-Pe), mawk 1.3.4, mawk 1.9.9.6, and macOS 12.3 nawk.

ID, kgp11274425     kgp5732633      rs707       kgp75       
1       CT            GT              CA       00
2       CC            TG              AA       GT
3       AA            GG              CG       AA
4       GG            CC              AA       TA

[mng]awk 'BEGIN {

    FS = "[_][^_]+[_][^,]+[,]?[ " (substr(\
   OFS = "\t\t", _+=_=_~_))  "]*"

    print $( (getline) < -(NF=NF) )

    __ = (__="[^ \t]+") substr(_=substr(___="&\6\3",_) \
               "[ \t]+",  index(_, "[") ) __ "|_"
    FS = "^$"

 } gsub(__,___) + gsub(_,"")' datafile.txt
  1. By using regex, it avoids having to manually cycle through the fields, even two at a time. \6\3 (ACK ETX) is just an ASCII-only SEP choice I prefer over SUBSEP (\034) or the null byte (\0).

  2. A 2nd advantage is that by having separate and independent logic that attempts to emulate FPAT, FS can be set directly to ^$, as the rationale for splitting up individual fields is no longer applicable ...

    . . . which, hopefully, could help move things along, since OP did mention 5M+ columns.

  • The caveat of this solution is that it doesn't attempt to line everything up perfectly, using two horizontal tabs (\t\t) as a crude stand-in.
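For readers who want the regex trick without the golfing, here is a de-obfuscated sketch of the same pairing idea (my own rewrite, not the code above; \006 stands in for the ACK marker, and a single tab replaces the \t\t padding):

```shell
awk '
NR == 1 {
    # header: each "name_X, name_HET" run collapses to a tab,
    # leaving one prefix per column pair
    gsub(/_[^_]+_[^,]+,?[ \t]*/, "\t")
    sub(/\t$/, "")
    print
    next
}
{
    # data rows: greedily mark every two-token run, then delete each
    # marker together with the whitespace after it; since the ID takes
    # the first slot, the deletions glue the alleles into the right pairs
    gsub(/[^ \t]+[ \t]+[^ \t]+/, "&\006")
    gsub("\006[ \t]+", "")
    sub("\006$", "")
    print
}' datafile.txt
```

Like the original, this touches each record with a couple of gsub() passes instead of a per-field loop, at the cost of output that keeps whatever spacing the input had between pairs.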

Synthetic 5GB file benchmark: 1 min 14 secs for a 5.12GB file, with a read-in throughput rate of 70.5 MB/s

     out9: 2.54GiB 0:01:14 [34.9MiB/s] [34.9MiB/s]
           [====================================>] 102%
      in0: 5.12GiB 0:01:14 [70.5MiB/s] [70.5MiB/s]
           [====================================>] 100%            
( pvE 0.1 in0 < testcase_dna_002.txt | gawk -be ; )
  73.44s user 2.20s system 101% cpu 1:14.37 total

nice pv -pteba -i 0.25 -r --size='2540m' -cN out9 > /dev/null
  0.24s user 0.73s system 1% cpu 1:14.37 total