incompatible index of inserted column with frame index with group by and count-CodePudding

I have data that looks like this:

CHROM    POS REF ALT  ...  is_sever_int is_sever_str is_sever_f encoding_str
0  chr1  14907   A   G  ...             1            1        one          one
1  chr1  14930   A   G  ...             1            1        one          one

These are the columns that I'm interested to perform calculations on (example) :

is_severe       snp _id            encoding 

       1             1                   one
       1             1                    two
       0             1                    one
       1             2                     two
       0             2                      two
       0             2                       one

what I want to do is to count for each snp_id and severe_id how many ones and twos are in the encoding column :

snp_id        is_svere       encoding_one         encoding_two
  1               1             1                        1
  1               0             1                        0
  2                1             0                       1
  2                0            1                         1

I tried this :

df.groupby(["snp_id","is_sever_f","encoding_str"])["encoding_str"].count()

but it gave the error :

 incompatible index of inserted column with frame index

then i tried this:

df["count"]=df.groupby(["snp_id","is_sever_f","encoding_str"],as_index=False)["encoding_str"].count()

and it returned:

Expected a 1D array, got an array with shape (2532831, 3)

how can i fix this? thank you:)

CodePudding user response：

Let's try groupby with whole columns and get size of each group then unstack the encoding index.

out = (df.groupby(['is_severe', 'snp_id', 'encoding']).size()
       .unstack(fill_value=0)
       .add_prefix('encoding_')
       .reset_index())

print(out)

encoding  is_severe  snp_id  encoding_one  encoding_two
0                 0       1             1             0
1                 0       2             1             1
2                 1       1             1             1
3                 1       2             0             1

CodePudding user response：

Try as follows:

Use pd.get_dummies to convert categorical data in column encoding into indicator variables.
Chain df.groupby and get sum to turn double rows per group into one row (i.e. [0,1] and [1,0] will become [1,1] where df.snp_id == 2 and df.is_severe == 0).

res = pd.get_dummies(data=df, columns=['encoding'])\
    .groupby(['snp_id','is_severe'], as_index=False, sort=False).sum()
    
print(res)

   snp_id  is_severe  encoding_one  encoding_two
0       1          1             1             1
1       1          0             1             0
2       2          1             0             1
3       2          0             1             1

If your actual df has more columns, limit the assigment to the data parameter inside get_dummies. I.e. use:

res = pd.get_dummies(data=df[['is_severe', 'snp_id', 'encoding']], 
                     columns=['encoding']).groupby(['snp_id','is_severe'], 
                                                   as_index=False, sort=False)\
                                                   .sum()