Assign groups in pandas dataframe based on column

I have a DataFrame df like the one below:

import pandas as pd

data = {'A': ['ABCD_1', 'ABCD_1', 'ABCD_1', 'ABCD_1', 'PQRS_2', 'PQRS_2', 'PQRS_2', 'PQRS_2', 'PQRS_2'], 'P':[1, 0, 0, 0, 0, 1, 0, 0, 0]}
df = pd.DataFrame(data)
df


         |     A      |    P      | 
          ------------ ----------- 
         |    ABCD_1  |    1      | 
         |    ABCD_1  |    0      |  
         |    ABCD_1  |    0      |  
         |    ABCD_1  |    0      | 
         |    PQRS_2  |    0      | 
         |    PQRS_2  |    1      | 
         |    PQRS_2  |    0      | 
         |    PQRS_2  |    0      |
         |    PQRS_2  |    0      |
          ------------ -----------

I want to group the rows by the string in column A and assign each group's number to a new column G in the same DataFrame df. I'm expecting output like the one below.

     |     A      |    P      |     G      |
      ------------ ----------- ------------ 
     |    ABCD_1  |    1      |            |
     |    ABCD_1  |    0      |            |
     |    ABCD_1  |    0      |      1     |
     |    ABCD_1  |    0      |            |
      ------------ ----------- ------------ 
     |    PQRS_2  |    0      |            |
     |    PQRS_2  |    1      |            |
     |    PQRS_2  |    0      |      2     |
     |    PQRS_2  |    0      |            |
     |    PQRS_2  |    0      |            |
      ------------ ----------- ------------

So how can I achieve this using pandas and Python? Thanks!

CodePudding user response：

You can use pandas.Series.str.split and then pandas.DataFrame.set_index to get the desired output.

Creating data

import pandas as pd

data = {'A': ['ABCD_1', 'ABCD_1', 'ABCD_1', 'ABCD_1', 'PQRS_2', 'PQRS_2', 'PQRS_2', 'PQRS_2', 'PQRS_2'], 'P':[1, 0, 0, 0, 0, 1, 0, 0, 0]}
df = pd.DataFrame(data)

Pre-processing

# Take the part after the underscore ('ABCD_1' -> '1') as the group label
df['G'] = df['A'].str.split('_').str.get(1)

Generating output

df.set_index(['G', 'A'])

Output

This gives us the expected output:

          P
G A        
1 ABCD_1  1
  ABCD_1  0
  ABCD_1  0
  ABCD_1  0
2 PQRS_2  0
  PQRS_2  1
  PQRS_2  0
  PQRS_2  0
  PQRS_2  0
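Note that the split gives G as strings ('1', '2'). If you need real integers, here is a minimal sketch, assuming every value in A ends in an underscore followed by digits. Using rsplit with n=1 is my own choice, so that labels containing more than one underscore still work; the shortened data is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['ABCD_1', 'ABCD_1', 'PQRS_2', 'PQRS_2', 'PQRS_2'],
    'P': [1, 0, 0, 1, 0],
})

# Split once from the right so only the trailing suffix is separated,
# then take the last piece and convert it from string to integer.
suffix = df['A'].str.rsplit('_', n=1).str.get(-1)
df['G'] = suffix.astype(int)

print(df['G'].tolist())  # [1, 1, 2, 2, 2]
```

This will raise an error if any suffix is not numeric, which is probably what you want for catching bad labels early.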

Alternatively:

You can also use pandas.factorize to obtain a numeric representation of an array when all we need is to tell the distinct values apart.

# factorize codes start at 0, so add 1 to number the groups from 1
df['G'] = pd.factorize(df['A'])[0] + 1
df.set_index(['G', 'A'])

This gives you the same output:

          P
G A        
1 ABCD_1  1
  ABCD_1  0
  ABCD_1  0
  ABCD_1  0
2 PQRS_2  0
  PQRS_2  1
  PQRS_2  0
  PQRS_2  0
  PQRS_2  0
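To see what factorize actually returns, here is a small sketch on a made-up Series (the values are only for illustration). It returns a pair: an array of integer codes and the unique values, numbered in order of first appearance.

```python
import pandas as pd

labels = pd.Series(['ABCD_1', 'ABCD_1', 'PQRS_2', 'ABCD_1', 'PQRS_2'])

# codes[i] is the position of labels[i] within uniques
codes, uniques = pd.factorize(labels)

# Shift by one so that group numbering starts at 1 instead of 0
group_ids = codes + 1

print(codes.tolist())      # [0, 0, 1, 0, 1]
print(list(uniques))       # ['ABCD_1', 'PQRS_2']
print(group_ids.tolist())  # [1, 1, 2, 1, 2]
```

Because the numbering follows the order of first appearance, it matches the suffix in A only when the labels happen to show up in that order, as they do in your data.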

EDIT: As we discussed, you don't want G as an index; you want the value of G to appear once per group instead of being repeated on every row.

We can replace all duplicated values in the column with an empty string. pandas.Series.duplicated returns a boolean Series that marks every value that has already appeared earlier, so only the first row of each group keeps its label.

df['G'] = pd.factorize(df['A'])[0] + 1
# astype returns a new Series, so assign it back
df['G'] = df['G'].astype('str')
df.loc[df['G'].duplicated(), 'G'] = ''

This gives us the expected output:

        A  P  G
0  ABCD_1  1  1
1  ABCD_1  0   
2  ABCD_1  0   
3  ABCD_1  0   
4  PQRS_2  0  2
5  PQRS_2  1   
6  PQRS_2  0   
7  PQRS_2  0   
8  PQRS_2  0
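If you want the whole thing in one place, the steps can be sketched as a small function. The helper name label_first_row_of_each_group and its parameters are hypothetical, not part of pandas, and I used Series.mask instead of the .loc assignment purely as a matter of taste.

```python
import pandas as pd


def label_first_row_of_each_group(df, key='A', out='G'):
    # Hypothetical helper: the name and parameters are illustrative only.
    result = df.copy()  # leave the caller's DataFrame untouched
    # Number the groups 1, 2, ... in order of first appearance.
    ids = pd.Series(pd.factorize(result[key])[0] + 1, index=result.index)
    # Convert to strings, then blank every row whose key was seen before.
    result[out] = ids.astype(str).mask(result[key].duplicated(), '')
    return result


data = {'A': ['ABCD_1', 'ABCD_1', 'ABCD_1', 'ABCD_1',
              'PQRS_2', 'PQRS_2', 'PQRS_2', 'PQRS_2', 'PQRS_2'],
        'P': [1, 0, 0, 0, 0, 1, 0, 0, 0]}
df = pd.DataFrame(data)

labelled = label_first_row_of_each_group(df)
print(labelled)
```

Keep in mind that G now holds strings, so this form is suited to display or export rather than further numeric work.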