I have a dataframe df
like the one below:
import pandas as pd
data = {'A': ['ABCD_1', 'ABCD_1', 'ABCD_1', 'ABCD_1', 'PQRS_2', 'PQRS_2', 'PQRS_2', 'PQRS_2', 'PQRS_2'], 'P':[1, 0, 0, 0, 0, 1, 0, 0, 0]}
df = pd.DataFrame(data)
df
| A | P |
------------ -----------
| ABCD_1 | 1 |
| ABCD_1 | 0 |
| ABCD_1 | 0 |
| ABCD_1 | 0 |
| PQRS_2 | 0 |
| PQRS_2 | 1 |
| PQRS_2 | 0 |
| PQRS_2 | 0 |
| PQRS_2 | 0 |
------------ -----------
I want to group the rows by the strings in column A of df
and assign each group's number to a new column G in the same dataframe df
using pandas (Python). I'm expecting output like the table below.
| A | P | G |
------------ ----------- ------------
| ABCD_1 | 1 | |
| ABCD_1 | 0 | |
| ABCD_1 | 0 | 1 |
| ABCD_1 | 0 | |
------------|-----------|------------
| PQRS_2 | 0 | |
| PQRS_2 | 1 | |
| PQRS_2 | 0 | 2 |
| PQRS_2 | 0 | |
| PQRS_2 | 0 | |
------------ ----------- ------------
So how can I achieve this using pandas and Python? Thanks!
CodePudding user response:
You can use pandas.Series.str.split
and then pandas.DataFrame.set_index
to get the desired output.
Creating data
data = {'A': ['ABCD_1', 'ABCD_1', 'ABCD_1', 'ABCD_1', 'PQRS_2', 'PQRS_2', 'PQRS_2', 'PQRS_2', 'PQRS_2'], 'P':[1, 0, 0, 0, 0, 1, 0, 0, 0]}
df = pd.DataFrame(data)
Pre-processing
# Take the part after the underscore ('ABCD_1' -> '1') as the group label
df['G'] = df['A'].str.split('_').str.get(1)
Generating output
df.set_index(['G', 'A'])
Output
This gives us the expected output:
          P
G A
1 ABCD_1  1
  ABCD_1  0
  ABCD_1  0
  ABCD_1  0
2 PQRS_2  0
  PQRS_2  1
  PQRS_2  0
  PQRS_2  0
  PQRS_2  0
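To see why the split step yields the group label, here is a minimal sketch of the intermediate values on a shortened series. The variable names (parts, suffix, as_int) are only illustrative, and the conversion to int is an optional extra that the steps above do not perform.

```python
import pandas as pd

s = pd.Series(['ABCD_1', 'ABCD_1', 'PQRS_2'])

# Each element becomes a list of the pieces around '_', e.g. ['ABCD', '1']
parts = s.str.split('_')

# Pick the second piece of every list; the result is still a string
suffix = parts.str.get(1)

# Optional (an assumption, not part of the steps above): numeric labels
as_int = suffix.astype(int)

print(parts.tolist())   # [['ABCD', '1'], ['ABCD', '1'], ['PQRS', '2']]
print(suffix.tolist())  # ['1', '1', '2']
print(as_int.tolist())  # [1, 1, 2]
```

Note that this only works when every value in A actually contains an underscore followed by the group number.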
Alternatively:
You can also use pandas.factorize
to obtain a numeric representation of an array when all you need is to identify its distinct values.
# factorize returns (codes, uniques); codes start at 0, so add 1
df['G'] = pd.factorize(df['A'])[0] + 1
df.set_index(['G', 'A'])
This gives you the same output :
          P
G A
1 ABCD_1  1
  ABCD_1  0
  ABCD_1  0
  ABCD_1  0
2 PQRS_2  0
  PQRS_2  1
  PQRS_2  0
  PQRS_2  0
  PQRS_2  0
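For reference, here is a small sketch of what pandas.factorize returns, which may make the [0] and the added 1 easier to follow. The sample values are made up for illustration; they deliberately repeat a label out of order to show that codes are assigned by first appearance.

```python
import pandas as pd

values = pd.Series(['ABCD_1', 'ABCD_1', 'PQRS_2', 'PQRS_2', 'ABCD_1'])

# factorize returns a pair: integer codes and the distinct values
codes, uniques = pd.factorize(values)

print(codes)          # [0 0 1 1 0]
print(list(uniques))  # ['ABCD_1', 'PQRS_2']

# Codes are zero-based, so shift by one to get labels starting at 1
labels = codes + 1
print(labels)         # [1 1 2 2 1]
```

Unlike the split approach, this does not depend on the suffix inside the string; the labels simply follow the order in which the distinct values first occur.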
EDIT: As we discussed, you don't need G as an index; you want a single instance of the value in column G for each group,
with the other, duplicated values left out.
We can replace all duplicated values in the column with an empty string. We can use pandas.Series.duplicated,
which returns a boolean Series marking every value that repeats an earlier one.
df['G'] = pd.factorize(df['A'])[0] + 1
# astype returns a new Series, so assign it back; strings let us blank out repeats
df['G'] = df['G'].astype(str)
df.loc[df['G'].duplicated(), 'G'] = ''
This gives us the expected output :
        A  P  G
0  ABCD_1  1  1
1  ABCD_1  0
2  ABCD_1  0
3  ABCD_1  0
4  PQRS_2  0  2
5  PQRS_2  1
6  PQRS_2  0
7  PQRS_2  0
8  PQRS_2  0
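To see what the duplicated mask looks like before it is used for the assignment, here is a minimal sketch on a standalone series. The series g and the name mask are illustrative only.

```python
import pandas as pd

g = pd.Series(['1', '1', '1', '2', '2'])

# True for every value that has already appeared earlier in the series
mask = g.duplicated()
print(mask.tolist())  # [False, True, True, False, True]

# Blank out the repeats, keeping only the first occurrence of each label
g.loc[mask] = ''
print(g.tolist())     # ['1', '', '', '2', '']
```

By default duplicated keeps the first occurrence, so the label lands on the first row of each group; passing keep='last' would place it on the last row instead.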