Apply encoding with get _dummies but i need other column value to be printed instead of 1 and 0-CodePudding

I have a column A,B,C,D.
A column has value x1,x2,x3,x4,x5.
Create a column x1,x2,x3,x4,x5 and print 1 if B,C,D has a duplication.

Please provide an answer using pyspark or python pandas.

Input

A   B   C   D  status_color
X1  a   b   c   red
X2  a   a   b   green
X3  a   a   b    red
X4  a   b   c   green

Output

B   C   D   X1  X2  X3  X4
a   b   c   red 0   0   green
a   a   b   0   green   red 0

I tried to find duplicate of column and then create a column duplicate flag which prints status_color if other column are duplicated.

df['duplicate_flag']=df.duplicated(subset['B','C','D'])

My problem here I don't know to compare it with column A and print it in X1,X2,X3,X4

With the help of get_dummies it printed a duplicate value column as 1,0 but I need a status color column value to be printed instead of 1 and 0. Can anyone help me with this using python?

CodePudding user response：

IIUC you could try it like this:

out = (
    df
    .join(
        pd.get_dummies(df.pop('A'))
        .mul(df.pop('status_color'),axis=0)
    )
    .groupby(['B', 'C', 'D'])
    .max()
    .reset_index()
)
print(out)

   B  C  D   X1     X2   X3     X4
0  a  a  b       green  red       
1  a  b  c  red              green

get_dummies will create the columns of all values of A. By multiplying with column status_color the 1's get changed to the the color. Then groupby ['B', 'C', 'D'] and aggregating the rows to one single row by using max

CodePudding user response：

Assuming we are aiming to achieve "output", I think you can use df.pivot for this:

import pandas as pd

data = {'A': {0: 'X1', 1: 'X2', 2: 'X3', 3: 'X4'}, 
        'B': {0: 'a', 1: 'a', 2: 'a', 3: 'a'}, 
        'C': {0: 'b', 1: 'a', 2: 'a', 3: 'b'}, 
        'D': {0: 'c', 1: 'b', 2: 'b', 3: 'c'}, 
        'status_color': {0: 'red', 1: 'green', 2: 'red', 3: 'green'}}
df = pd.DataFrame(data)

res = df.pivot(index=[*'BCD'], columns='A', values='status_color').reset_index()
res.columns.name = None
res.fillna(0, inplace=True)

print(res)

   B  C  D   X1     X2   X3     X4
0  a  a  b    0  green  red      0
1  a  b  c  red      0    0  green