I have a column A,B,C,D
.
A column has value x1,x2,x3,x4,x5
.
Create a column x1,x2,x3,x4,x5
and print 1
if B,C,D
has a duplication.
Please provide an answer using pyspark or python pandas.
Input
A B C D status_color
X1 a b c red
X2 a a b green
X3 a a b red
X4 a b c green
Output
B C D X1 X2 X3 X4
a b c red 0 0 green
a a b 0 green red 0
I tried to find duplicate of column and then create a column duplicate flag which prints status_color if other column are duplicated.
df['duplicate_flag']=df.duplicated(subset['B','C','D'])
My problem here I don't know to compare it with column A
and print it in X1,X2,X3,X4
With the help of get_dummies
it printed a duplicate value column as 1,0 but I need a status color column value to be printed instead of 1 and 0. Can anyone help me with this using python?
CodePudding user response:
IIUC you could try it like this:
out = (
df
.join(
pd.get_dummies(df.pop('A'))
.mul(df.pop('status_color'),axis=0)
)
.groupby(['B', 'C', 'D'])
.max()
.reset_index()
)
print(out)
B C D X1 X2 X3 X4
0 a a b green red
1 a b c red green
get_dummies
will create the columns of all values of A
. By multiplying with column status_color
the 1's get changed to the the color. Then groupby ['B', 'C', 'D']
and aggregating the rows to one single row by using max
CodePudding user response:
Assuming we are aiming to achieve "output", I think you can use df.pivot
for this:
import pandas as pd
data = {'A': {0: 'X1', 1: 'X2', 2: 'X3', 3: 'X4'},
'B': {0: 'a', 1: 'a', 2: 'a', 3: 'a'},
'C': {0: 'b', 1: 'a', 2: 'a', 3: 'b'},
'D': {0: 'c', 1: 'b', 2: 'b', 3: 'c'},
'status_color': {0: 'red', 1: 'green', 2: 'red', 3: 'green'}}
df = pd.DataFrame(data)
res = df.pivot(index=[*'BCD'], columns='A', values='status_color').reset_index()
res.columns.name = None
res.fillna(0, inplace=True)
print(res)
B C D X1 X2 X3 X4
0 a a b 0 green red 0
1 a b c red 0 0 green