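I have a DataFrame with columns A, B, C and D. Column A holds the values X1, X2, X3, X4, X5. I want to create one new column for each value of A (X1, X2, X3, X4, X5) and print 1 in it when that value of A has the row's combination of B, C, D, so that duplicated B, C, D combinations collapse into a single row.
Please provide an answer using PySpark or Python pandas.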
Input
A B C D
X1 a b c
X2 a a b
X3 a a b
X4 a b c
Output
B C D X1 X2 X3 X4
a b c 1 0 0 1
a a b 0 1 1 0
I tried to find the duplicated rows and then create a duplicate_flag column that is set when the other columns are duplicated: df['duplicate_flag'] = df.duplicated(subset=['B', 'C', 'D'])
My problem is that I don't know how to compare this with column A and print the result in X1, X2, X3, X4.
Can anyone help with Python? I am new to Python.
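For reference, a minimal sketch of the attempt described above, assuming the sample input is loaded into a DataFrame named df. The keep=False argument is an addition not in the original attempt; it marks every member of a duplicated B, C, D group rather than only the later ones. The flag identifies duplicated rows but does not by itself spread column A into X1..X4 columns.

```python
import pandas as pd

# Sample input from the question
df = pd.DataFrame({
    "A": ["X1", "X2", "X3", "X4"],
    "B": ["a", "a", "a", "a"],
    "C": ["b", "a", "a", "b"],
    "D": ["c", "b", "b", "c"],
})

# keep=False flags every row whose (B, C, D) combination occurs more than once
df["duplicate_flag"] = (
    df.duplicated(subset=["B", "C", "D"], keep=False).astype(int)
)
print(df)
```

In this sample every B, C, D combination occurs twice, so all four rows are flagged.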
CodePudding user response:
Use groupby with str.get_dummies:
group = df.groupby(["B", "C", "D"], sort=False).agg("|".join)
res = group["A"].str.get_dummies().reset_index()
print(res)
Output
B C D X1 X2 X3 X4
0 a b c 1 0 0 1
1 a a b 0 1 1 0
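To run this approach end to end, here is a minimal self-contained sketch using the question's sample data. Selecting column A before aggregating is a small variation on the snippet above, chosen so the join only touches the string column; the variable names joined and res are illustrative.

```python
import pandas as pd

# Sample input from the question
df = pd.DataFrame({
    "A": ["X1", "X2", "X3", "X4"],
    "B": ["a", "a", "a", "a"],
    "C": ["b", "a", "a", "b"],
    "D": ["c", "b", "b", "c"],
})

# Join the A values of each (B, C, D) group into one "|"-separated string;
# sort=False keeps the groups in order of first appearance
joined = df.groupby(["B", "C", "D"], sort=False)["A"].agg("|".join)

# Split on "|" (the default separator) into one 0/1 indicator column per A value
res = joined.str.get_dummies().reset_index()
print(res)
```

Because get_dummies produces indicators rather than counts, a value of A that appears more than once in the same group still yields 1.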
CodePudding user response:
Use pandas.crosstab:
out = (pd.crosstab([df['B'], df['C'], df['D']], df['A'])
         .clip(upper=1)  # only if you expect duplicates
         .reset_index().rename_axis(columns=None)
      )
output:
B C D X1 X2 X3 X4
0 a a b 0 1 1 0
1 a b c 1 0 0 1
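To illustrate why the clip step matters, here is a small sketch that assumes the input contains a fully repeated row (the extra X1 row is hypothetical and not part of the question's data). crosstab counts occurrences, so without clipping the repeated row would produce 2 instead of 1.

```python
import pandas as pd

# Question's sample data plus one hypothetical repeated row (X1, a, b, c)
df = pd.DataFrame({
    "A": ["X1", "X2", "X3", "X4", "X1"],
    "B": ["a", "a", "a", "a", "a"],
    "C": ["b", "a", "a", "b", "b"],
    "D": ["c", "b", "b", "c", "c"],
})

# Raw counts: how often each A value occurs per (B, C, D) combination
counts = pd.crosstab([df["B"], df["C"], df["D"]], df["A"])

# Cap counts at 1 to turn them into 0/1 flags, then flatten the index
out = (
    counts.clip(upper=1)
          .reset_index()
          .rename_axis(columns=None)
)
print(out)
```

If each A, B, C, D row is guaranteed to be unique, the clip call can be dropped.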