I have a DataFrame of users information like Name, Mail, Birthday, Genre... but the dataframe have duplicate users with important different info in both rows. Also the problem that there are different users with the same Name. For example:
Original Dataframe
Name Mail Birthday Genre Subscription Age Comments
0 A [email protected] 1-1-1990 M Y 33 -
1 B None NaT F N NaN -
2 C [email protected] 1-1-1985 M Y 38 -
3 D None 1-1-1980 I N 43 -
4 B None 1-1-1995 I N 27 -
5 D [email protected] NaT M Y NaN -
6 B [email protected] NaT I Y NaN -
7 C [email protected] 1-1-1970 F N 53 -
And I want something like that:
Name Mail Birthday Genre Subscription Age Comments
0 A [email protected] 1-1-1990 M Y 33 -
1 B [email protected] 1-1-1995 F Y 27 -
2 C [email protected] 1-1-1985 M Y 38 -
3 D [email protected] 1-1-1980 M Y 43 -
4 C2 [email protected] 1-1-1970 F N 53 -
It's possible to for example group by name and combine the rows keeping all the information in one row, and if in both rows in a column have the same value keep it. And also if 2 rows of different values have the same name, create a row for the other user with the new name.
for example an intermediate df:
Name Mail Birthday Genre Subscription Age Comments
0 A [email protected] 1-1-1990 M Y 33 -
1 B [email protected] 1-1-1995 F Y 27 -
2 C [[email protected], [1-1-1985, [M, [Y, [38, [-,
[email protected]] 1-1-1970] F] N] 53] -]
3 D [email protected] 1-1-1980 M Y 43 -
I tried with groupby function, or with join function but did not work correctly for me.
Thanks.
CodePudding user response:
You can combine values with replace values per folumnc for exclude with aggregate by all columns without Name
by custom function with reusing variable by walrus operator :=
with convert ene element lists to scalars and empty lists after remove missing values and duplicates to NaN
s:
f = lambda x: y if (len(y:=list(dict.fromkeys(x.dropna()))) > 1)
else y[0]
if len(y) != 0
else np.nan
df = (df.replace({'Mail':{'None':np.nan},
'Birthday':{'NaT':np.nan},
'Genre':{'I':np.nan},
'Subscription':{'N':np.nan}})
.set_index('Name')
.groupby('Name')
.agg(f))
print (df)
Mail Birthday Genre Subscription \
Name
A [email protected] 1-1-1990 M Y
B [email protected] 1-1-1995 F Y
C [[email protected], [email protected]] [1-1-1985, 1-1-1970] [M, F] Y
D [email protected] 1-1-1980 M Y
Age Comments
Name
A 33.0 -
B 27.0 -
C [38.0, 53.0] -
D 43.0 -