Home > database >  How can i combine and pull apart rows of a DataFrame?
How can i combine and pull apart rows of a DataFrame?

Time:01-11

I have a DataFrame of users information like Name, Mail, Birthday, Genre... but the dataframe have duplicate users with important different info in both rows. Also the problem that there are different users with the same Name. For example:

Original Dataframe

     Name Mail          Birthday Genre Subscription Age  Comments
0    A    [email protected]   1-1-1990 M     Y            33   -
1    B    None          NaT      F     N            NaN  -
2    C    [email protected]   1-1-1985 M     Y            38   -
3    D    None          1-1-1980 I     N            43   - 
4    B    None          1-1-1995 I     N            27   -
5    D    [email protected]   NaT      M     Y            NaN  -
6    B    [email protected]   NaT      I     Y            NaN  -
7    C    [email protected]  1-1-1970 F     N            53   -

And I want something like that:

     Name Mail          Birthday Genre Subscription Age Comments
0    A    [email protected]   1-1-1990 M     Y            33  -
1    B    [email protected]   1-1-1995 F     Y            27  -
2    C    [email protected]   1-1-1985 M     Y            38  -
3    D    [email protected]   1-1-1980 M     Y            43  -
4    C2   [email protected]  1-1-1970 F     N            53  -

It's possible to for example group by name and combine the rows keeping all the information in one row, and if in both rows in a column have the same value keep it. And also if 2 rows of different values have the same name, create a row for the other user with the new name.

for example an intermediate df:

     Name Mail            Birthday   Genre Subscription Age   Comments
0    A    [email protected]     1-1-1990   M     Y            33    -
1    B    [email protected]     1-1-1995   F     Y            27    -
2    C    [[email protected],   [1-1-1985, [M,   [Y,          [38,  [-,
           [email protected]]   1-1-1970]  F]    N]           53]   -]
3    D    [email protected]     1-1-1980   M     Y            43    -

I tried with groupby function, or with join function but did not work correctly for me.

Thanks.

CodePudding user response:

You can combine values with replace values per folumnc for exclude with aggregate by all columns without Name by custom function with reusing variable by walrus operator := with convert ene element lists to scalars and empty lists after remove missing values and duplicates to NaNs:

f = lambda x: y if (len(y:=list(dict.fromkeys(x.dropna()))) > 1) 
                else y[0] 
                if len(y) != 0 
                else np.nan

df = (df.replace({'Mail':{'None':np.nan},
                  'Birthday':{'NaT':np.nan},
                  'Genre':{'I':np.nan}, 
                  'Subscription':{'N':np.nan}})
        .set_index('Name')
        .groupby('Name')
        .agg(f))
print (df)
                             Mail              Birthday   Genre Subscription  \
Name                                                                           
A                     [email protected]              1-1-1990       M            Y   
B                     [email protected]              1-1-1995       F            Y   
C     [[email protected], [email protected]]  [1-1-1985, 1-1-1970]  [M, F]            Y   
D                     [email protected]              1-1-1980       M            Y   

               Age Comments  
Name                         
A             33.0        -  
B             27.0        -  
C     [38.0, 53.0]        -  
D             43.0        -  
  • Related