I have the following dataframe:
id        phone     email
10352897
10352897  10225967
10352897            [email protected]
10352897  10225967  [email protected]
          10225967
          10225967  [email protected]
                    [email protected]
23578910
23578910  38256789
23578910            [email protected]
23578910  38256789  [email protected]
          38256789
          38256789  [email protected]
                    [email protected]
          65287930  [email protected]
          65287930
                    [email protected]
          65287930
          70203065
          70203065
          70203065
                    [email protected]
                    [email protected]
                    [email protected]
Not every field is filled in on every row, but the records are related to each other through at least one column.
Whenever two records coincide in at least one of the three columns, I would like them merged into a single record, preferring filled fields over empty ones. For this example I would expect the following output:
id        phone     email
10352897  10225967  [email protected]
23578910  38256789  [email protected]
          65287930  [email protected]
          70203065
                    [email protected]
How would you go about doing this?
CodePudding user response:
This is quite a specific requirement and I'm not aware of any built-in pandas function that does what you're looking for, so I tried to define exactly what you are looking to do and recreate it from scratch.
The best I could come up with is this: reading from the top of the dataframe, we look at the values in each of the columns until we find a row whose value in some column differs from the previously encountered value in that column, at which point we put all the values encountered so far into a row of the new dataframe.
This looks like the following (assuming your original dataframe is named df and the empty cells are blank strings ''):
import pandas as pd

new_rows = []
# Create a dictionary whose keys are the columns of the original df
new_row = {c: '' for c in df.columns}

# Iterate over rows
for _, row in df.iterrows():
    # Iterate over columns in the row
    for col in row.keys():
        # If this column is not blank
        if row[col]:
            # If this column has already been filled in the accumulated row
            # and the value is different, flush the accumulated row into the
            # new dataframe and start a fresh one
            if new_row[col] and new_row[col] != row[col]:
                new_rows.append(new_row)
                new_row = {c: '' for c in df.columns}
            # Record this value in the accumulated row (this also carries the
            # value that triggered a split into the fresh row, so it is not lost)
            new_row[col] = row[col]

# Add the last accumulated row
new_rows.append(new_row)

new_df = pd.DataFrame(new_rows)
print(new_df)
However, 70203065 and [email protected] end up in the same row in the new dataframe:
         id     phone                email
0  10352897  10225967    [email protected]
1  23578910  38256789    [email protected]
2            65287930  [email protected]
3            70203065     [email protected]
You may need to think about what the logic is that causes 70203065 and [email protected] to end up in separate rows, but hopefully this gets you started in the right direction.
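One possible way to encode that logic, staying with the row-accumulation idea above, is to flush only when the incoming row shares no filled value with the row being accumulated. This is just a sketch under the same blank-strings assumption, not a hardened solution:

# Flush the accumulated row whenever the incoming row coincides with it
# in none of the columns, per the rule stated in the question
new_rows = []
new_row = {c: '' for c in df.columns}
for _, row in df.iterrows():
    filled = [c for c in df.columns if row[c]]
    # Does this row coincide with the accumulated row in at least one column?
    overlap = any(new_row[c] == row[c] for c in filled if new_row[c])
    started = any(new_row.values())
    if started and not overlap:
        new_rows.append(new_row)
        new_row = {c: '' for c in df.columns}
    for c in filled:
        new_row[c] = row[c]
new_rows.append(new_row)
new_df = pd.DataFrame(new_rows)

On the sample data this splits 70203065 and [email protected] into separate rows and reproduces the expected output, though it still only merges rows that are adjacent in the original dataframe.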
CodePudding user response:
If the goal is solely to obtain the desired output, one approach would be to get a dataframe with the unique values of each column of the original dataframe df, and for that, one can use pandas.DataFrame.apply with a custom lambda function as follows:

# One Series of unique, non-null values per column;
# shorter columns are padded with NaN
df_new = df.apply(lambda x: pd.Series(x.unique()[~pd.isnull(x.unique())]))
[Out]:

           id       phone                email
0  10352897.0  10225967.0    [email protected]
1  23578910.0  38256789.0    [email protected]
2         NaN  65287930.0  [email protected]
3         NaN  70203065.0     [email protected]
Then, even though there is room to implement a validation checking whether, for a given unique value, the rest of the columns actually match in df, we won't consider that here, as in this specific case the first three rows are already correct. One can simply duplicate the last row

df_new = pd.concat([df_new, df_new.iloc[-1:]], ignore_index=True)

and then adjust the respective values to obtain the desired output (this assumes numpy imported as np):

df_new.iloc[-2, 2] = np.nan  # 70203065's row has no email
df_new.iloc[-1, 1] = np.nan  # [email protected]'s row has no phone
[Out]:

           id       phone                email
0  10352897.0  10225967.0    [email protected]
1  23578910.0  38256789.0    [email protected]
2         NaN  65287930.0  [email protected]
3         NaN  70203065.0                  NaN
4         NaN         NaN     [email protected]
Notes:
Even though the final part is not the most elegant and requires some "manual" work (duplicating a row and manually changing cell values), it works for OP's specific case.
There are strong opinions on using .apply(). For that, one might want to read this.
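If one prefers to avoid .apply() here, a plain dict comprehension builds the same unique-values frame (a sketch; pandas pads the shorter columns with NaN automatically):

# Equivalent construction without .apply(): one Series of unique,
# non-null values per column
df_new = pd.DataFrame({col: pd.Series(df[col].dropna().unique())
                       for col in df.columns})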
CodePudding user response:
Here is one way to do it.
Please note: the third group of rows in your question (the 65287930 / [email protected] rows) doesn't follow your stated pattern of coinciding in at least one of the three columns, since consecutive rows there don't always share a value.
import numpy as np

# Compare each row with the next one and flag rows where
# all three columns have changed
df['keep'] = (df['id'].ne(df['id'].shift(-1)) &
              df['phone'].ne(df['phone'].shift(-1)) &
              df['email'].ne(df['email'].shift(-1)))

# Map the flag to 1/NaN, then cumsum and backfill to assign
# a group number to every row
df['chng'] = df['keep'].map({False: np.nan, True: 1})
df['chng'] = df['chng'].cumsum().bfill()

# Group by the group number and forward-fill values within each group
df = df.groupby('chng', as_index=True).ffill()

# Keep only the last (now filled-in) row of each group
out = df.loc[df['keep'] == True].fillna('')[['id', 'phone', 'email']]
out
            id       phone                email
6   10352897.0  10225967.0    [email protected]
13  23578910.0  38256789.0    [email protected]
15              65287930.0  [email protected]
16                          [email protected]
17              65287930.0
20              70203065.0
23                             [email protected]
Though, if you run the above code again on this output, it further consolidates the middle three rows (indices 15, 16 and 17) into your expected result set:
            id       phone                email
6   10352897.0  10225967.0    [email protected]
13  23578910.0  38256789.0    [email protected]
17              65287930.0  [email protected]
20              70203065.0
23                             [email protected]