Pandas: Merge two string columns in Python, remove duplicated strings and remove unwanted string unl-CodePudding

I'm trying to merge two string columns and I wish to get rid of 'others' if the counter value is a 'non-others' value - like 'apple' 'others' = 'apple' but 'others' 'others' = 'others'. I managed the 2nd condition but how can I accommodate the two conditions on the merge?

data = {'fruit1':["organge", "apple", "organge", "organge", "others"],
        'fruit2':["apple", "others", "organge", "watermelon", "others"]}
df = pd.DataFrame(data)

df["together"] = df["fruit1"]   ' '   df["fruit2"]
df["together"] = df["together"].apply(lambda x: ' '.join(pd.unique(x.split())))

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others        apple others
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

Expected output:

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

CodePudding user response：

You want to replace only one "others", so simple join and then use str.replace once:

df["together"] = (df["fruit1"]   " "   df["fruit2"]).str.replace("others", "", n=1).str.strip()
print(df)

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge     organge organge
3  organge  watermelon  organge watermelon
4   others      others              others

The n parameter specifies the number of replacements to be made, from the documentation:

n int, default -1 (all)
Number of replacements to make from start.

UPDATE

To also remove duplicates use the following regular expression:

df["together"] = df["together"].str.replace(r"\b(\w )\s \1\b", r"\1", n=1, regex=True).str.strip()
print(df)

Output

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

See here an explanation of the regex.

CodePudding user response：

You can replace others by NaN and dropna() during join followed by replacing empty string by a single others:

df["together"] = (df[['fruit1', 'fruit2']].replace('others', np.nan)
                   .apply(lambda x: ' '.join(pd.unique(x.dropna())), axis=1)
                   .replace('', 'others')
                 )

Or leveraging the str.replace with n=1 by @Dani (caution: won't work if ther are 3 columns to aggregate; may leave 2 instances of others) and combining the remove duplicate logics of OP, as follows:

df["together"] = (df["fruit1"]   " "   df["fruit2"]).str.replace("others", "", n=1).apply(lambda x: ' '.join(pd.unique(x.split())))

Result:

print(df)

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

CodePudding user response：

def merge_columns(df, col1, col2, new_col, unwanted_string):
    '''Merge two string columns and replace unwanted string with existing string'''
    df[new_col] = df[col1].astype(str)   df[col2].astype(str)
    df[new_col] = df[new_col].str.replace(unwanted_string, '')
    return df

CodePudding user response：

You can do it in a lambda function as follows (copy-pasting it will work):

df.together = df.together.apply(lambda x: x if 'others' not in x else ('others' if all([y == '' for y in x.split('others')]) else x.replace('others', '').strip()))

Giving you:

    fruit1  fruit2      together
0   organge apple       organge apple
1   apple   others      apple
2   organge organge     organge
3   organge watermelon  organge watermelon
4   others  others      others

CodePudding user response：

You can set the resulting value based on an if-statement and the fruit_x value. To do so, I suggest to replace() others with None which makes conditional checking and re-replacing very easy. Call .apply() with axis=1 to a perform row-wise operation which I implemented in a separate function concat_strings for sake of readability. In addition, I would chain all required operations in a single statement in order to not change the original data set.

A very basic approach could look like this:

import pandas as pd


def concat_strings(row):
    fruit_1 = row['fruit1']
    fruit_2 = row['fruit2']

    if fruit_1 == fruit_2:
        return fruit_1
    elif fruit_1 and fruit_2:
        return fruit_1   ' '   fruit_2
    elif fruit_1:
        return fruit_1
    elif fruit_2:
        return fruit_2


# create dataframe
data = {
    'fruit1': ["organge", "apple", "organge", "organge", "others"],
    'fruit2': ["apple", "others", "organge", "watermelon", "others"]
}
df = pd.DataFrame(data)

# replace "others" with None to use as boolean later
# concat strings
# replace None values with "others" to get desired output
df["together"] = (
    df
    .replace({"others": None})
    .apply(concat_strings, axis=1)
    .replace({None: "others"})
)

# print final results
print(df)

Resulting output:

      fruit1      fruit2            together
  0  organge       apple       organge apple
  1    apple      others               apple
  2  organge     organge             organge
  3  organge  watermelon  organge watermelon
  4   others      others              others