I'm trying to merge two string columns and I wish to get rid of 'others' if the counter value is a 'non-others' value - like 'apple' 'others' = 'apple' but 'others' 'others' = 'others'. I managed the 2nd condition but how can I accomdate the two conditions on the merge?
data = {'fruit1':["organge", "apple", "organge", "organge", "others"],
'fruit2':["apple", "others", "organge", "watermelon", "others"]}
df = pd.DataFrame(data)
df["together"] = df["fruit1"] ' ' df["fruit2"]
df["together"] = df["together"].apply(lambda x: ' '.join(pd.unique(x.split())))
fruit1 fruit2 together
0 organge apple organge apple
1 apple others apple others
2 organge organge organge
3 organge watermelon organge watermelon
4 others others others
Expected output:
fruit1 fruit2 together
0 organge apple organge apple
1 apple others apple
2 organge organge organge
3 organge watermelon organge watermelon
4 others others others
CodePudding user response:
You want to replace only one "others"
, so simple join and then use str.replace
once:
df["together"] = (df["fruit1"] " " df["fruit2"]).str.replace("others", "", n=1).str.strip()
print(df)
fruit1 fruit2 together
0 organge apple organge apple
1 apple others apple
2 organge organge organge organge
3 organge watermelon organge watermelon
4 others others others
The n
parameter specifies the number of replacements to be made, from the documentation:
n int, default -1 (all)
Number of replacements to make from start.
UPDATE
To also remove duplicates use the following regular expression:
df["together"] = df["together"].str.replace(r"\b(\w )\s \1\b", r"\1", n=1, regex=True).str.strip()
print(df)
Output
fruit1 fruit2 together
0 organge apple organge apple
1 apple others apple
2 organge organge organge
3 organge watermelon organge watermelon
4 others others others
See here an explanation of the regex.
CodePudding user response:
You can replace others
by NaN
and dropna()
during join
followed by replacing empty string by a single others
:
df["together"] = (df[['fruit1', 'fruit2']].replace('others', np.nan)
.apply(lambda x: ' '.join(pd.unique(x.dropna())), axis=1)
.replace('', 'others')
)
Or leveraging the str.replace
with n=1
by @Dani (caution: won't work if ther are 3 columns to aggregate; may leave 2 instances of others
) and combining the remove duplicate logics of OP, as follows:
df["together"] = (df["fruit1"] " " df["fruit2"]).str.replace("others", "", n=1).apply(lambda x: ' '.join(pd.unique(x.split())))
Result:
print(df)
fruit1 fruit2 together
0 organge apple organge apple
1 apple others apple
2 organge organge organge
3 organge watermelon organge watermelon
4 others others others
CodePudding user response:
def merge_columns(df, col1, col2, new_col, unwanted_string):
'''Merge two string columns and replace unwanted string with existing string'''
df[new_col] = df[col1].astype(str) df[col2].astype(str)
df[new_col] = df[new_col].str.replace(unwanted_string, '')
return df
CodePudding user response:
You can do it in a lambda function as follows (copy-pasting it will work):
df.together = df.together.apply(lambda x: x if 'others' not in x else ('others' if all([y == '' for y in x.split('others')]) else x.replace('others', '').strip()))
Giving you:
fruit1 fruit2 together
0 organge apple organge apple
1 apple others apple
2 organge organge organge
3 organge watermelon organge watermelon
4 others others others
CodePudding user response:
You can set the resulting value based on an if-statement and the fruit_x
value. To do so, I suggest to replace()
others
with None
which makes conditional checking and re-replacing very easy. Call .apply()
with axis=1
to a perform row-wise operation which I implemented in a separate function concat_strings
for sake of readability. In addition, I would chain all required operations in a single statement in order to not change the original data set.
A very basic approach could look like this:
import pandas as pd
def concat_strings(row):
fruit_1 = row['fruit1']
fruit_2 = row['fruit2']
if fruit_1 == fruit_2:
return fruit_1
elif fruit_1 and fruit_2:
return fruit_1 ' ' fruit_2
elif fruit_1:
return fruit_1
elif fruit_2:
return fruit_2
# create dataframe
data = {
'fruit1': ["organge", "apple", "organge", "organge", "others"],
'fruit2': ["apple", "others", "organge", "watermelon", "others"]
}
df = pd.DataFrame(data)
# replace "others" with None to use as boolean later
# concat strings
# replace None values with "others" to get desired output
df["together"] = (
df
.replace({"others": None})
.apply(concat_strings, axis=1)
.replace({None: "others"})
)
# print final results
print(df)
Resulting output:
fruit1 fruit2 together
0 organge apple organge apple
1 apple others apple
2 organge organge organge
3 organge watermelon organge watermelon
4 others others others