I am wondering if there is a way to search values from a list in a concatenated column without un-concatenate the column.
Look at the following table as an example:
ID | Aleatory Stuff |
---|---|
1 | banana, rice, potato |
2 | cucumber, breakfast |
3 | Heaven, Bucket, test |
4 | Ball, Pan, Candy, shirt |
5 | draw, cellphone, bubble |
I'd like to create another column based on a list, as follows:
list_0 = ["Ball", "cellphone", "breakfast", "rice", "test"]
new_df:
ID | List Stuff |
---|---|
1 | rice |
2 | breakfast |
3 | test |
4 | Ball |
5 | cellphone |
Is there any possibility?
CodePudding user response:
Use extractall
with aggregation:
import re
pattern = '|'.join(map(re.escape, list_0))
# 'Ball|cellphone|breakfast|rice|test'
df['List Stuff'] = (df['Aleatory Stuff'].str.extractall(f'({pattern})')
[0].groupby(level=0).agg(', '.join)
)
NB. if you want a case insensitive match add flags=re.I
as parameter to extractall
.
Output:
ID Aleatory Stuff List Stuff
0 1 banana, rice, potato rice
1 2 cucumber, breakfast breakfast
2 3 Heaven, Bucket, test test
3 4 Ball, Pan, Candy, shirt Ball
4 5 draw, cellphone, bubble cellphone
CodePudding user response:
with str.split
and a listcomp :
df["List Stuff"] = [x for l in df.pop("Aleatory Stuff").str.split(", ")
for x in l if x in list_0]
Output :
print(df)
ID List Stuff
0 1 rice
1 2 breakfast
2 3 test
3 4 Ball
4 5 cellphone
CodePudding user response:
You can try something like this. You can split the string column Aleatory stuff
using str.split(", ")
where ", "
is the separator between words in your case. Once the items are splited, you can use set.intersection(set)
to find the common items between each row and your list.
You can check my code with an example similar to yours:
import pandas as pd
test_data = [
{"id": 1, "aleatory_stuff": "banana, rice, potato"},
{"id": 2, "aleatory_stuff": "cucumber, breakfast"}
]
test_list = ["rice", "cucumber"]
test_df = pd.DataFrame(test_data)
test_df["list_stuff"] = test_df["aleatory_stuff"].str.split(", ").apply(lambda x: list(set(x).intersection(set(test_list)))[0])
print(test_df)
id aleatory_stuff list_stuff
0 1 banana, rice, potato rice
1 2 cucumber, breakfast cucumber
CodePudding user response:
Let's use str.findall
to find all occurrences of matching regex pattern
pat = r'(?i)\b(%s)\b' % '|'.join(list_0)
df['List Stuff'] = df['Aleatory Stuff'].str.findall(pat).str.join(',')
ID Aleatory Stuff List Stuff
0 1 banana, rice, potato rice
1 2 cucumber, breakfast breakfast
2 3 Heaven, Bucket, test test
3 4 Ball, Pan, Candy, shirt Ball
4 5 draw, cellphone, bubble cellphone
PS: The regex flag (?i)
is used for case insensitive match