Search values in a concatenated column based on a list-CodePudding

I am wondering if there is a way to search values from a list in a concatenated column without un-concatenate the column.

Look at the following table as an example:

ID	Aleatory Stuff
1	banana, rice, potato
2	cucumber, breakfast
3	Heaven, Bucket, test
4	Ball, Pan, Candy, shirt
5	draw, cellphone, bubble

I'd like to create another column based on a list, as follows:

list_0 = ["Ball", "cellphone", "breakfast", "rice", "test"]

new_df:

ID	List Stuff
1	rice
2	breakfast
3	test
4	Ball
5	cellphone

Is there any possibility?

CodePudding user response：

Use extractall with aggregation:

import re

pattern = '|'.join(map(re.escape, list_0))
# 'Ball|cellphone|breakfast|rice|test'

df['List Stuff'] = (df['Aleatory Stuff'].str.extractall(f'({pattern})')
                    [0].groupby(level=0).agg(', '.join)
                   )

NB. if you want a case insensitive match add flags=re.I as parameter to extractall.

Output:

   ID           Aleatory Stuff List Stuff
0   1     banana, rice, potato       rice
1   2      cucumber, breakfast  breakfast
2   3     Heaven, Bucket, test       test
3   4  Ball, Pan, Candy, shirt       Ball
4   5  draw, cellphone, bubble  cellphone

CodePudding user response：

with str.split and a listcomp :

df["List Stuff"] = [x for l in df.pop("Aleatory Stuff").str.split(", ")
                    for x in l if x in list_0]

Output :

print(df)

   ID List Stuff
0   1       rice
1   2  breakfast
2   3       test
3   4       Ball
4   5  cellphone

CodePudding user response：

You can try something like this. You can split the string column Aleatory stuff using str.split(", ") where ", " is the separator between words in your case. Once the items are splited, you can use set.intersection(set) to find the common items between each row and your list.

You can check my code with an example similar to yours:

import pandas as pd 

test_data = [
    {"id": 1, "aleatory_stuff": "banana, rice, potato"},
    {"id": 2, "aleatory_stuff": "cucumber, breakfast"}
]

test_list = ["rice", "cucumber"]

test_df = pd.DataFrame(test_data)
test_df["list_stuff"] = test_df["aleatory_stuff"].str.split(", ").apply(lambda x: list(set(x).intersection(set(test_list)))[0])

print(test_df)

   id        aleatory_stuff list_stuff
0   1  banana, rice, potato       rice
1   2   cucumber, breakfast   cucumber

CodePudding user response：

Let's use str.findall to find all occurrences of matching regex pattern

pat = r'(?i)\b(%s)\b' % '|'.join(list_0)
df['List Stuff'] = df['Aleatory Stuff'].str.findall(pat).str.join(',')

   ID           Aleatory Stuff List Stuff
0   1     banana, rice, potato       rice
1   2      cucumber, breakfast  breakfast
2   3     Heaven, Bucket, test       test
3   4  Ball, Pan, Candy, shirt       Ball
4   5  draw, cellphone, bubble  cellphone

PS: The regex flag (?i) is used for case insensitive match