List search in pandas text column-CodePudding

I have a dataset with two columns: date and text. The text column contains unstructured information. I have a list of city names to search for in a text column.

I need to get two sets of data:

list_city = [New York, Los Angeles, Chicago]

When all records from the list with a text message match with the dataframe lines

Sample example:

df_1
data         text
06-02-2022   New York, Los Angeles, Chicago, Phoenix
05-02-2022   New York, Houston, Phoenix
04-02-2022   San Antonio, San Diego, Jacksonville

Need result df_1_res:

df_1_res
data         text
06-02-2022   New York, Los Angeles, Chicago, Phoenix

I tried this design, it works, but it doesn't look very nice:

df_1_res= df_1.loc[df_1["text"].str.contains(list_city[0]) & df_1["text"].str.contains(list_city[1]) & df_1["text"].str.contains(list_city[2])]

When at least one value from the list matches the text in the dataframe lines

Sample example:

df_2
data         text
06-02-2022   New York, Los Angeles, Chicago, Phoenix
05-02-2022   New York, Houston, Phoenix
04-02-2022   San Antonio, San Diego, Jacksonville

Need result df_2_res:

df_2_res
data         text
06-02-2022   New York, Los Angeles, Chicago, Phoenix
05-02-2022   New York, Houston, Phoenix

I tried this design, it works, but it doesn't look very nice:

df_2_res= df_2.loc[df_1["text"].str.contains(list_city[0]) | df_2["text"].str.contains(list_city[1]) | df_2["text"].str.contains(list_city[2])]

How can it be improved? Since it is planned to change the number of cities in the filtering list.

CodePudding user response：

here is one way to do it

For case # 1 : AND Condition

# use re.IGNORECASE to make findall case insensitive

import re
(df_1.loc[df_1['text'].str
        .findall(r'|'.join(list_city),  re.IGNORECASE)
        .apply(lambda x: len(x)).eq(len(list_city))])

data    text
0   06-02-2022  New York, Los Angeles, Chicago, Phoenix

CASE #2 : OR Condition


#create an OR condition using join
# filter using loc

df_2.loc[df_1['text'].str.contains(r'|'.join(list_city))]

    data        text
0   06-02-2022  New York, Los Angeles, Chicago, Phoenix
1   05-02-2022  New York, Houston, Phoenix

CodePudding user response：

Try using the isin() function

Output:

CodePudding user response：

Solution with pandas.Series.str.count

list_city = ['New York', 'Los Angeles', 'Chicago']
print('\nfirst task - ALL')
df1 = df[df['text'].str.count(r'|'.join(list_city)).eq(len(list_city))]
print(df1)

print('\nsecond task - ANY')
df1 = df[df['text'].str.count(r'|'.join(list_city)).gt(0)]
print(df1)

first task - ALL
         data                                     text
0  06-02-2022  New York, Los Angeles, Chicago, Phoenix

second task - ANY
         data                                     text
0  06-02-2022  New York, Los Angeles, Chicago, Phoenix
1  05-02-2022               New York, Houston, Phoenix