I have a dataset with two columns: date
and text
. The text
column contains unstructured information. I have a list of city names to search for in a text
column.
I need to get two sets of data:
list_city = [New York, Los Angeles, Chicago]
- When all records from the list with a text message match with the dataframe lines
Sample example:
df_1
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
05-02-2022 New York, Houston, Phoenix
04-02-2022 San Antonio, San Diego, Jacksonville
Need result df_1_res
:
df_1_res
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
I tried this design, it works, but it doesn't look very nice:
df_1_res= df_1.loc[df_1["text"].str.contains(list_city[0]) & df_1["text"].str.contains(list_city[1]) & df_1["text"].str.contains(list_city[2])]
- When at least one value from the list matches the text in the dataframe lines
Sample example:
df_2
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
05-02-2022 New York, Houston, Phoenix
04-02-2022 San Antonio, San Diego, Jacksonville
Need result df_2_res
:
df_2_res
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
05-02-2022 New York, Houston, Phoenix
I tried this design, it works, but it doesn't look very nice:
df_2_res= df_2.loc[df_1["text"].str.contains(list_city[0]) | df_2["text"].str.contains(list_city[1]) | df_2["text"].str.contains(list_city[2])]
How can it be improved? Since it is planned to change the number of cities in the filtering list.
CodePudding user response:
here is one way to do it
For case # 1 : AND Condition
# use re.IGNORECASE to make findall case insensitive
import re
(df_1.loc[df_1['text'].str
.findall(r'|'.join(list_city), re.IGNORECASE)
.apply(lambda x: len(x)).eq(len(list_city))])
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
CASE #2 : OR Condition
#create an OR condition using join
# filter using loc
df_2.loc[df_1['text'].str.contains(r'|'.join(list_city))]
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
1 05-02-2022 New York, Houston, Phoenix
CodePudding user response:
Output:
CodePudding user response:
Solution with pandas.Series.str.count
list_city = ['New York', 'Los Angeles', 'Chicago']
print('\nfirst task - ALL')
df1 = df[df['text'].str.count(r'|'.join(list_city)).eq(len(list_city))]
print(df1)
print('\nsecond task - ANY')
df1 = df[df['text'].str.count(r'|'.join(list_city)).gt(0)]
print(df1)
first task - ALL
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
second task - ANY
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
1 05-02-2022 New York, Houston, Phoenix