How to find if elements of a column in a data frame are string-contained by the elements of a column-CodePudding

I have a data frame tweets_df that looks like this:

    sentiment                   id                       date                                               text
0           0  1502071360117424136  2022-03-10 23:58:14 00:00  AngelaRaeBoon1 Same Alabama Republicans charge...
1           0  1502070916318121994  2022-03-10 23:56:28 00:00  This ’ w/the sentencing JussieSmollett But mad...
2           0  1502057466267377665  2022-03-10 23:03:01 00:00  DannyClayton Not hard find takes smallest amou...
3           0  1502053718711316512  2022-03-10 22:48:08 00:00  I make fake scenarios getting fights protectin...
4           0  1502045714486022146  2022-03-10 22:16:19 00:00  WipeHomophobia Well people lands wildest thing...
..        ...                  ...                        ...                                                ...
94          0  1501702542899691525  2022-03-09 23:32:41 00:00  There 's reason deep look things kill bad peop...
95          0  1501700281729433606  2022-03-09 23:23:42 00:00  Shame UN United Dictators Shame NATO Repeat We...
96          0  1501699859803516934  2022-03-09 23:22:01 00:00  GayleKing The difference Ukrainian refugees IL...
97          0  1501697172441550848  2022-03-09 23:11:20 00:00  hrkbenowen And includes new United States I un...
98          0  1501696149853511687  2022-03-09 23:07:16 00:00  JLaw_OTD A world women minorities POC LGBTQ÷ d...

And the second dataFrame globe_df that looks like this:

                  Country               Region
0                 Andorra               Europe
1    United Arab Emirates          Middle east
2             Afghanistan       Asia & Pacific
3     Antigua and Barbuda  South/Latin America
4                Anguilla  South/Latin America
..                    ...                  ...
243              Guernsey               Europe
244           Isle of Man               Europe
245                Jersey               Europe
246      Saint Barthelemy  South/Latin America
247          Saint Martin  South/Latin America

I want to delete all rows of the dataframe tweets_df which have 'text' that does not contain a 'Country' or 'Region'.

This was my attempt:

globe_df = pd.read_csv('countriesAndRegions.csv')
tweets_df = pd.read_csv('tweetSheet.csv')


for entry in globe_df['Country']:
    tweet_index = tweets_df[entry in tweets_df['text']].index      # if tweets that *contain*, not equal...... entry in tweets_df['text] .... (in)or (not in)?
    tweets_df.drop(tweet_index , inplace=True)

print(tweets_df)

Edit: Also, fuzzy, case-insensitive matching with stemming would be preferred when searching the 'text' for countries and regions.

Ex) If the text contained 'Ukrainian', 'british', 'engliSH', etc... then it would not be deleted

CodePudding user response：

Convert country and region values to a list and use str.contains to filter out rows that do not contain these values.

#with case insensitive
vals=globe_df.stack().to_list()

tweets_df = tweets_df[tweets_df ['text'].str.contains('|'.join(vals), regex=True, case=False)]

or (with case insensitive)

vals="({})".format('|'.join(globe_df.stack().str.lower().to_list())) #make all letters lowercase
tweets_df['matched'] = tweets_df.text.str.lower().str.extract(vals, expand=False)
tweets_df = tweets_df.dropna()

CodePudding user response：

To check if the elements of one column of a DataFrame are contained by the elements of another column of another DataFrame, you can use the apply method along with a lambda function to apply a check to each row of the first DataFrame.

Here's an example of how you can do this:

# First, create a boolean mask that checks if the 'text' column in the first DataFrame contains
# any of the 'Country' or 'Region' values from the second DataFrame
mask = tweets_df['text'].apply(lambda x: any(word in x for word in globe_df['Country'].values.tolist()   globe_df['Region'].values.tolist()))

# Use the mask to filter rows in the first DataFrame that do not contain any of the values from the second DataFrame
filtered_df = tweets_df[mask]

The resulting filtered_df DataFrame will contain only the rows from the tweets_df DataFrame where the text column contains any of the Country or Region values from the globe_df DataFrame.

CodePudding user response：

# Import data
globe_df = pd.read_csv('countriesAndRegions.csv')
tweets_df = pd.read_csv('tweetSheet.csv')
# Get country and region column as list
globe_df_country = globe_df['Country'].values.tolist()
globe_df_region = globe_df['Region'].values.tolist()
# merge_lists, cause you want to check with or operator
merged_list = globe_df_country   globe_df_region
# If you want to update df while iterating it, best way to do it with using copy df
df_tweets2 = tweets_df.copy()
for index,row in tweets_df.iterrows():
    # Check if splitted row's text values are intersecting with merged_list
    if [i for i in merged_list if i in row['text'].split()] == []:
        df_tweets2 = df_tweets2.drop[index]
tweets_df_new = df_tweets2.copy()
print(tweets_df_new)

CodePudding user response：

You can try using pandas.Series.str.contains to find the values.

tweets_df[tweets_df['text'].contains('{}|{}'.format(entry['Country'],entry['Region'])]

And after creating a new column with boolean values, you can remove rows with the value True.