I have a data frame tweets_df
that looks like this:
sentiment id date text
0 0 1502071360117424136 2022-03-10 23:58:14 00:00 AngelaRaeBoon1 Same Alabama Republicans charge...
1 0 1502070916318121994 2022-03-10 23:56:28 00:00 This ’ w/the sentencing JussieSmollett But mad...
2 0 1502057466267377665 2022-03-10 23:03:01 00:00 DannyClayton Not hard find takes smallest amou...
3 0 1502053718711316512 2022-03-10 22:48:08 00:00 I make fake scenarios getting fights protectin...
4 0 1502045714486022146 2022-03-10 22:16:19 00:00 WipeHomophobia Well people lands wildest thing...
.. ... ... ... ...
94 0 1501702542899691525 2022-03-09 23:32:41 00:00 There 's reason deep look things kill bad peop...
95 0 1501700281729433606 2022-03-09 23:23:42 00:00 Shame UN United Dictators Shame NATO Repeat We...
96 0 1501699859803516934 2022-03-09 23:22:01 00:00 GayleKing The difference Ukrainian refugees IL...
97 0 1501697172441550848 2022-03-09 23:11:20 00:00 hrkbenowen And includes new United States I un...
98 0 1501696149853511687 2022-03-09 23:07:16 00:00 JLaw_OTD A world women minorities POC LGBTQ÷ d...
And the second dataFrame globe_df
that looks like this:
Country Region
0 Andorra Europe
1 United Arab Emirates Middle east
2 Afghanistan Asia & Pacific
3 Antigua and Barbuda South/Latin America
4 Anguilla South/Latin America
.. ... ...
243 Guernsey Europe
244 Isle of Man Europe
245 Jersey Europe
246 Saint Barthelemy South/Latin America
247 Saint Martin South/Latin America
I want to delete all rows of the dataframe tweets_df
which have 'text' that does not contain a 'Country' or 'Region'.
This was my attempt:
globe_df = pd.read_csv('countriesAndRegions.csv')
tweets_df = pd.read_csv('tweetSheet.csv')
for entry in globe_df['Country']:
tweet_index = tweets_df[entry in tweets_df['text']].index # if tweets that *contain*, not equal...... entry in tweets_df['text] .... (in)or (not in)?
tweets_df.drop(tweet_index , inplace=True)
print(tweets_df)
Edit: Also, fuzzy, case-insensitive matching with stemming would be preferred when searching the 'text' for countries and regions.
Ex) If the text contained 'Ukrainian', 'british', 'engliSH', etc... then it would not be deleted
CodePudding user response:
Convert country and region values to a list and use str.contains to filter out rows that do not contain these values.
#with case insensitive
vals=globe_df.stack().to_list()
tweets_df = tweets_df[tweets_df ['text'].str.contains('|'.join(vals), regex=True, case=False)]
or (with case insensitive)
vals="({})".format('|'.join(globe_df.stack().str.lower().to_list())) #make all letters lowercase
tweets_df['matched'] = tweets_df.text.str.lower().str.extract(vals, expand=False)
tweets_df = tweets_df.dropna()
CodePudding user response:
To check if the elements of one column of a DataFrame are contained by the elements of another column of another DataFrame, you can use the apply
method along with a lambda function to apply a check to each row of the first DataFrame.
Here's an example of how you can do this:
# First, create a boolean mask that checks if the 'text' column in the first DataFrame contains
# any of the 'Country' or 'Region' values from the second DataFrame
mask = tweets_df['text'].apply(lambda x: any(word in x for word in globe_df['Country'].values.tolist() globe_df['Region'].values.tolist()))
# Use the mask to filter rows in the first DataFrame that do not contain any of the values from the second DataFrame
filtered_df = tweets_df[mask]
The resulting filtered_df
DataFrame will contain only the rows from the tweets_df
DataFrame where the text column contains any of the Country
or Region
values from the globe_df DataFrame.
CodePudding user response:
# Import data
globe_df = pd.read_csv('countriesAndRegions.csv')
tweets_df = pd.read_csv('tweetSheet.csv')
# Get country and region column as list
globe_df_country = globe_df['Country'].values.tolist()
globe_df_region = globe_df['Region'].values.tolist()
# merge_lists, cause you want to check with or operator
merged_list = globe_df_country globe_df_region
# If you want to update df while iterating it, best way to do it with using copy df
df_tweets2 = tweets_df.copy()
for index,row in tweets_df.iterrows():
# Check if splitted row's text values are intersecting with merged_list
if [i for i in merged_list if i in row['text'].split()] == []:
df_tweets2 = df_tweets2.drop[index]
tweets_df_new = df_tweets2.copy()
print(tweets_df_new)
CodePudding user response:
You can try using pandas.Series.str.contains to find the values.
tweets_df[tweets_df['text'].contains('{}|{}'.format(entry['Country'],entry['Region'])]
And after creating a new column with boolean values, you can remove rows with the value True
.