How can I count locally by a custom condition in dataframe python3-CodePudding

I have a dataframe called DF which contains two types of information: datetime and sentence(string).

0   2019-02-01  point say give choice invest motor today money...
1   2019-02-01  get inside car drive drunk excuse bad driving ...
2   2019-02-01  look car snow know buy car snow
3   2019-02-01  drive home car day terrify experience stay least
4   2019-02-01  quid way ferry nice trip enjoy land list celeb...
... ... ...
35818   2021-09-30  choice life drive type car holiday type carava...
35819   2021-09-30  scarlet carson bloody marvellous big car lover...
35820   2021-09-30  podcast adriano great episode dude weird car d...
35821   2021-09-30  scarlet carson smugly cruise traffic know driv...
35822   2021-09-30  hornet know fuel shortage brexit destroy suppl...

Now I generate a word list to seek whether the sentence contains these string:

word_list=['drive','car','buy','fuel','electric','panic','tax','second hand','petrol','auto']

I only need to count once if the word in word list appears in the sentence, here comes my solution

set_list=[]
for word in word_list:
    for sentence in DF['new_processed_text']:
        if word in sentence:
            set_list.append(sentence)
count=len(set(set_list))

However, this will work for the whole dataset, and I want to do the process by day.

I have no ideas about dataframe.groupby, should I need that?

CodePudding user response：

You can use a regular expression, pd.DataFrame.loc method, pd.Series.value_counts() method and pandas string methods for your purposes:

In aggregate:

len(df.loc[df['new_processed_text'].str.contains('|'.join(word_list), regex=True)].index)

Processing per day:

df.loc[df['new_processed_text'].str.contains('|'.join(word_list), regex=True), 'date_column'].value_counts()

CodePudding user response：

You can remove duplicates first and then use the string methods of pandas Series objects.

import pandas as pd

s = pd.Series(['abc def', 'def xyz ijk', 'xyz ijk', 'abc def', 'abc def', 'ijk mn', 'def xyz'])

words = ['abc', 'xyz']
s_prime = s.drop_duplicates()
contains_word = s_prime.str.contains("|".join(words))
print(contains_word.sum())

In your case, s = DF['new_processed_text'] and words = word_list.

I forgot the per day part, which @Vladimir Vilimaitis provides in his answer with his last line.