How to extract the top N rows from a dataframe with most frequent occurrences of a word in a list?


I have a pandas dataframe with multiple rows and columns, a sample of which I have shared below -

DocName Content
Doc1 Hi how you are doing ? Hope you are well. I hear the food is great!
Doc2 The food is great. James loves his food. You not so much right ?
Doc3 Yeah he is alright.

I also have a list of 100 words as follows -

words = ["food", "you", ....]

Now, I need to extract the top N rows with the most frequent occurrences of each word from the list in the "Content" column. For the given sample of data,

"food" occurs twice in Doc2 and once in Doc1.

"you" occurs twice in Doc 1 and once in Doc 2.

Hence, desired output is :

[food:[doc2, doc1], you:[doc1, doc2], .....]

where N = 2 (the top 2 rows having the most frequent occurrence of each word)

I have tried the following, but am unsure how to proceed -

words = ["food", "you", ....]
result = []

for word in words:
    result.append(df.Content.apply(lambda row: row.count(word)))

How can I implement an efficient solution to the above requirement in Python?

CodePudding user response:

It seems like this problem can be broken down into two sub-problems:

  1. Get the frequency of words per "Content" cell
  2. For each word in the list, extract the top N rows

Luckily, the first sub-problem has many neat approaches (one is sketched below). TL;DR: use the collections library to do a frequency count; or, if you aren't allowed to import libraries, call ".split()" and count in a loop. But again, there are many potential solutions.
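For illustration, here is a minimal sketch of sub-problem 1 using collections.Counter, assuming the sample dataframe from the question (the column names "DocName" and "Content" are taken from the post):

import re
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "DocName": ["Doc1", "Doc2", "Doc3"],
    "Content": [
        "Hi how you are doing ? Hope you are well. I hear the food is great!",
        "The food is great. James loves his food. You not so much right ?",
        "Yeah he is alright.",
    ],
})

# One Counter per row, mapping each lower-cased word to its frequency in that cell.
counts = df["Content"].map(lambda text: Counter(re.findall(r"\w+", text.lower())))

print(counts[1]["food"])  # 2 -- "food" appears twice in Doc2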

The second sub-problem is a bit trickier. From the first step, what we now have is a dictionary of frequency counts per row. To get to the desired answer, the naive method would be to "query" every dictionary for the word in question, e.g. run

doc1.dict["food"]
doc2.dict["food"]
...

and compare the results in order.
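Sketching that idea out (reusing the df and counts objects from the snippet above; words and N are the inputs described in the question), one possible way to assemble the top N documents per word is:

words = ["food", "you"]
N = 2

result = {}
for word in words:
    # Pair each document name with that word's frequency in the corresponding row.
    freqs = list(zip(df["DocName"], (c[word] for c in counts)))
    # Keep only documents where the word occurs, sort by frequency (highest first),
    # and take the top N document names.
    top = sorted((f for f in freqs if f[1] > 0), key=lambda f: f[1], reverse=True)[:N]
    result[word] = [doc for doc, _ in top]

print(result)  # {'food': ['Doc2', 'Doc1'], 'you': ['Doc1', 'Doc2']}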

That should be enough to get going, and there is plenty of opportunity to find more streamlined/elegant solutions. Best of luck!

CodePudding user response:

With df being your dataframe, you could try something like:

words = ["food", "you"]
res = (
    df
    .assign(Content=df["Content"].str.casefold().str.findall(r"\w+")
                    .map(set(words).intersection))
    .explode("Content")
    .dropna()
    .groupby("Content").agg(list)
    .to_dict()["DocName"]
)

Result for the sample dataframe

  DocName                                                              Content
0    Doc1  Hi how you are doing ? Hope you are well. I hear the food is great!
1    Doc2  The food is great. James loves his food. You not so much right ?
2    Doc3  Yeah he is alright.

is

{'food': ['Doc1', 'Doc2'], 'you': ['Doc1', 'Doc2']}
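
Note that this groups every matching document under each word, but it does not rank them by how often the word occurs, since the intersection with a set discards the counts. If the top-N-by-frequency ordering from the question is needed, a hedged variant along the same lines (keeping the counts and then taking the N most frequent documents per word; words and N are the question's inputs) could look like this:

words = ["food", "you"]
wanted = set(words)
N = 2

res = (
    df.assign(Content=df["Content"].str.casefold().str.findall(r"\w+")
                      .map(lambda ws: [w for w in ws if w in wanted]))
    .explode("Content")
    .dropna(subset=["Content"])
    .groupby(["Content", "DocName"]).size()   # occurrences of each word per document
    .sort_values(ascending=False)
    .reset_index(name="count")
    .groupby("Content").head(N)               # N most frequent documents per word
    .groupby("Content")["DocName"].agg(list)
    .to_dict()
)

print(res)  # {'food': ['Doc2', 'Doc1'], 'you': ['Doc1', 'Doc2']}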