I have a Python dataframe with multiple rows and columns, a sample of which I have shared below -
| DocName | Content |
| --- | --- |
| Doc1 | Hi how you are doing ? Hope you are well. I hear the food is great! |
| Doc2 | The food is great. James loves his food. You not so much right ? |
| Doc3 | Yeah he is alright. |
I also have a list of 100 words as follows -
words = ["food", "you", ...]
Now, I need to extract the top N rows with the most frequent occurrences of each word from the list in the "Content" column. For the given sample of data,
"food" occurs twice in Doc2 and once in Doc1.
"you" occurs twice in Doc1 and once in Doc2.
Hence, the desired output is:
{"food": ["Doc2", "Doc1"], "you": ["Doc1", "Doc2"], ...}
where N = 2 (the top 2 rows having the most frequent occurrence of each word).
I have tried something as follows but am unsure how to move further -
words = ["food", "you", ...]
result = []
for word in words:
    result.append(df["Content"].apply(lambda row: row.count(word)))
How can I implement an efficient solution to the above requirement in Python ?
CodePudding user response:
It seems like this problem can be broken down into two sub-problems:
- Get the frequency of words per "Content" cell
- For each word in the list, extract the top N rows
Luckily, the first sub-problem has many neat approaches, as shown here. TL;DR: use the collections library (Counter) to do a frequency count; or, if you aren't allowed to import libraries, call .split() and count in a loop. But again, there are many potential solutions.
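A minimal sketch of that first step, assuming the sample dataframe from the question (the regex tokenizer is my own choice here, so that trailing punctuation such as "great!" does not produce separate tokens; plain .split() would keep it attached):
import re
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "DocName": ["Doc1", "Doc2", "Doc3"],
    "Content": [
        "Hi how you are doing ? Hope you are well. I hear the food is great!",
        "The food is great. James loves his food. You not so much right ?",
        "Yeah he is alright.",
    ],
})

# One frequency Counter of lower-cased tokens per "Content" cell.
word_counts = df["Content"].map(lambda text: Counter(re.findall(r"\w+", text.lower())))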
The second sub-problem is a bit trickier. From the first step, what we now have is a dictionary of frequency counts per row. To get to the desired answer, the naive method would be to "query" every row's dictionary for the word in question, e.g. run
doc1.dict["food"]
doc2.dict["food"]
...
and compare the results in order.
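Continuing the sketch above (word_counts and df are the illustrative names from that snippet; the word list and N = 2 are taken from the question), one way to do that comparison per word is heapq.nlargest:
import heapq

words = ["food", "you"]  # the question's list of 100 words, truncated here
N = 2

top_docs = {}
for word in words:
    # Rank every document by how often this word occurs and keep the N highest.
    ranked = heapq.nlargest(
        N,
        zip(df["DocName"], word_counts),
        key=lambda pair: pair[1][word],
    )
    # Drop documents in which the word does not occur at all.
    top_docs[word] = [doc for doc, counts in ranked if counts[word] > 0]

print(top_docs)  # {'food': ['Doc2', 'Doc1'], 'you': ['Doc1', 'Doc2']}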
That should be enough to get going, and there is also room to find more streamlined/elegant solutions. Best of luck!
CodePudding user response:
With df being your dataframe, you could try something like:
words = ["food", "you"]
res = (
    df
    .assign(Content=df["Content"].str.casefold().str.findall(r"\w+")
                                 .map(set(words).intersection))
    .explode("Content")
    .dropna()
    .groupby("Content").agg(list)
    .to_dict()["DocName"]
)
The result for the sample dataframe
DocName Content
0 Doc1 Hi how you are doing ? Hope you are well. I hear the food is great!
1 Doc2 The food is great. James loves his food. You not so much right ?
2 Doc3 Yeah he is alright.
is
{'food': ['Doc1', 'Doc2'], 'you': ['Doc1', 'Doc2']}
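If the per-word lists also need to be ordered by how often the word occurs and cut to the top N (as in the question's desired output), one possible variant of the same idea is sketched below; this is my own extension rather than part of the answer above, reusing df and words but keeping the counts instead of collapsing them into a set:
N = 2
counts = (
    df
    .assign(word=df["Content"].str.casefold().str.findall(r"\w+"))
    .explode("word")
    .loc[lambda d: d["word"].isin(words)]
    .groupby(["word", "DocName"]).size()  # occurrences of each listed word per document
    .sort_values(ascending=False)         # most frequent first
)
res = {
    word: grp.head(N).index.get_level_values("DocName").tolist()
    for word, grp in counts.groupby(level="word")
}
# res should give {'food': ['Doc2', 'Doc1'], 'you': ['Doc1', 'Doc2']}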