I need to display each distinct keyword and its frequency (count) from queries searched on Google. Then I need to select the top 10 words by frequency.
Input example:
site | query |
---|---|
google.com | https://www.google.com/search?q=shoe store in new york |
google.com | https://www.google.com/search?q=new york attractions |
Output example, if a DataFrame is used to display the results:
keyword | count |
---|---|
shoe | 1 |
store | 1 |
in | 1 |
new | 2 |
york | 2 |
attractions | 1 |
So I extracted the keywords from the queries, but I don't really know what to do next. I'd appreciate any help.
CodePudding user response:
Here is a function you can use to count the keywords in a URL that contains a query:
from collections import Counter
from urllib.parse import urlparse, parse_qs

def get_keywords_count(url):
    # Take the value of the 'q' parameter and count its whitespace-separated words
    return Counter(parse_qs(urlparse(url).query)['q'][0].split())
Example of usage:
>>> get_keywords_count('https://www.google.com/search?q=shoe store in new york')
Counter({'shoe': 1, 'store': 1, 'in': 1, 'new': 1, 'york': 1})
You can now use it with your dataframe to get the total count:
result = pd.DataFrame(
    df['query'].apply(get_keywords_count).sum().items(),
    columns=['keyword', 'count'],
)
>>> result
       keyword  count
0         shoe      1
1        store      1
2           in      1
3          new      2
4         york      2
5  attractions      1
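To get from the total count to the top 10 asked for in the question, one option is `Counter.most_common`. The following is a minimal sketch, not a definitive implementation: the sample `df` is rebuilt from the question's input table, the names `total` and `top10` are made up for illustration, and the use of `.get('q', [''])` to tolerate URLs without a `q` parameter is an assumption that goes beyond the function above.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

import pandas as pd

def get_keywords_count(url):
    # Assumption: a URL without a 'q' parameter contributes no keywords
    query_params = parse_qs(urlparse(url).query)
    return Counter(query_params.get('q', [''])[0].split())

# Sample data copied from the question's input table
df = pd.DataFrame({
    'site': ['google.com', 'google.com'],
    'query': [
        'https://www.google.com/search?q=shoe store in new york',
        'https://www.google.com/search?q=new york attractions',
    ],
})

# Add the per-row Counters together, starting from an empty Counter
total = sum(df['query'].map(get_keywords_count), Counter())

# most_common(10) returns up to 10 (keyword, count) pairs, highest count first
top10 = pd.DataFrame(total.most_common(10), columns=['keyword', 'count'])
print(top10)
```

With only six distinct keywords in the sample, `top10` contains all of them, with `new` and `york` (count 2) at the top.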
CodePudding user response:
I would use the pandas sort_values method.
For your example:
import pandas as pd

keyword_count_df = pd.DataFrame({
    'keyword': ['shoe', 'store', 'in', 'new', 'york', 'attractions'],
    'count': [1, 1, 1, 2, 2, 1]
})
# Sort by count, highest first, and keep the first 10 rows
keyword_count_df.sort_values('count', ascending=False).head(10)
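As a possible alternative to sorting and then taking the head, pandas also offers `DataFrame.nlargest`, which selects the rows with the largest values in a column directly. A short sketch using the same example data (the variable name `top10` is only illustrative):

```python
import pandas as pd

keyword_count_df = pd.DataFrame({
    'keyword': ['shoe', 'store', 'in', 'new', 'york', 'attractions'],
    'count': [1, 1, 1, 2, 2, 1]
})

# Select up to 10 rows with the largest 'count', highest first
top10 = keyword_count_df.nlargest(10, 'count')
print(top10)
```

When the frame has fewer than 10 rows, as here, all rows are returned in descending order of `count`.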
CodePudding user response:
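There are probably better ways to do this, but here's mine:
results = ['the', 'the', 'plane', 'bus', 'plane', 'light', 'the']

def countwords(results):
    worddict = {}
    for result in results:
        if result in worddict:
            # Word seen before: increment its count
            worddict[result] += 1
        else:
            # First occurrence of the word
            worddict[result] = 1
    return worddict

print(countwords(results))  # {'the': 3, 'plane': 2, 'bus': 1, 'light': 1}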
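The same counting can be done with `collections.Counter` from the standard library, which also gives the top entries directly. A minimal sketch using the same `results` list (the names `word_counts` and `top_words` are just for illustration):

```python
from collections import Counter

results = ['the', 'the', 'plane', 'bus', 'plane', 'light', 'the']

# Counter builds the word -> count mapping in one step
word_counts = Counter(results)

# Up to 10 (word, count) pairs, highest count first
top_words = word_counts.most_common(10)
print(top_words)
```

Here `the` appears three times and `plane` twice, so they come first in `top_words`.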