Extracting unique keywords and their counts from a dataframe column

I need to display each distinct keyword and its frequency (count) from queries searched on Google. Then I need to select the top 10 words by frequency.

Input example:

site	query
google.com	https://www.google.com/search?q=shoe store in new york
google.com	https://www.google.com/search?q=new york attractions

Output example if df is chosen for displaying results:

keyword	count
shoe	1
store	1
in	1
new	2
york	2
attractions	1

I have extracted the keywords from the queries, but I don't know what to do next. I'd appreciate any help.

CodePudding user response：

Here is a function you can use to count the keywords in a URL that contains a query:

from collections import Counter
from urllib.parse import urlparse, parse_qs

def get_keywords_count(url):
    # Take the first value of the 'q' parameter and count its whitespace-separated words.
    return Counter(parse_qs(urlparse(url).query)['q'][0].split())

Example of usage:

>>> get_keywords_count('https://www.google.com/search?q=shoe store in new york')
Counter({'shoe': 1, 'store': 1, 'in': 1, 'new': 1, 'york': 1})
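Real search URLs are usually percent-encoded and may lack a q parameter altogether, in which case the ['q'] lookup raises a KeyError. The following is a minimal sketch of a more forgiving variant; the name get_keywords_count_safe, the param argument, and the lowercasing step are assumptions made for illustration and are not part of the answer above.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

def get_keywords_count_safe(url, param='q'):
    """Count whitespace-separated keywords in the `param` value of a URL.

    Hypothetical helper: returns an empty Counter when the parameter
    is missing or empty instead of raising KeyError.
    """
    # parse_qs decodes '+' and '%20' to spaces and drops blank values by default.
    values = parse_qs(urlparse(url).query).get(param, [])
    if not values:
        return Counter()
    # Assumption: lowercase so that "New" and "new" count as the same keyword.
    return Counter(values[0].lower().split())

print(get_keywords_count_safe('https://www.google.com/search?q=New+york%20attractions'))
print(get_keywords_count_safe('https://www.google.com/'))
```

Whether to lowercase depends on whether case matters for your keywords; drop the .lower() call to keep the original casing.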

You can now use it with your dataframe to get the total count:

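import pandas as pd

# Adding Counters merges their counts, so .sum() yields the totals across all rows.
result = pd.DataFrame(
    df['query'].apply(get_keywords_count).sum().items(),
    columns=['keyword', 'count'],
)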

>>> result
       keyword  count
0         shoe      1
1        store      1
2           in      1
3          new      2
4         york      2
5  attractions      1
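The question also asks for the top 10 keywords. One possible way to get there, sketched below, is to accumulate everything into a single Counter and call its most_common method. The sample dataframe is rebuilt from the question's input, and the names total and top10 are chosen only for this illustration.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'site': ['google.com', 'google.com'],
    'query': [
        'https://www.google.com/search?q=shoe store in new york',
        'https://www.google.com/search?q=new york attractions',
    ],
})

total = Counter()
for url in df['query']:
    # Same extraction as get_keywords_count, repeated so this block runs on its own.
    total.update(parse_qs(urlparse(url).query)['q'][0].split())

# most_common(10) returns up to 10 (keyword, count) pairs, highest count first.
top10 = pd.DataFrame(total.most_common(10), columns=['keyword', 'count'])
print(top10)
```

Updating one Counter in a loop avoids creating an intermediate Counter per row, which may matter for a large dataframe.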

CodePudding user response：

I would use the sort_values pandas method.

For your example:

import pandas as pd

keyword_count_df = pd.DataFrame({
    'keyword':['shoe', 'store', 'in', 'new', 'york', 'attractions'],
    'count':[1,1,1,2,2,1]
})

keyword_count_df.sort_values('count', ascending=False).head(10)
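If you would rather build the keyword/count table directly from the queries with pandas alone, a sketch along these lines might work. It assumes the search terms follow q= in the URL and are not percent-encoded (as in the question's sample); the regular expression and the variable names are illustrative assumptions.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'site': ['google.com', 'google.com'],
    'query': [
        'https://www.google.com/search?q=shoe store in new york',
        'https://www.google.com/search?q=new york attractions',
    ],
})

top10 = (
    df['query']
    .str.extract(r'[?&]q=([^&]*)', expand=False)  # text of the q parameter
    .str.split()                                  # list of words per row
    .explode()                                    # one word per row
    .value_counts()                               # sorted by count, descending
    .rename_axis('keyword')
    .reset_index(name='count')
    .head(10)
)
print(top10)
```

Because value_counts already sorts in descending order, head(10) is enough here and no separate sort_values call is needed.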

CodePudding user response：

There are probably better ways to do this, but here's mine:

results = ['the','the','plane','bus','plane','light','the']
def countwords(results):
    worddict = {}
    for result in results:
        if result in worddict:
            worddict[result] += 1
        else:
            worddict[result] = 1
    return worddict
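To go from a dictionary of counts to the top 10 words, you can sort its items by count. The sketch below is self-contained, so it defines its own counting helper (count_words, a name chosen for this illustration) rather than relying on the function above.

```python
def count_words(words):
    """Return a dict mapping each word to the number of times it appears."""
    counts = {}
    for word in words:
        # dict.get supplies 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts

results = ['the', 'the', 'plane', 'bus', 'plane', 'light', 'the']
counts = count_words(results)

# Sort (word, count) pairs by count, highest first, and keep at most 10.
top10 = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]
print(top10)  # [('the', 3), ('plane', 2), ('bus', 1), ('light', 1)]
```

Python's sort is stable, so words with equal counts stay in the order they were first seen.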