I want to count how often a specific word appears on the page at a given URL. I currently have a way to do this for a small set of URLs and a single word:
import requests
from bs4 import BeautifulSoup

url_list = ["https://www.example.org/","https://www.example.com/"]
#the_word = input()
the_word = 'Python'
total_words = []

for url in url_list:
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content.lower(), 'lxml')
    words = soup.find_all(text=lambda text: text and the_word.lower() in text)
    count = len(words)
    words_list = [ ele.strip() for ele in words ]
    for word in words:
        total_words.append(word.strip())
    print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))
    print(words_list)

#print(total_words)
total_count = len(total_words)
However, I would like to do this for a set of words mapped to their respective URLs, as in the data frame below.
Target Word | Target URL |
---|---|
word1 | www.example.com/topic-1/ |
word2 | www.example.com/topic-2/ |
The output would ideally be a new column with a count of how often each word appears on its associated URL. For example, how often 'word1' appears on 'www.example.com/topic-1/'.
Any and all help is much appreciated!
CodePudding user response:
You should try the string count() method. Note that it has to be called on the page text, not on the URL string itself. With your code, it would look like this:
count = r.text.lower().count(the_word.lower())
print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))
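One thing to keep in mind is that count() matches substrings and is case-sensitive, so 'Python' would also be found inside 'pythonic' once the text is lowercased. Below is a minimal sketch contrasting a plain substring count with a whole-word count based on a regular expression. The helper names count_substring and count_whole_word and the sample string are made up for illustration.

```python
import re


def count_substring(text, word):
    """Case-insensitive substring count, the same idea as str.count."""
    return text.lower().count(word.lower())


def count_whole_word(text, word):
    """Case-insensitive count of `word` only where it stands alone."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical sample text, standing in for a downloaded page.
sample = "Python and pythonic code; I like PYTHON."

print(count_substring(sample, "Python"))   # 3 ("pythonic" is included)
print(count_whole_word(sample, "Python"))  # 2 (only standalone matches)
```

Which of the two is appropriate depends on whether partial matches should count for your use case.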
CodePudding user response:
Just iterate over your structure (a dict, a list of dicts, ...). The following example only points in a direction, because your question is not entirely clear and lacks an exact expected result. I am sure you can adapt it to your specific needs.
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = [
    {'word':'Python','url':'https://stackoverflow.com/questions/tagged/python'},
    {'word':'Question','url':'https://stackoverflow.com/questions/tagged/python'}
]

for item in data:
    r = requests.get(item['url'], allow_redirects=False)
    count = r.text.lower().count(item['word'].lower())
    item['count'] = count

pd.DataFrame(data)
Output
word | url | count |
---|---|---|
Python | https://stackoverflow.com/questions/tagged/python | 403 |
Question | https://stackoverflow.com/questions/tagged/python | 686 |
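To get closer to the word-to-URL mapping from the question, here is a rough sketch of the same idea wrapped in a function. It assumes two things that are not in the original code: URLs without a scheme (such as 'www.example.com/topic-1/') get 'https://' prepended, since requests would otherwise reject them, and each distinct URL is downloaded only once. The names normalize_url, fetch_text and count_words_on_urls are hypothetical, and the demo uses a fake in-memory fetcher instead of real network calls.

```python
def normalize_url(url):
    """Prepend a scheme when the URL has none (assumption: https)."""
    if url.startswith(("http://", "https://")):
        return url
    return "https://" + url


def fetch_text(url):
    """Download a page and return its body as text (needs requests)."""
    import requests  # imported lazily so the demo below runs without it
    return requests.get(url, timeout=10).text


def count_words_on_urls(pairs, fetch=fetch_text):
    """Return one count per (word, url) pair, fetching each URL once."""
    cache = {}
    counts = []
    for word, url in pairs:
        full_url = normalize_url(url)
        if full_url not in cache:
            cache[full_url] = fetch(full_url).lower()
        # Substring count on the raw HTML, as in the example above.
        counts.append(cache[full_url].count(word.lower()))
    return counts


# Fake pages standing in for real responses (illustration only).
fake_pages = {
    "https://www.example.com/topic-1/": "<p>word1 and Word1</p>",
    "https://www.example.com/topic-2/": "<p>word2</p>",
}
pairs = [
    ("word1", "www.example.com/topic-1/"),
    ("word2", "www.example.com/topic-2/"),
]
print(count_words_on_urls(pairs, fetch=fake_pages.__getitem__))  # [2, 1]
```

With a data frame df that has the columns 'Target Word' and 'Target URL', the new column could then be filled with something like df['count'] = count_words_on_urls(zip(df['Target Word'], df['Target URL'])). Like the example above, this counts matches in the raw HTML, so markup and scripts are included in the count.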