I'm doing a project to scrape how many links each page in a series of webpages has.
My idea is to store the link count for each page in a column of a Pandas DataFrame, so the result looks like this:
   title  count links
0  page1            2
1  page2            3
2  page3            0
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

links_bs4 = ['page1', 'page2']

article_title = []
links = []

for item in links_bs4:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('title')
    article_title.append(title.string)
    body_text = soup.find('div', class_='article-body')
    for link in body_text.find_all('a'):
        links.append(link.get('href'))
    count_of_links = len(links)

s1 = pd.Series(article_title, name='title')
s2 = pd.Series(count_of_links, name='count links')
df = pd.concat([s1, s2], axis=1)
It partly works, but count_of_links = len(links) gives the count of all links across all pages combined.
I want the count for each page, not the total as is happening now. How can I do this? My for loop is accumulating the count for the whole list. Should I create a new list for each URL I scrape, or use something else in Python?
I'm clearly missing part of the logic.
CodePudding user response:
You can treat count_of_links the same way as article_title. Below is your code with my changes.
import requests
from bs4 import BeautifulSoup
import pandas as pd

links_bs4 = ['page1', 'page2']

article_title = []
count_of_links = []  # <------ added
links = []

for item in links_bs4:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('title')
    article_title.append(title.string)
    body_text = soup.find('div', class_='article-body')
    count = 0  # <------- added
    for link in body_text.find_all('a'):
        links.append(link.get('href'))
        # count_of_links = len(links)  # <------- commented out
        count += 1  # <------- added
    count_of_links.append(count)  # <------- added, once per page

s1 = pd.Series(article_title, name='title')
s2 = pd.Series(count_of_links, name='count links')
df = pd.concat([s1, s2], axis=1)
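As a self-contained sanity check of the per-page counting logic, here is a sketch that parses inline HTML strings instead of fetching real pages (the page names and HTML below are made-up stand-ins for requests.get responses):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical pages; in the real script the HTML comes from requests.get
pages = {
    'page1': '<title>Page 1</title><div class="article-body"><a href="/a">a</a><a href="/b">b</a></div>',
    'page2': '<title>Page 2</title><div class="article-body"><a href="/c">c</a></div>',
}

article_title = []
count_of_links = []
links = []

for item, html in pages.items():
    soup = BeautifulSoup(html, 'html.parser')
    article_title.append(soup.find('title').string)
    body_text = soup.find('div', class_='article-body')
    page_links = [a.get('href') for a in body_text.find_all('a')]
    links.extend(page_links)                 # running list of all links
    count_of_links.append(len(page_links))   # per-page count, not a running total

df = pd.concat([pd.Series(article_title, name='title'),
                pd.Series(count_of_links, name='count links')], axis=1)
print(df)
```

The key point is that len() is taken over a list that is rebuilt for each page, so each appended count covers only that page's links.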
Or you may code it this way; then you won't need to create one variable per new column, you only need to expand the dictionary.
import requests
from bs4 import BeautifulSoup
import pandas as pd

links_bs4 = ['page1', 'page2']

data = []
links = []

for item in links_bs4:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('title')
    body_text = soup.find('div', class_='article-body')
    link_temp = [link.get('href') for link in body_text.find_all('a')]
    data.append({'title': title.string, 'count links': len(link_temp)})
    links.extend(link_temp)

df = pd.DataFrame(data)