Check how many links I have in each page and put the count at a dataframe column


I'm working on a project that scrapes how many links each page in a series of webpages has.

My idea is to add the link count for each page to a column of a Pandas dataframe, ending up with something like this:

     title  count links
  0  page1  2
  1  page2  3
  2  page3  0
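
For reference, a dataframe with that shape can be built directly from two parallel lists (the values below are just placeholders):

import pandas as pd

titles = ['page1', 'page2', 'page3']
counts = [2, 3, 0]
df = pd.DataFrame({'title': titles, 'count links': counts})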

Here is the code I wrote:

import requests
import pandas as pd
from bs4 import BeautifulSoup

links_bs4 = ['page1', 'page2']
article_title = []
links = []

for item in links_bs4:
  page = requests.get(item)
  soup = BeautifulSoup(page.content, 'html.parser')
  title = soup.find('title')
  article_title.append(title.string)
  body_text = soup.find('div', class_='article-body')
  for link in body_text.find_all('a'):
    links.append((link.get('href')))
    count_of_links = len(links)

s1 = pd.Series(article_title, name='title')
s2 = pd.Series(count_of_links, name='count links')
df = pd.concat([s1, s2], axis=1)

It partly works, but count_of_links = len(links) produces a count of all links across all pages combined.

I want the count for each page, not the total as it is now. How can I do this? My for loop is accumulating the count over the whole list. Should I create a new list for each URL I scrape, or use something else in Python?

I'm clearly missing some part of the logic.

CodePudding user response:

You can treat count_of_links the same way as article_title. The code below is based on yours, with my changes marked.

import requests
import pandas as pd
from bs4 import BeautifulSoup

links_bs4 = ['page1', 'page2']
article_title = []
count_of_links = [] # <------ added
links = []

for item in links_bs4:
  page = requests.get(item)
  soup = BeautifulSoup(page.content, 'html.parser')
  title = soup.find('title')
  article_title.append(title.string)
  body_text = soup.find('div', class_='article-body')

  count = 0 # <------- added
  for link in body_text.find_all('a'):
    links.append((link.get('href')))
    # count_of_links = len(links) # <------- commented out
    count += 1 # <------- added
  count_of_links.append(count) # <------- added

s1 = pd.Series(article_title, name='title')
s2 = pd.Series(count_of_links, name='count links')
df = pd.concat([s1, s2], axis=1)
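
A small design note: since the per-page count is just the number of anchor tags in the article body, the inner counting loop can be replaced by a single len() call. A minimal sketch of that part of the loop, assuming the same body_text, links and count_of_links variables as above:

  anchors = body_text.find_all('a')
  links.extend(a.get('href') for a in anchors)   # still collect every href
  count_of_links.append(len(anchors))            # one count per page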

Alternatively, you can write it this way; then you don't need to create a separate variable for each new column, you only need to extend the dictionary.

import requests
import pandas as pd
from bs4 import BeautifulSoup

links_bs4 = ['page1', 'page2']
data = []
links = []

for item in links_bs4:
  page = requests.get(item)
  soup = BeautifulSoup(page.content, 'html.parser')
  title = soup.find('title')

  body_text = soup.find('div', class_='article-body')
  link_temp = [link.get('href') for link in body_text.find_all('a')]

  data.append({'title': title.string, 'count links': len(link_temp)})
  links.extend(link_temp)

df = pd.DataFrame(data)
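
If a page is missing the article-body div or a request fails, the loop above will raise an AttributeError or quietly parse an error page. A hedged variant with basic guards (the error handling below is my own assumption, not part of the original answer):

import requests
import pandas as pd
from bs4 import BeautifulSoup

links_bs4 = ['page1', 'page2']
data = []
links = []

for item in links_bs4:
  page = requests.get(item)
  page.raise_for_status()  # stop early on HTTP errors
  soup = BeautifulSoup(page.content, 'html.parser')
  title = soup.find('title')

  body_text = soup.find('div', class_='article-body')
  # if the page has no article body, record a count of 0 instead of crashing
  link_temp = [a.get('href') for a in body_text.find_all('a')] if body_text else []

  data.append({'title': title.string if title else item, 'count links': len(link_temp)})
  links.extend(link_temp)

df = pd.DataFrame(data)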