Home > other >  How to avoid attribute error while extracting links from articles?
How to avoid attribute error while extracting links from articles?

Time:09-10

I am trying my hand at webscraping using BeautifulSoup.

I had posted this before here, but I was not very clear as to what I wanted, so it only partially answers my issue. error

CodePudding user response:

That is cause articles is a ResultSet of soup.find_all(article_tag) what you can check with type(articles)

To get your goal you have to iterate articles first - So simply add an additional for-loop to your list comprehension:

[t.get('href') for article in articles for t in article.find_all(tag)]

In addition you may should use a set comprehension to avoid duplicates and also concat paths with base url:

list(set(t.get('href') if t.get('href').startswith('http') else 'https://bigbangtheory.fandom.com' t.get('href') for article in articles for t in article.find_all(tag)))

Output:

['https://bigbangtheory.fandom.com/wiki/The_Killer_Robot_Instability',
 'https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali',
 'https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/The_Valentino_Submergence',
 'https://bigbangtheory.fandom.com/wiki/The_Beta_Test_Initiation',
 'https://bigbangtheory.fandom.com/wiki/Season_2',
 'https://bigbangtheory.fandom.com/wiki/Dr._Pemberton',...]
  • Related