Scraping / Extracting links from a Webpage

Time:03-04

import requests
from bs4 import BeautifulSoup

url = "https://www.volvogroup.com/en/news-and-media/press-releases.html"
source = requests.get(url)
soup = BeautifulSoup(source.text, "html.parser")
# Print the href of every <a> nested inside a <p>
for i in soup.find_all('p'):
    for j in i.find_all('a'):
        href = j.get('href')
        print(href)

I am able to fetch the links in href here. But when I create a DataFrame like this using a list comprehension, I am not able to get the same output in the DataFrame:

check = soup.find_all('p' , class_ = "articlelist__headerTitle")
for i in range(len(check)):
    df.loc[i , 'company_id']  = 'Volvo_AB'
    df.loc[i, 'links'] = [ i.a.get('href')for i in soup.find_all('p')]
print(df)

CodePudding user response:

A few of the <p> elements in the list do not contain an <a> tag, so i.a is None for them and calling .get('href') on it fails. You need to filter those out. Try the code below.

import pandas as pd

check = soup.find_all('p', class_="articlelist__headerTitle")
# Keep only the headline paragraphs that actually contain an <a> tag
values = [p.a.get('href') for p in check if p.a is not None]
# One row per link; the scalar assignment fills company_id for every row
df = pd.DataFrame({'links': values})
df['company_id'] = 'Volvo_AB'
print(df)
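To see the filtering step in isolation, here is a minimal sketch that runs without a network request. The sample HTML and the helper name extract_links are hypothetical and only stand in for the real press-release page; the class name is the one used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the press-release page.
SAMPLE_HTML = """
<p class="articlelist__headerTitle"><a href="/en/news/one.html">One</a></p>
<p class="articlelist__headerTitle">No link here</p>
<p class="articlelist__headerTitle"><a href="/en/news/two.html">Two</a></p>
"""


def extract_links(html, css_class):
    """Return the href of the first <a> in each matching <p>, skipping <p> tags without one."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p", class_=css_class)
    # p.a is None when the paragraph has no <a> descendant,
    # so filter those out before calling .get()
    return [p.a.get("href") for p in paragraphs if p.a is not None]


links = extract_links(SAMPLE_HTML, "articlelist__headerTitle")
print(links)
```

Without the "if p.a is not None" condition, the second paragraph would raise an AttributeError, because None has no .get method.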

CodePudding user response:

import pandas as pd

check = soup.find_all('p', class_="articlelist__headerTitle")
# Build the link column directly from the headline paragraphs
df_news = pd.DataFrame(columns=['link'], data=[p.a.get('href') for p in check])
# A scalar assignment fills every row, so no loop is needed
df_news['company_id'] = 'Volvo_AB'

print(df_news)
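As an alternative, a CSS selector can do the filtering, since it only matches <a> tags that exist and carry an href. The following is a sketch under assumptions: the sample HTML and the function name links_to_frame are hypothetical, and it uses BeautifulSoup's select method rather than the find_all approach from the answers above. Unlike p.a, the selector returns every matching <a> inside a paragraph, not just the first one.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the press-release page.
SAMPLE_HTML = """
<p class="articlelist__headerTitle"><a href="/en/news/one.html">One</a></p>
<p class="articlelist__headerTitle">No link here</p>
<p class="articlelist__headerTitle"><a href="/en/news/two.html">Two</a></p>
"""


def links_to_frame(html, company_id):
    """Build a DataFrame with one row per link found in the headline paragraphs."""
    soup = BeautifulSoup(html, "html.parser")
    # Match only <a> tags that have an href and sit inside a headline <p>
    anchors = soup.select("p.articlelist__headerTitle a[href]")
    rows = [{"company_id": company_id, "link": a["href"]} for a in anchors]
    # Passing columns explicitly keeps the layout stable even when no links are found
    return pd.DataFrame(rows, columns=["company_id", "link"])


df_news = links_to_frame(SAMPLE_HTML, "Volvo_AB")
print(df_news)
```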
