how to get all 'href' from page in company profile box-CodePudding

Can anyone tell me where is the prblm i am new in python i want get all links from this page here is my code import requests from bs4 import BeautifulSoup import pandas as pd

re=requests.get('https://www.industrystock.com/en/companies/Agriculture')
re
soup = BeautifulSoup(re.text, 'lxml')
link_list = []
page1 = soup.find_all('a', class_ = 'btn awe-info gotoJS iconColor_white')
page1
for i in page1:
        link = (i.get('href'))
        link_list.append(link)

CodePudding user response：

The links to company profiles are stored in data-href= attribute:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.industrystock.com/en/companies/Agriculture")
soup = BeautifulSoup(r.content, "lxml")
page1 = soup.find_all("a", class_="btn awe-info gotoJS iconColor_white")
for i in page1:
    print(i["data-href"])

Prints:

https://www.industrystock.com/en/company/profile/ARCA-Internet-Services-Ltd./370071
https://www.industrystock.com/en/company/profile/Забайкальская-аграрная-Ассоциация-образовательных-и-научных-учреждений/256182
https://www.industrystock.com/en/company/profile/...VÁŠ-INTERIÉR-s.r.o./534809
https://www.industrystock.com/en/company/profile/1-WITOS-s.r.o./529071
https://www.industrystock.com/en/company/profile/1.-TOUŠEŇSKÁ-s.r.o./544981
https://www.industrystock.com/en/company/profile/1.HEFAISTOS-s.r.o./541263
https://www.industrystock.com/en/company/profile/1.HRADECKÁ-ZEMĚDĚLSKÁ-a.s./548267
https://www.industrystock.com/en/company/profile/1.MAXIMA-INTERNATIONAL-s.r.o./530049
https://www.industrystock.com/en/company/profile/1.MIROSLAVSKÁ-STROJÍRNA-spol.-s-r.o./544781
https://www.industrystock.com/en/company/profile/1.VASTO-spol.-s-r.o./535985
https://www.industrystock.com/en/company/profile/1C-PRO-s.r.o./534831
https://www.industrystock.com/en/company/profile/1CSC-a.s./528169
https://www.industrystock.com/en/company/profile/1P-CONTROL/549995
https://www.industrystock.com/en/company/profile/2-ES-spol.-s-r.o./547849
https://www.industrystock.com/en/company/profile/2-G-SERVIS-spol.-s-r.o./528391
https://www.industrystock.com/en/company/profile/2-JCP-a.s./537151
https://www.industrystock.com/en/company/profile/2-THETA-ASE-s.r.o./545079
https://www.industrystock.com/en/company/profile/2LMAKERS-s.r.o./542127
https://www.industrystock.com/en/company/profile/2M-SERVIS-s.r.o./550923
https://www.industrystock.com/en/company/profile/2M-STATIC-s.r.o./549935
https://www.industrystock.com/en/company/profile/2M-STROJE-s.r.o./539885
https://www.industrystock.com/en/company/profile/2TMOTORS-s.r.o./543869
https://www.industrystock.com/en/company/profile/2VV-s.r.o./538993
https://www.industrystock.com/en/company/profile/2xSERVIS-s.r.o./528321
https://www.industrystock.com/en/company/profile/3-PLUS-1-SERVICE-s.r.o./535103
https://www.industrystock.com/en/company/profile/3-TOOLING-s.r.o./540599
https://www.industrystock.com/en/company/profile/3B-SOCIÁLNÍ-FIRMA-s.r.o./535127
https://www.industrystock.com/en/company/profile/3D-KOVÁRNA-s.r.o./549765
https://www.industrystock.com/en/company/profile/3D-TECH-spol.-s-r.o./548047
https://www.industrystock.com/en/company/profile/3DNC-SYSTEMS-s.r.o./549379

CodePudding user response：

Try this:

response = requests.get('https://www.industrystock.com/en/companies/Agriculture')
soup = BeautifulSoup(response.text, 'lxml')
link_list = []
page1 = soup.find_all('a', {"class":'btn awe-info gotoJS iconColor_white'})
for i in page1:
        link = i['href']
        link_list.append(link)

And I would also recommend using html.parser if you are not scraping XML.