When I try to scrape a website over multiple pages, BeautifulSoup returns
the first page's content for every page in the range. The same content is repeated again and again.
data = pd.DataFrame()
for i in range(1, 10):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    url = "https://www.collegesearch.in/engineering-colleges-india".format(i)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    # clg url and name
    clg = soup.find_all('h2', class_='media-heading mg-0')
    # other details
    details = soup.find_all('dl', class_='dl-horizontal mg-0')
    _dict = {'clg': clg, 'details': details}
    df = pd.DataFrame(_dict)
    data = data.append(df, ignore_index=True)
CodePudding user response:
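This is not a BeautifulSoup issue.
- Check your loop: you never change the page, because the URL is always the same. Calling .format(i) on a string that contains no {} placeholder leaves the string unchanged, so every request goes to: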
https://www.collegesearch.in/engineering-colleges-india
So change your code and pass the loop counter as the value of the page parameter:
for i in range(1, 10):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    url = f"https://www.collegesearch.in/engineering-colleges-india?page={i}"
    print(url)
You may also want to read the Python tutorial section on string formatting: https://docs.python.org/3/tutorial/inputoutput.html
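To make the difference concrete, here is a minimal sketch that needs no network access. It only shows why the original .format(i) call had no effect and how the page URLs can be built instead. The helper name build_page_urls is hypothetical (it is not part of requests or BeautifulSoup), and the ?page= query parameter is taken from the answer above rather than verified against the site.

```python
BASE_URL = "https://www.collegesearch.in/engineering-colleges-india"


def build_page_urls(base_url, first_page, last_page):
    """Hypothetical helper: return one URL per page, first_page..last_page inclusive."""
    return [f"{base_url}?page={page}" for page in range(first_page, last_page + 1)]


# str.format() ignores its arguments when the string has no {} placeholder,
# so the original loop requested the same URL on every iteration.
unchanged = BASE_URL.format(3)

# With the page number interpolated, each iteration gets a distinct URL.
page_urls = build_page_urls(BASE_URL, 1, 3)

if __name__ == "__main__":
    print(unchanged == BASE_URL)
    for url in page_urls:
        print(url)
```

Each URL in page_urls can then be passed to requests.get(url, headers=headers) inside the loop, exactly as in the question's code.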