I am having trouble extracting URLs from a page into a string. I have tried something like this:
url = "https://in.indeed.com/jobs?q=software engineer &l=Kerala"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div",{"class:","pagination"})
url = [Links1.find(('a')['href'] for tag in Links1)]
WEbsite=f'https://in.indeed.com{url[0]}'
but it's not returning the full URL list. I need the URLs to navigate to the next page.
CodePudding user response:
Are you just after the "next page" link, or do you want all of the links?
So, do you want just:
/jobs?q=software engineer &l=Kerala&start=10
or are you after all of these?
/jobs?q=software engineer &l=Kerala&start=10
/jobs?q=software engineer &l=Kerala&start=20
/jobs?q=software engineer &l=Kerala&start=30
/jobs?q=software engineer &l=Kerala&start=40
/jobs?q=software engineer &l=Kerala&start=10
A few issues: Links1 is a list of elements, and you are then calling .find('a') on that list, which won't work. Also, since you want the href attributes, consider using find('a', href=True).
So here's how I would go about it:
import requests
from bs4 import BeautifulSoup
url = "https://in.indeed.com/jobs?q=software engineer &l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div",{"class":"pagination"})
url = [tag.find('a',href=True)['href'] for tag in Links1]
website=f'https://in.indeed.com{url[0]}'
Output:
print(website)
https://in.indeed.com/jobs?q=software engineer &l=Kerala&start=10
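One more thing to note: the query string contains a literal space ("software engineer "), which requests will percent-encode for you when fetching, but which you'd normally want to encode yourself when building URLs by string concatenation. A small sketch using the standard library's urllib.parse (the href value below is an assumed example, not scraped output):

```python
from urllib.parse import urljoin, urlencode

base = "https://in.indeed.com"
# Relative href as it would come out of the pagination div (assumed value)
href = "/jobs?q=software engineer &l=Kerala&start=10"

# urljoin resolves a relative href against the site root,
# which is more robust than f-string concatenation
website = urljoin(base, href)
print(website)  # https://in.indeed.com/jobs?q=software engineer &l=Kerala&start=10

# To build the query string from scratch with proper encoding:
params = {"q": "software engineer", "l": "Kerala", "start": 10}
url = f"{base}/jobs?{urlencode(params)}"
print(url)  # https://in.indeed.com/jobs?q=software+engineer&l=Kerala&start=10
```

Alternatively, pass the query as `params=params` to `requests.get()` and let requests do the encoding.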
To get all those links:
import requests
from bs4 import BeautifulSoup
url = "https://in.indeed.com/jobs?q=software engineer &l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find("div",{"class":"pagination"})
urls = [tag['href'] for tag in Links1.find_all('a',href=True)]
websites = [f'https://in.indeed.com{u}' for u in urls]
CodePudding user response:
You should use find() instead of find_all(); then this modified url list should work:
Links1 = soup.find("div", {"class": "pagination"})
urls = [i['href'] for i in Links1.find_all('a') if 'href' in i.attrs]
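To see the find() vs find_all() difference without hitting the live site, here is a minimal offline sketch against made-up pagination markup (the HTML below is hypothetical, just shaped like Indeed's pagination div):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the pagination markup
html = """
<div class="pagination">
  <a href="/jobs?q=se&start=10">2</a>
  <a href="/jobs?q=se&start=20">3</a>
  <a href="/jobs?q=se&start=10">Next</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the single pagination div as a Tag,
# so find_all('a', href=True) can be called on it directly
pagination = soup.find("div", {"class": "pagination"})
urls = [a["href"] for a in pagination.find_all("a", href=True)]
print(urls)  # ['/jobs?q=se&start=10', '/jobs?q=se&start=20', '/jobs?q=se&start=10']
```

Note the "Next" href can repeat an earlier page URL, which also explains the duplicate start=10 entry in the list above.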