How to extract URLs from a string and save them to a list


I am having trouble extracting URLs from a string and saving them to a list.

I have tried something like this:

url = "https://in.indeed.com/jobs?q=software engineer &l=Kerala"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div",{"class:","pagination"})

url = [Links1.find(('a')['href'] for tag in Links1)]
WEbsite=f'https://in.indeed.com{url[0]}'

but it's not returning the full URL list. I need the URLs to navigate to the next page.

CodePudding user response:

Are you just after the "next page" link, or do you want all the links?

So, do you want just:

/jobs?q=software engineer &l=Kerala&start=10

Or are you after all of these?

/jobs?q=software engineer &l=Kerala&start=10
/jobs?q=software engineer &l=Kerala&start=20
/jobs?q=software engineer &l=Kerala&start=30
/jobs?q=software engineer &l=Kerala&start=40
/jobs?q=software engineer &l=Kerala&start=10

A few issues:

  1. Links1 is a list of elements (a ResultSet), and you are then calling .find('a') on that list, which won't work.
  2. Since you want the href attributes, consider using find('a', href=True). A quick sketch of the difference follows this list.
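
To make issue 1 concrete, here is a minimal standalone sketch (the HTML is made up, just to show the ResultSet vs. single-Tag difference):

from bs4 import BeautifulSoup

html = '<div class="pagination"><a href="/jobs?start=10">2</a></div>'  # stand-in markup
soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div", {"class": "pagination"})  # ResultSet: behaves like a list
# divs.find('a')                                      # AttributeError on a ResultSet
print(divs[0].find('a', href=True)['href'])           # works on a single Tag: /jobs?start=10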

So here's how I would go about it:

import requests
from bs4 import BeautifulSoup

url = "https://in.indeed.com/jobs?q=software engineer &l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div", {"class": "pagination"})  # ResultSet of pagination divs

# Take the first href from each pagination div (note: this reassigns url to a list)
url = [tag.find('a', href=True)['href'] for tag in Links1]
website = f'https://in.indeed.com{url[0]}'

Output:

print(website)
https://in.indeed.com/jobs?q=software engineer &l=Kerala&start=10
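
As an aside, urllib.parse.urljoin from the standard library does the same prefixing and is a bit more robust if an href ever comes back absolute or without a leading slash (a minimal sketch):

from urllib.parse import urljoin

website = urljoin('https://in.indeed.com', url[0])  # same result as the f-string above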

To get all those links:

import requests
from bs4 import BeautifulSoup

url = "https://in.indeed.com/jobs?q=software engineer &l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find("div", {"class": "pagination"})  # a single Tag this time, via find()

# Collect the href of every anchor inside the pagination div
urls = [tag['href'] for tag in Links1.find_all('a', href=True)]
websites = [f'https://in.indeed.com{u}' for u in urls]
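
And since the goal was navigating to the next pages, here is a sketch of how you might then fetch each one (reusing headers from above; what you parse out of each page is up to you):

for page_url in websites:
    page_soup = BeautifulSoup(requests.get(page_url, headers=headers).content, "html.parser")
    # ... pull the job listings out of page_soup here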

CodePudding user response:

You should use find() instead of find_all(); then this modified url list should work:

Links1 = soup.find("div", {"class": "pagination"})
urls = [i['href'] for i in Links1.find_all('a') if 'href' in i.attrs]
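
As in the first answer, you can then prefix each path to get absolute URLs (a sketch):

websites = [f'https://in.indeed.com{u}' for u in urls]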