How to scrape data from a list of URLs

Time:08-25

I am trying to write code that scrapes information from a list of web pages. My goal is to collect all the data and save it in a JSON file. The result should look like this:

[
    {
        "title": "Python developer",
        "place": "Slovensko",
        "salary": "od 1000 €",
        "contract_type": "dohoda",
        "contact_email": "[email protected]"
    },
...
]

I wrote code that collects all the links from a seed page, and it works fine, but I am stuck at scraping the data. Here is the code I wrote:

from bs4 import BeautifulSoup
import requests
import re


zaciatok = "https://www.hyperia.sk/kariera"
def getHTMLdocument(zaciatok):
    response = requests.get(zaciatok)
    return response.text

vsetky_linky= []
html_document = getHTMLdocument(zaciatok)
soup = BeautifulSoup(html_document, "html.parser")

# Collect the href of every <a> element that has the CSS class "arrow-link"
for link in soup.find_all("a", class_="arrow-link"):
    vsetky_linky.append(link.get("href"))


vsetky_linky.pop()

urls = []
for x in vsetky_linky:
    urls.append("https://www.hyperia.sk" + x)
    
 

daaata = []
for url in urls:
    print(url)
    req = requests.get(url)
    req.encoding = "utf-8-sig"
    
    polievka = BeautifulSoup(req.text, "html.parser")

    
    nadpis = polievka.find("div", attrs={'class': 'hero-text col-lg-12'})
    br = polievka.find("br")
    for p in polievka.select("p:has(br)"):
        daaata.append(
            [
                nadpis.get_text(strip=True) ,
                br.get_text(strip=True) , 
                ]
            )
print(daaata)
                

At the end I print the scraped data, and I see that it also pulled in the text from under the header (I need only the header, "Python developer", not the text under it). Can you help me?

Answer:

Try to select your elements more specifically; in your case, the <h1>:

"title": polievka.h1.text,
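
To see why the narrower selector helps, here is a minimal sketch. The HTML below is hypothetical and only assumes that the real page wraps the <h1> and a short description in the same "hero-text" <div>; the actual markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from the real site):
# the title sits in an <h1>, with a description paragraph in the same <div>.
SAMPLE_HTML = """
<div class="hero-text col-lg-12">
  <h1>Python developer</h1>
  <p>Short description under the header.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selecting the wrapper <div> returns the text of everything inside it,
# so the description is glued onto the title.
whole_block = soup.find("div", attrs={"class": "hero-text col-lg-12"}).get_text(strip=True)

# Selecting the <h1> returns only the title.
title_only = soup.h1.get_text(strip=True)

print(whole_block)
print(title_only)
```

get_text() on a container concatenates the text of all its descendants, which would explain the extra text in your result.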

Here is an example of how to use it in your for-loop. Feel free to adapt it to your final needs; my Slovak is not that good, so I do not know which fields matter ;)

...
daaata = []
for url in urls:
    print(url)
    req = requests.get(url)
    req.encoding = "utf-8-sig"
    
    polievka = BeautifulSoup(req.text, "html.parser")
    
    daaata.append({
        "title": polievka.h1.text,
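        # 'img + p' is the <p> directly after the icon; the value is the node right after its <br>
        "place": polievka.select_one('img[alt="place"] + p br').next_element,
        "salary": polievka.select_one('img[alt="wage"] + p br').next_element,
        "contract_type": polievka.select_one('img[alt="work"] + p br').next_element,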
        "contact_email": polievka.select_one('a[href^="mailto"]').get('href').split(':')[-1]
    })

daaata

Output

[{'title': 'Python developer - študent', 'place': 'Slovensko', 'salary': '6 € / hodina', 'contract_type': 'dohoda o brig. práci študenta', 'contact_email': '[email protected]'},
 {'title': 'Senior PPC špecialista', 'place': 'Slovensko', 'salary': 'od 1 800,- €', 'contract_type': 'TPP, živnosť', 'contact_email': '[email protected]'},
 {'title': 'Product owner', 'place': 'Slovensko', 'salary': 'od 2 000 ,- €', 'contract_type': 'TPP, živnosť', 'contact_email': '[email protected]'},
 {'title': 'Lead Frontend developer', 'place': 'Slovensko', 'salary': '2 000 - 4 000 ,- €', 'contract_type': 'TPP, živnosť', 'contact_email': '[email protected]'},
 {'title': 'Frontend developer (medior/senior)', 'place': 'Slovensko', 'salary': '2 000 - 4 000 ,- €', 'contract_type': 'TPP, živnosť', 'contact_email': '[email protected]'},
 {'title': 'Kimbino senior PHP developer', 'place': 'Slovensko', 'salary': 'od 2 000 ,- €', 'contract_type': 'TPP, živnosť', 'contact_email': '[email protected]'}]
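
Since the goal in the question is a JSON file, the list of dicts can be written out with the standard json module. This is a minimal sketch: the function name save_jobs and the file name are hypothetical, chosen only for illustration.

```python
import json


def save_jobs(jobs, path):
    """Write a list of job dicts to `path` as pretty-printed UTF-8 JSON.

    `save_jobs` is a hypothetical helper name, not part of any library.
    """
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as "€" and "š" readable
        # in the file instead of escaping them as \uXXXX sequences.
        json.dump(jobs, f, ensure_ascii=False, indent=4)
```

After the loop you would call it as save_jobs(daaata, "jobs.json"). If the values were taken from text nodes in the parsed page, converting them with str() before saving is a cautious extra step.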