Why is Python showing empty brackets for my web scraping project?

Time:10-21

I thought it might be because the site requires you to log in, so I used a cURL converter to get the cookies and header information. I also thought it could have been an outdated version of Python, so I installed the latest one and started a new project. I ran pip install for lxml, bs4, and requests to be sure.

Here's the code:

from bs4 import BeautifulSoup
import requests

cookies = {
    '_ga': 'GA1.2.482371687.1666124695',
    '_gid': 'GA1.2.595755325.1666124695',
    '_gat': '1',
    'com.auth0.auth.~wzXaja6nVH3sT6VYctz1OnJLcemRYfb': '{"nonce":null,"state":"~wzXaja6nVH3sT6VYctz1OnJLcemRYfb","lastUsedConnection":"Username-Password-Authentication"}',
    'co/verifier/https%3A%2F%2Flogin.newsfilter.io/DGFVCDDR3SiY': '"sFtGmS7xwG78eWkvfmXmE47I7IjPRbkz"',
}

headers = {
    'authority': 'newsfilter.io',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    # Requests sorts cookies= alphabetically
    # 'cookie': '_ga=GA1.2.482371687.1666124695; _gid=GA1.2.595755325.1666124695; _gat=1; com.auth0.auth.~wzXaja6nVH3sT6VYctz1OnJLcemRYfb={"nonce":null,"state":"~wzXaja6nVH3sT6VYctz1OnJLcemRYfb","lastUsedConnection":"Username-Password-Authentication"}; co/verifier/https%3A%2F%2Flogin.newsfilter.io/DGFVCDDR3SiY="sFtGmS7xwG78eWkvfmXmE47I7IjPRbkz"',
    'if-modified-since': 'Thu, 01 Sep 2022 12:54:42 GMT',
    'if-none-match': 'W/"560e9bcfba289103a0d39974205f6314"',
    'sec-ch-ua': '"Chromium";v="106", "Google Chrome";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
}

html_text = requests.get('https://newsfilter.io/latest/news', cookies=cookies, headers=headers).text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('div', class_='sc-dnqmqq bxsfdc')

print(jobs)

CodePudding user response:

Using your browser's dev tools, look at the requests the page makes to the server.

Look at the HTML and check that it is mostly empty, then notice that the page makes a JSON request to https://static.newsfilter.io/landing-page/articles-latest-news.json
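If you would rather confirm this from Python than from the dev tools, here is a minimal sketch using only the standard library. The helper name count_class_matches and the two sample HTML strings are made up for illustration; in practice you would pass in the html_text from the question. Unlike find_all with class_='sc-dnqmqq bxsfdc', this check ignores the order of the class names.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains every wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted and self.wanted <= classes:
            self.count += 1


def count_class_matches(html_text, class_names):
    """Return how many tags in html_text carry all of class_names."""
    parser = ClassCounter(class_names)
    parser.feed(html_text)
    parser.close()
    return parser.count


# Hypothetical samples: a JavaScript "shell" page versus rendered markup.
shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered = '<div class="sc-dnqmqq bxsfdc">headline</div>'

print(count_class_matches(shell, "sc-dnqmqq bxsfdc"))
print(count_class_matches(rendered, "sc-dnqmqq bxsfdc"))
```

A count of zero on the HTML that requests actually received would suggest the div is built later by JavaScript, which requests does not run.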

Then it is just a matter of fetching that file and parsing it.

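import requests

# The page loads its articles from this static JSON file.
resp = requests.get('https://static.newsfilter.io/landing-page/articles-latest-news.json')
if resp.status_code == 200:
    # Each entry is one article; print its timestamp and title.
    for headline in resp.json():
        print(f'{headline["publishedAt"]}: {headline["title"]}')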
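If some entries turn out to lack one of those keys, indexing with ["title"] would raise a KeyError. Here is a slightly more defensive sketch, assuming the JSON is a list of dicts as the loop above implies. The function name format_headlines and the fallback strings are my own choices, not anything the site defines.

```python
def format_headlines(articles):
    """Turn a list of article dicts into 'publishedAt: title' lines."""
    lines = []
    for article in articles:
        # .get() avoids a KeyError when an entry lacks a field.
        published = article.get("publishedAt", "unknown date")
        title = article.get("title", "(no title)")
        lines.append(f"{published}: {title}")
    return lines
```

With the response from the snippet above, print('\n'.join(format_headlines(resp.json()))) would print one line per article.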