I thought it might be because the site asks you to log in, so I used a cURL converter to get the cookies and header information. I also thought it could be an outdated version of Python, so I installed the latest one and started a new project. I ran pip install for lxml, bs4, and requests to be sure.
Here's the code:
from bs4 import BeautifulSoup
import requests
cookies = {
    '_ga': 'GA1.2.482371687.1666124695',
    '_gid': 'GA1.2.595755325.1666124695',
    '_gat': '1',
    'com.auth0.auth.~wzXaja6nVH3sT6VYctz1OnJLcemRYfb': '{"nonce":null,"state":"~wzXaja6nVH3sT6VYctz1OnJLcemRYfb","lastUsedConnection":"Username-Password-Authentication"}',
    'co/verifier/https%3A%2F%2Flogin.newsfilter.io/DGFVCDDR3SiY': '"sFtGmS7xwG78eWkvfmXmE47I7IjPRbkz"',
}
headers = {
    'authority': 'newsfilter.io',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    # Requests sorts cookies= alphabetically
    # 'cookie': '_ga=GA1.2.482371687.1666124695; _gid=GA1.2.595755325.1666124695; _gat=1; com.auth0.auth.~wzXaja6nVH3sT6VYctz1OnJLcemRYfb={"nonce":null,"state":"~wzXaja6nVH3sT6VYctz1OnJLcemRYfb","lastUsedConnection":"Username-Password-Authentication"}; co/verifier/https%3A%2F%2Flogin.newsfilter.io/DGFVCDDR3SiY="sFtGmS7xwG78eWkvfmXmE47I7IjPRbkz"',
    'if-modified-since': 'Thu, 01 Sep 2022 12:54:42 GMT',
    'if-none-match': 'W/"560e9bcfba289103a0d39974205f6314"',
    'sec-ch-ua': '"Chromium";v="106", "Google Chrome";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
}
html_text = requests.get('https://newsfilter.io/latest/news', cookies=cookies, headers=headers).text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('div', class_='sc-dnqmqq bxsfdc')
print(jobs)
Answer:
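Using your browser's dev tools, look at the requests the page makes to the server.
Look at the HTML and check that it is mostly empty, then notice that the page makes a JSON request to https://static.newsfilter.io/landing-page/articles-latest-news.json
The headlines come from that JSON rather than from the initial HTML, which would explain why find_all has nothing to match in the text that requests receives.
It is just a matter of fetching that JSON and parsing it.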
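import requests
# Request the JSON file the page loads, instead of the mostly empty HTML.
resp = requests.get('https://static.newsfilter.io/landing-page/articles-latest-news.json')
if resp.status_code == 200:
    # The parsed body is iterated as a list of article objects.
    for headline in resp.json():
        print(f'{headline["publishedAt"]}: {headline["title"]}')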
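If some entries turn out to be missing a field, the loop above would raise a KeyError. A minimal sketch of a more tolerant version follows. It assumes the parsed JSON is a list of dicts with "publishedAt" and "title" keys, as the snippet above suggests. The helper name format_headlines and the sample data are made up for illustration and are not part of the site's API.

```python
def format_headlines(articles):
    """Return one 'publishedAt: title' line per article, skipping incomplete entries."""
    lines = []
    for article in articles:
        # Assumption: an entry could lack a field; .get() avoids a KeyError.
        published = article.get("publishedAt")
        title = article.get("title")
        if published is None or title is None:
            continue
        lines.append(f"{published}: {title}")
    return lines


# Made-up sample data in the same shape as the parsed JSON (not real articles).
sample = [
    {"publishedAt": "2022-10-18T20:00:00Z", "title": "Example headline"},
    {"title": "Entry without a timestamp"},
]
for line in format_headlines(sample):
    print(line)
```

With a real response, you would pass resp.json() to format_headlines in place of the sample list and print the returned lines.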