I'm trying to scrape the press releases of a Danish political party (https://danskfolkeparti.dk/nyheder/), but the content of the press releases only appears after clicking 'search' within a web browser. There is no navigable HTML (that I can find) that exposes a unique URL for the 'search' function, and the website URL does not change after clicking search in a browser.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://danskfolkeparti.dk/nyheder/'
headers = {'User-Agent': 'Mozilla/5.0'}
soup = BeautifulSoup(requests.get(url=url, headers=headers).content, 'html.parser')
### the data I'm looking for would usually be accessible using something like the following.
### However, the HTML does not appear until AFTER the search is clicked within a browser
soup.find_all("div", class_='timeline')
Printing soup shows the HTML without the desired content. The search button on the website (Søg, in Danish) is not accessible as an endpoint. After clicking it in a browser, the desired content appears and is viewable by 'inspecting' the page, but the URL does not change, so there's no clear way to reach that page with BeautifulSoup.
The desired content is the title, url and date of each individual press release. For example, the first press release that appears when searching with default settings is the following:
title: Året 2022 viste hvad Dansk Folkeparti er gjort af
date: 23/12/2022
Any help with this would be greatly appreciated!!
CodePudding user response:
the HTML does not appear until AFTER the search is clicked
That was not what I experienced - when I went to the nyheder page, there were already 10 posts on the timeline, and more loaded when I scrolled down.
However, it's true that the HTML fetched by requests.get does not contain the timeline. It's an empty frame with just the top and bottom panes of the page; the rest is rendered with JavaScript. I can suggest 2 ways to get around this: either use Selenium, or scrape via their API.
Solution 1: Selenium
I have 2 functions which I often use for scraping:
linkToSoup_selenium, which takes a URL and [if everything goes ok] returns a BeautifulSoup object. For this site, you can use it to:
- scroll down a certain number of times [it's best to over-estimate how many scrolls you need]
- wait for the links and dates to load
- click "Accept Cookies" (if you want to; it doesn't make a difference, tbh)
selectForList, which takes a bs4 Tag and a list of CSS selectors and returns the corresponding details from that Tag. (If you are unfamiliar with CSS selectors, I often use this reference as a cheatsheet.)
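Both helpers live in the pastebin linked further down; if you'd rather not paste them in, here's a minimal sketch of the same scroll-wait-parse pattern in plain Selenium (scrolled_soup and select_details are illustrative stand-ins, not the pastebin functions):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrolled_soup(url, wait_css, scroll_n=20, pause=5):
    # load the page, wait for the target elements, then scroll to trigger lazy-loading
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css)))
        for _ in range(scroll_n):
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            time.sleep(pause)
        return BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        driver.quit()

def select_details(tag, selectors):
    # each selector is 'css' [returns text] or ('css', 'attribute') [returns attribute]
    details = []
    for sel in selectors:
        sel, attr = sel if isinstance(sel, tuple) else (sel, None)
        el = tag.select_one(sel)
        details.append(el and (el.get(attr) if attr else el.get_text(strip=True)))
    return details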
So, you can set up a reference dictionary (selRef) of selectors [that will be passed to selectForList later] and then fetch and parse the loaded HTML with linkToSoup_selenium:
selRef = {
    'title': 'div.content>a.post-link[href]', 'date': 'p.text-date',
    'url': ('div.content>a.post-link[href]', 'href'),
    # 'category': 'div.content>p.post-category-timeline',
    # 'excerpt': 'div.content>p.post-content-rendered',
}
soup = linkToSoup_selenium('https://danskfolkeparti.dk/nyheder/', ecx=[
    'div.timeline>div>div.content>a.post-link[href]',  # load links [+ titles]
    'div.timeline>div>p.text-date'  # load dates [probably redundant]
], clickFirst=[
    'a[role="button"][data-cli_action="accept_all"]'  # accept cookies [optional]
], by_method='css', scrollN=(20, 5), returnErr=True)  # scroll 20x with 5sec breaks
Since returnErr=True is set, the function will return a string containing an error message if something causes it to fail, so you should habitually check for that before trying to extract the data:
if isinstance(soup, str):
    print(type(soup), miniStr(soup)[:100])  # print error message
    prTimeline = []
else:
    prTimeline = [{k: v for k, v in zip(
        list(selRef.keys()), selectForList(pr, list(selRef.values()))
    )} for pr in soup.select('div.timeline>div')]
Now prTimeline looks something like:
## [only the first 3 of 76 are included below] ##
[{'title': 'Året 2022 viste hvad Dansk Folkeparti er gjort af',
'date': '23/12/2022',
'url': 'https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/aaret-2022-viste-hvad-dansk-folkeparti-er-gjort-af/'},
{'title': 'Mette Frederiksen har stemt danskerne hjem til 2001',
'date': '13/12/2022',
'url': 'https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/mette-frederiksen-har-stemt-danskerne-hjem-til-2001/'},
{'title': 'Vi klarede folketingsvalget – men der skal kæmpes lidt endnu',
'date': '23/11/2022',
'url': 'https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/vi-klarede-folketingsvalget-men-der-skal-kaempes-lidt-endnu/'}]
Solution 2: API
If you open the Network tab before clicking search (or just scrolling all the way down, or refreshing the page), you might see this request with a JSON response that is being used to fetch the data for populating the timeline. So, you just need to replicate this API request.
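For reference, a minimal sketch of replicating that request with plain requests (the per_page/before parameters mirror the query string used further down):
import requests

# the WP REST endpoint visible in the Network tab; a plain request
# will likely get a 403 from Cloudflare, as noted just below
resp = requests.get(
    'https://danskfolkeparti.dk/wp-json/wp/v2/posts',
    params={'per_page': 5, 'before': '2022-12-28T23:59:59'},
    headers={'User-Agent': 'Mozilla/5.0'},
)
print(resp.status_code, resp.headers.get('content-type'))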
However, as @SergeyK commented, "site have Cloudflare protection, so u cant get result without setting up cookies", and the same seems to be true for the API as well. I'm not good with setting headers and cookies as needed; so instead, I tend to just use cloudscraper [or HTMLSession sometimes] in such cases.
import cloudscraper

qStr = 'categories=20,85,15,83,73,84,&before=2022-12-28T23:59:59'
qStr += '&after=1990-05-08T01:01:01&per_page=99&page=1'
apiUrl = f'https://danskfolkeparti.dk/wp-json/wp/v2/posts?{qStr}'
prTimeline = [{
    'title': pr['title']['rendered'],
    'date': pr['date'],  # or: 'date': pr['date_gmt'],
    'url': pr['link']
} for pr in cloudscraper.create_scraper().get(apiUrl).json()]
The resulting prTimeline looks pretty similar to the Selenium output.
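Note that the WordPress REST API caps per_page at 100, so if the timeline ever grows past one page you'd have to walk the page parameter; a sketch, assuming the standard X-WP-TotalPages response header is exposed:
import cloudscraper

scraper = cloudscraper.create_scraper()
base = 'https://danskfolkeparti.dk/wp-json/wp/v2/posts?per_page=99'
first = scraper.get(f'{base}&page=1')
posts = first.json()
# X-WP-TotalPages is WordPress's standard pagination header
for page in range(2, int(first.headers.get('X-WP-TotalPages', 1)) + 1):
    posts += scraper.get(f'{base}&page={page}').json()
print(len(posts), 'posts in total')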
There's an expanded version using a set of functions that lets you get the same results with:
prTimeline, rStatus = danskfolkeparti_apiScraper(pathsRef={'title': ['title', 'rendered'], 'date': ['date'], 'url': ['link']})
But you can do much more, like passing searchFor={'before': '2022-10-01T00:00:00'} to only get posts before October, or searchFor="search terms" to search by keywords. A couple of caveats:
- you can't search for keywords and also set parameters like category/time/etc
- you have to make sure that before and after are in ISO format, that page is a positive integer, and that categories is a list of integers [or they might be ignored]; see the sketch after this list
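For instance, a hypothetical pre-flight check matching those caveats (check_args is illustrative, not one of the pastebin functions):
from datetime import datetime

def check_args(before=None, after=None, page=1, categories=()):
    for ts in (before, after):  # before/after must be ISO-format timestamps
        if ts is not None:
            datetime.fromisoformat(ts)  # raises ValueError if not ISO format
    assert isinstance(page, int) and page > 0, 'page must be a positive integer'
    assert all(isinstance(c, int) for c in categories), 'categories must be integers'

check_args(before='2022-10-01T00:00:00', page=1, categories=[20, 85])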
You can get more information if you leave the default arguments and make use of all of the functions, as below:
from bs4 import BeautifulSoup
### FIRST PASTE EVERYTHING FROM https://pastebin.com/aSQrW9ff ###
prTimeline, prtStatus = danskfolkeparti_apiScraper()
prCats = danskfolkeparti_catNames([pr['category'] for pr in prTimeline])
for pi, (pr, prCat) in enumerate(zip(prTimeline, prCats)):
    prTimeline[pi]['category'] = prCat
    cTxt = BeautifulSoup(pr['content'], 'html.parser').get_text(' ')
    cTxt = ' '.join(w for w in cTxt.split() if w)  # reduce whitespace
    prTimeline[pi]['content'] = cTxt
FULL RESULTS
The full results from the last code snippet, as well as the Selenium solution (with ALL of selRef uncommented), have been uploaded to this spreadsheet. The CSVs were saved using pandas:
import pandas

fileName = 'prTimeline.csv'  # 'dfn_prTimeline_api.csv' # 'dfn_prTimeline_sel.csv'
pandas.DataFrame(prTimeline).to_csv(fileName, index=False)
Also, if you are curious, you can see the categories used in the default API call with danskfolkeparti_categories([20,85,15,83,73,84]).