Scrape website search results when content only appears after clicking 'search' (Python)

I'm trying to scrape the press releases of a Danish political party (https://danskfolkeparti.dk/nyheder/), but the content of the press releases only appears after clicking 'search' within a web browser. There is no navigable HTML (that I can find) that identifies a unique URL for the 'search' function, and the website URL does not change after clicking search in a browser.

import requests
from bs4 import BeautifulSoup

url = 'https://danskfolkeparti.dk/nyheder/'
headers = {'User-Agent': 'Mozilla/5.0'}
soup = BeautifulSoup(requests.get(url=url, headers=headers).content, 'html.parser')

### the data I'm looking for would usually be accessible using something like the following;
### however, the HTML does not appear until AFTER the search is clicked within a browser
soup.find_all("div", class_='timeline')

Printing 'soup' shows the HTML without the desired content. The search button on the website (Søg, in Danish) is not accessible as an endpoint. After clicking it in a web browser, the desired content appears and is viewable by 'inspecting' the page, but the URL does not change, so there's no clear way to reach the loaded page with BeautifulSoup.

The desired content is the title, url and date of each individual press release. For example, the first press release that appears when searching with default settings is the following:

title: Året 2022 viste hvad Dansk Folkeparti er gjort af

date: 23/12/2022

url: https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/aaret-2022-viste-hvad-dansk-folkeparti-er-gjort-af/

Any help with this would be greatly appreciated!!

CodePudding user response:

the HTML does not appear until AFTER the search is clicked

That was not what I experienced - when I went to the nyheder page, there were already 10 posts on the timeline, and more loaded when I scrolled down.

However, it's true that the HTML fetched by requests.get does not contain the timeline. It's an empty frame with just the top and bottom panes of the page; the rest is rendered with JavaScript. I can suggest two ways to get around this: either use Selenium or scrape via their API.
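You can confirm the empty frame with the question's own fetch; the timeline selector finds nothing useful in the static HTML:

import requests
from bs4 import BeautifulSoup

# the statically served page has no timeline entries at all -
# they are rendered client-side by JavaScript
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('https://danskfolkeparti.dk/nyheder/', headers=headers).content
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('div', class_='timeline'))  # empty (or an empty shell)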



Solution 1: Selenium

I have two functions that I often use for scraping:

  • linkToSoup_selenium, which takes a URL and [if everything goes OK] returns a BeautifulSoup object (a stripped-down sketch of such a helper follows this list). For this site, you can use it to:
    • scroll down a certain number of times [it's best to over-estimate how many scrolls you need]
    • wait for the links and dates to load
    • click "Accept Cookies" (if you want to; it doesn't make a difference tbh)
  • selectForList, which takes a bs4 Tag and a list of CSS selectors and returns the corresponding details from that Tag (sketched further below, after it's used)
    • (If you are unfamiliar with CSS selectors, I often use this reference as a cheatsheet.)
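As a rough idea, a stripped-down, plain-Selenium version of linkToSoup_selenium might look something like the sketch below [the parameter names mirror the call further down, but the defaults and internals are illustrative, not the real helper's]:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def linkToSoup_sketch(url, ecx=None, clickFirst=None, scrollN=(20, 5), returnErr=True):
    # Load url, click optional elements, wait for expected content,
    # scroll to trigger lazy loading, then parse the rendered HTML.
    driver = webdriver.Chrome()  # assumes a working Chrome + driver setup
    try:
        driver.get(url)
        wait = WebDriverWait(driver, 15)
        for sel in (clickFirst or []):  # e.g. the cookie-consent button
            try:
                wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, sel))).click()
            except Exception:
                pass  # the banner may not be there; that's fine
        for sel in (ecx or []):  # wait until expected content has rendered
            wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, sel)))
        nScrolls, pause = scrollN
        for _ in range(nScrolls):  # scroll down to load more posts
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            time.sleep(pause)
        return BeautifulSoup(driver.page_source, 'html.parser')
    except Exception as e:
        if returnErr:
            return f'{type(e).__name__}: {e}'  # error message as a string
        raise
    finally:
        driver.quit()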

So, you can set up a reference dictionary (selRef) of selectors [that will be passed to selectForList later] and then fetch and parse the loaded HTML with linkToSoup_selenium:

selRef = {
    'title': 'div.content>a.post-link[href]', 'date': 'p.text-date',
    'url': ('div.content>a.post-link[href]', 'href'),
    # 'category': 'div.content>p.post-category-timeline',
    # 'excerpt': 'div.content>p.post-content-rendered',
}

soup = linkToSoup_selenium('https://danskfolkeparti.dk/nyheder/', ecx=[
    'div.timeline>div>div.content>a.post-link[href]',  # load links [and titles]
    'div.timeline>div>p.text-date'  # load dates [probably redundant]
], clickFirst=[
    'a[role="button"][data-cli_action="accept_all"]'  # accept cookies [optional]
], by_method='css', scrollN=(20, 5), returnErr=True)  # scroll 20x with 5-sec breaks

Since returnErr=True is set, the function will return a string containing an error message if something causes it to fail, so you should habitually check for that before trying to extract the data.

if isinstance(soup, str):
    print(type(soup), soup[:100])  # show the start of the error message
    prTimeline = []
else:
    prTimeline = [{k: v for k, v in zip(
        list(selRef.keys()), selectForList(pr, list(selRef.values()))
    )} for pr in soup.select('div.timeline>div')]
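In case you're wondering about selectForList, a rough reimplementation is sketched below: for each plain selector it returns the first match's text, and for a (selector, attribute) tuple it returns that attribute instead [illustrative only, not the real helper]:

def selectForList_sketch(tag, selectors):
    # For each selector, take the first matching element inside tag;
    # plain strings yield text, (selector, attribute) tuples yield attributes.
    results = []
    for sel in selectors:
        if isinstance(sel, tuple):
            el = tag.select_one(sel[0])
            results.append(el.get(sel[1]) if el else None)
        else:
            el = tag.select_one(sel)
            results.append(el.get_text(strip=True) if el else None)
    return results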

Now prTimeline looks something like

## [only the first 3 of 76 are included below] ##
[{'title': 'Året 2022 viste hvad Dansk Folkeparti er gjort af',
  'date': '23/12/2022',
  'url': 'https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/aaret-2022-viste-hvad-dansk-folkeparti-er-gjort-af/'},
 {'title': 'Mette Frederiksen har stemt danskerne hjem til 2001',
  'date': '13/12/2022',
  'url': 'https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/mette-frederiksen-har-stemt-danskerne-hjem-til-2001/'},
 {'title': 'Vi klarede folketingsvalget – men der skal kæmpes lidt endnu',
  'date': '23/11/2022',
  'url': 'https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/vi-klarede-folketingsvalget-men-der-skal-kaempes-lidt-endnu/'}]


Solution 2: API

If you open the Network tab before clicking search (or just scrolling all the way down, or refreshing the page), you might see this request, whose JSON response is used to populate the timeline. So you just need to replicate that API request.

However, as @SergeyK commented,

site have Cloudflare protection, so u cant get result without setting up cookies

and the same seems to be true for the API as well. I'm not good at setting up headers and cookies as needed, so I tend to just use cloudscraper [or HTMLSession sometimes] in such cases.

import cloudscraper

qStr = 'categories=20,85,15,83,73,84,&before=2022-12-28T23:59:59'
qStr += '&after=1990-05-08T01:01:01&per_page=99&page=1'
apiUrl = f'https://danskfolkeparti.dk/wp-json/wp/v2/posts?{qStr}'
prTimeline = [{
    'title': pr['title']['rendered'],
    'date': pr['date'],  # or pr['date_gmt'] for GMT timestamps
    'url': pr['link']
} for pr in cloudscraper.create_scraper().get(apiUrl).json()]

and the resulting prTimeline looks pretty similar to the Selenium output.
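The WordPress REST API caps per_page at 100 and reports the total page count in the X-WP-TotalPages response header, so if you want every post rather than just the first 99, you could page through with a loop like this sketch:

import cloudscraper

scraper = cloudscraper.create_scraper()
base = 'https://danskfolkeparti.dk/wp-json/wp/v2/posts?per_page=99&page={}'
prTimeline, page = [], 1
while True:
    resp = scraper.get(base.format(page))
    if resp.status_code != 200:
        break  # past the last page, or blocked
    prTimeline += [{
        'title': pr['title']['rendered'],
        'date': pr['date'],
        'url': pr['link'],
    } for pr in resp.json()]
    if page >= int(resp.headers.get('X-WP-TotalPages', page)):
        break  # that was the last page
    page += 1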


There's an expanded version, using a set of functions, that lets you get the same results with:

prTimeline, rStatus = danskfolkeparti_apiScraper(pathsRef={'title': ['title', 'rendered'], 'date': ['date'], 'url': ['link']})

But you can do much more, like passing searchFor={'before': '2022-10-01T00:00:00'} to only get posts before October, or searchFor="search terms" to search by keywords. Note that:

  • you can't search for keywords and also set parameters like category/time/etc.
  • you have to make sure that before and after are in ISO format, that page is a positive integer, and that categories is a list of integers [or they might be ignored] (see the example call after this list)
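For example, a date-bounded call obeying those rules might look like this [hypothetical: it assumes categories and page are also passed through searchFor, which depends on the pastebin functions]:

prTimeline, rStatus = danskfolkeparti_apiScraper(searchFor={
    'before': '2022-10-01T00:00:00',  # ISO format
    'after': '2022-01-01T00:00:00',   # ISO format
    'categories': [20, 85],           # a list of integers
    'page': 1,                        # a positive integer
})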

You can get more information if you leave the default arguments and make use of all of the functions, as below:

from bs4 import BeautifulSoup

### FIRST PASTE EVERYTHING FROM https://pastebin.com/aSQrW9ff ###

prTimeline, prtStatus = danskfolkeparti_apiScraper()
prCats = danskfolkeparti_catNames([pr['category'] for pr in prTimeline])
for pi, (pr, prCat) in enumerate(zip(prTimeline, prCats)):
    prTimeline[pi]['category'] = prCat
    cTxt = BeautifulSoup(pr['content'], 'html.parser').get_text(' ')
    cTxt = ' '.join(cTxt.split())  # collapse whitespace
    prTimeline[pi]['content'] = cTxt

FULL RESULTS

The full results from the last code snippet, as well as from the Selenium solution (with ALL of selRef uncommented), have been uploaded to this spreadsheet. The CSVs were saved using pandas:

import pandas

fileName = 'prTimeline.csv'  # or 'dfn_prTimeline_api.csv', 'dfn_prTimeline_sel.csv'
pandas.DataFrame(prTimeline).to_csv(fileName, index=False)

Also, if you are curious, you can see the categories in the default API call with danskfolkeparti_categories([20,85,15,83,73,84]).
