I am currently working on a side project to scrape the results of a web form that returns a table that is rendered with JavaScript.
I've managed to get this working fairly easily with Selenium. However, I am querying this form approximately 5,000 times based on a CSV file, which leads to a large processing time (approximately 9 hours).
I would like to know if there is a way I can access the response data directly through Python using the generated request URL instead of rendering the JavaScript.
The website form in question: https://probatesearch.service.gov.uk/
An example of the captured Network Request URL once both parts of the form are completed (entering a year before 1996 will return a different response; those responses can be ignored):
https://probatesearch.service.gov.uk/api/nuxeo/api/v1/search/pp/pp_mainstream_default_search/execute?pageProvider=pp_mainstream_default_search&currentPageIndex=0&hmcts_grant_schema_grantdocTypeOf=1&hmcts_grant_schema_surname=SMITH&hmcts_grant_schema_dateofdeath_min=2019-03-23T00:00:00.000Z&hmcts_grant_schema_dateofdeath_max=2019-03-23T00:00:00.000Z&hmcts_grant_schema_dateofprobate_min=2022-02-01T00:00:00.000Z&hmcts_grant_schema_dateofprobate_max=2022-03-02T00:00:00.000Z&hmcts_grant_schema_firstnames=TREVOR&sortBy=&sortOrder=DESC
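For reference, here is a minimal sketch (assuming only the captured URL above, abbreviated to a few parameters) of how urllib.parse can split such a URL into its base endpoint and query parameters, so the parameters could then be varied for each row of the CSV:
from urllib.parse import urlsplit, parse_qs

# Captured request URL from the browser's Network tab (shortened for illustration)
captured = ('https://probatesearch.service.gov.uk/api/nuxeo/api/v1/search/pp/'
            'pp_mainstream_default_search/execute?pageProvider=pp_mainstream_default_search'
            '&currentPageIndex=0&hmcts_grant_schema_surname=SMITH&sortOrder=DESC')

parts = urlsplit(captured)
base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'  # endpoint without the query string
params = parse_qs(parts.query)                             # dict of parameter name -> list of values

print(base_url)
print(params['hmcts_grant_schema_surname'])  # ['SMITH']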
I have tried to process this request using BeautifulSoup, urllib and requests, but have had no luck extracting meaningful data; I am fairly amateur when it comes to web scraping.
The output I keep getting from urllib or requests is shown in this screenshot: JSON Response
Unfortunately this does not include any of the actual data from the requested table (e.g. name, date of death, etc.).
I am hoping to capture the table response (if any) as JSON or a DataFrame for further processing. Any help is appreciated.
Edit: Here is a screenshot of the table data I am trying to access once the form is completed and submitted: Required Table
CodePudding user response:
The general answer is that the UK government (or perhaps just the court service) appears to be implementing an API for accessing the type of data you're looking for - you should definitely read up on that and on APIs generally.
More specifically, in your case the data is available through an API call, which you can see using the developer tools (Network tab) in your browser; there are plenty of guides on how to do this.
So in this case, I assume you know some (but not all) of the information about a case (in the example below: last name, year of death and year of probate) and send an API request containing that info. The call below retrieves 7 entries.
import requests

url = 'https://probatesearch.service.gov.uk/api/nuxeo/api/v1/search/pp/pp_mainstream_default_search/execute'

last_name, death, probate = 'SMITH', 2019, 2022

# Fields to pull out of each entry's 'properties' dict
targets = ['hmctsgrant:surname', 'hmctsgrant:firstnames', 'hmctsgrant:dateofdeath',
           'hmctsgrant:dateofprobate', 'hmctsgrant:probatenumber',
           'hmctsgrant:grantdocTypeoOfName', 'hmctsgrant:registryofficename']

# Query string parameters, mirroring the URL captured in the Network tab;
# the date ranges cover the whole year of death / year of probate
params = (
    ('pageProvider', 'pp_mainstream_default_search'),
    ('currentPageIndex', '0'),
    ('hmcts_grant_schema_grantdocTypeOf', '1'),
    ('hmcts_grant_schema_surname', f'{last_name}'),
    ('hmcts_grant_schema_dateofdeath_min', f'{death}-01-01T00:00:00.000Z'),
    ('hmcts_grant_schema_dateofdeath_max', f'{death}-12-31T00:00:00.000Z'),
    ('hmcts_grant_schema_dateofprobate_min', f'{probate}-01-01T00:00:00.000Z'),
    ('hmcts_grant_schema_dateofprobate_max', f'{probate}-12-31T00:00:00.000Z'),
    ('hmcts_grant_schema_firstnames', ''),
    ('sortBy', ''),
    ('sortOrder', 'DESC'),
)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0',
    'Accept': 'application/json',
    'Referer': 'https://probatesearch.service.gov.uk/search-results',
    'properties': 'hmcts_grant_schema',
}

response = requests.get(url, headers=headers, params=params)
data = response.json()

# Print the selected fields for each returned entry
for entry in data['entries']:
    info = entry['properties']
    for target in targets:
        print(info[target])
    print('------------')
The output in this case is:
Smith
Trevor Floyd
2019-03-23T00:00:00.000Z
2022-02-03T00:00:00.000Z
1641476859693801
ADMINISTRATION
Newcastle
------------
Smith
David William
2019-02-06T00:00:00.000Z
2022-02-04T00:00:00.000Z
1643363130442596
ADMINISTRATION
Newcastle
------------
etc.
You can obviously load the output into a pandas DataFrame, or anything else you need to work with.
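For example, a minimal sketch of that last step, assuming data and targets from the code above (pandas.json_normalize flattens each entry's nested 'properties' dict into dot-named columns):
import pandas as pd

# Flatten the list of entries; nested 'properties' fields become columns
# named 'properties.<field>'
df = pd.json_normalize(data['entries'])

# Keep only the fields of interest
df = df[[f'properties.{t}' for t in targets]]
print(df)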