I am trying to extract all the URLs from the search results of this website. It has 754 search results across 26 pages: https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/search/searchterm/Integrated Data Infrastructure (IDI)/field/projeb/mode/exact/conn/and
This is the code I wrote, but it didn't return anything. Sorry, I am new to Python. Can anyone give me a clue how to get there? Many thanks.
import requests
from bs4 import BeautifulSoup

url = 'https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/search/searchterm/Integrated Data Infrastructure (IDI)/field/projeb/mode/exact/conn/and'
reqs = requests.get(url, verify=False)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
CodePudding user response:
There are 754 books; I show an example with 35. To get all the books, change 35 to 754 at the end of the URL.
import pandas as pd
import requests

# Search API endpoint; maxRecords controls how many results are returned
url = 'https://cdm20045.contentdm.oclc.org/digital/api/search/collection/p20045coll17/searchterm/Integrated Data Infrastructure (IDI)/field/projeb/mode/exact/conn/and/maxRecords/35'
response = requests.get(url)

books = []
for book in response.json()['items']:
    books.append({
        # Prefix the host, then swap 'singleitem' for 'digital' to get the public page URL
        'link': ('https://cdm20045.contentdm.oclc.org' + book['itemLink']).replace('singleitem', 'digital'),
        'title': book['metadataFields'][0]['value'],
        'subjec': book['metadataFields'][1]['value'],
        'date': book['metadataFields'][2]['value'],
        'publis': book['metadataFields'][3]['value']
    })

df = pd.DataFrame(books)
print(df.to_string())
OUTPUT:
link title subjec date publis
0 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1181 The Future of Work in New Zealand - An Empirical Examiniation (MAA2019-95) Business Practices; 2021 Auckland University of Technology;
1 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/993 In Fairness to Our Schools: Better measures for better outcomes\n Education 2019 The New Zealand Initiative
2 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1029 Dementia case-finding and prevalence estimation using routinely collected health data in the Integrated Data Infrastructure (IDI) (MAA2020-12) Health; 2020 University of Auckland
3 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1218 Investigating linkage bias in the IDI using education and census data (MAA2020-69) Meta-research; Education; 2020-12 University of Auckland;
4 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/757 Explaining Ethnic Differences in Student Success at University in New Zealand [MAA2018-09] Education and training 2018 Auckland University of Technology
5 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1357 Health and social harms from alcohol: what does NZ's data tell us? Health; 2022-03 University of Otago;
6 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1268 'What about the Menz?' Low employer attachment and ineligibility for partner parental leave Income and Work; People and Communities; 2021-08 Auckland Council; Social Wellbeing Agency;
7 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/936 Migrant Networks, Brain Waste and its Economic Impacts: Evidence from Immigrants in New Zealand [MAA2019-31] Employment; People and Communities 2019 University of Auckland
8 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1306 Pacific uptake of temporary work visas Employment; Business financials; People and Communities; 2020-05 NZIER; Ministry of Business, Innovation & Employment, MBIE; Ministry of Foreign Affairs and Trade;
9 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/123 Firm Productivity Growth and Skill (MAA2012-16) Income and work; Business practices 2015 Motu Economic and Public Policy Research
10 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/380 The Productivity Costs of Four Health Conditions in New Zealand [MAA2016-59] Health; Income and work 2016 University of Otago
11 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/732 Differential labour supply effects of pension eligibility to beneficiaries and non-beneficiaries [MAA2018-50] Income and Work; Benefits and Social Services; 2018 Auckland University of Technology;
12 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/189 Intergenerational Analyses Using the IDI People and communities 2017 COMPASS (The University of Auckland)
13 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/198 School to work: What matters? Education and Employment of Young People Born in 1991 Education and training 2016 Ministry of Education
14 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/975 Equity Index of socioeconomic disadvantage in education [MAA2019-85] Education; Benefits and Social Services; Income and Work 2019-12 Ministry of Education
15 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1225 Linkage bias in the IDI (MAA2020-58) Meta-research; 2020-11 University of Otago;
16 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/431 The relationship between exposure to the natural environment and children's health at different life stages [MAA2017-11] Health 2017 Massey University
17 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1025 Who are the 1M and 1X? Police engagement with citizens in mental distress (MAA2020-08) Justice; Health; 2020 University of Auckland
18 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1073 Who received the Wage Subsidy and Wage Subsidy Extension? (MAA2018-48) Benefits and Social Services; 2020 Ministry of Social Development, MSD;
19 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/929 Factsheet IDI first stage - Exploring work-related claims difference by Maori and non-Maori Business Practices; Health; Income and Work; People and Communities 2019-02 Worksafe New Zealand
20 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1138 Measuring Commute Patterns Over Time: Using administrative data to identify where employees live and work (MAA2018-55) Transport; Employment; 2020 Motu Economic and Public Policy Research; New Zealand Transport Agency;
21 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/835 Evaluating the Family Start Programme [MAA2018-87] People and Communities; Health; Education; Employment; Benefits and Social Services; Justice 2018-12 Ministry for Vulnerable Children Oranga Tamariki
22 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/980 Access to primary health care services for people in Canterbury with poor access: improving our understanding of people who are unenrolled or tenuously enrolled with a general practice team [MAA2019-51] Health 2019-11 Pegasus Health (Charitable) Limited
23 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/996 Intergenerational Analyses Using the IDI:\nAn update\n [MAA2016-53] Population 2020-03 COMPASS, University of Auckland;
24 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1351 Individualised Funding in Aotearoa Benefits and Social Services; 2020 Nicholson Consulting;
25 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/39 Comparing the Household Economic Survey to administrative records: an analysis of income and benefit receipt (MAA2015-27) Benefits and Social Services 2017 New Zealand Treasury
26 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1320 International students and graduates and their impact on the NZ housing market (MAA2017-31) Housing; Education and Training; 2021 Universities New Zealand;
27 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1002 The expression, experience and transcendence of low-skill in Aotearoa New Zealand (MAA2019-91) Education and Training; 2019 Auckland University of Technology
28 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/394 Comparison of NZDep2013 with Index of Multiple Deprivation (IMD2013) [MAA2017-70] Health; Income and Work; Housing; Justice 2017 University of Otago
29 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1005 Accessibility of Disability Support Services funding in NZ (MAA2019-102) Health; 2019 Nicholson Consulting;
30 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1135 Understanding Parental education and health of Pacific families: Background and study protocol: Parental Education and Pacific Health, study protocol (MAA2018-47) Education; People and Communities; Children; 2020 University of Otago;
31 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/47 Evaluation of the Impact of the Youth Service: Youth Payment and Young Parent Payment (MAA2013-16) Benefits and Social Services 2017 New Zealand Treasury
32 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/63 Using IDI data to estimate fiscal impacts of better social sector performance (MAA2013-16) Benefits and Social Services 2016 New Zealand Treasury
33 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1040 Māori Student Transitions (MAA2020-35) Education and Training; Benefits and Social Services; Income and Work; Employment; 2020 Social Wellbeing Agency;
34 https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/64 Financial wellbeing of older workers following injury: research utilising Statistics New Zealand’s Integrated Data Infrastructure Health; Income and work 2017 University of Otago
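The two string-building steps in this answer (the API URL and the item link) can be pulled into small helpers, which also makes it easy to switch `maxRecords` from 35 to 754. This is only a sketch: the names `build_search_url` and `item_url` are hypothetical, and the shape of `itemLink` (`/singleitem/collection/.../id/N`) is an assumption inferred from the output above, not something confirmed by API documentation.

```python
from urllib.parse import quote

BASE = "https://cdm20045.contentdm.oclc.org"
COLLECTION = "p20045coll17"


def build_search_url(term, max_records, field="projeb"):
    """Build the search API URL, percent-encoding the search term."""
    # safe="" also encodes "/" so a term can never add extra path segments
    encoded = quote(term, safe="")
    return (
        f"{BASE}/digital/api/search/collection/{COLLECTION}"
        f"/searchterm/{encoded}/field/{field}"
        f"/mode/exact/conn/and/maxRecords/{max_records}"
    )


def item_url(item_link):
    """Turn an API itemLink into the public page URL (assumed itemLink shape)."""
    return (BASE + item_link).replace("singleitem", "digital")
```

For example, `build_search_url("Integrated Data Infrastructure (IDI)", 754)` gives a URL that can be passed to `requests.get`, and `item_url(book['itemLink'])` can replace the inline concatenation in the loop.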
CodePudding user response:
As @bmcculley has already stated, the required data is loaded dynamically via JavaScript, and BeautifulSoup (bs4) can't render JavaScript. So you have two options: use Selenium (more complex) or use the API.
Since the site uses an API, you can easily grab the required data from it with a GET request and get JSON back, which is the more robust way.
I've handled the pagination with a for loop and the range function.
Example:
import requests
import pandas as pd

# {page} is filled in for each request; maxRecords is the page size
api_url = 'https://cdm20045.contentdm.oclc.org/digital/api/search/collection/p20045coll17/searchterm/Integrated Data Infrastructure (IDI)/field/projeb/mode/exact/conn/and/page/{page}/maxRecords/30'

data = []
# 754 results at 30 per page means pages 1 through 26
for page in range(1, 27):
    r = requests.get(api_url.format(page=page))
    for link in r.json()['items']:
        url = 'https://cdm20045.contentdm.oclc.org' + link['itemLink'].replace('singleitem', 'digital')
        data.append({
            'URL': url
        })

df = pd.DataFrame(data)
print(df)
Output:
URL
0 https://cdm20045.contentdm.oclc.org/digital/co...
1 https://cdm20045.contentdm.oclc.org/digital/co...
2 https://cdm20045.contentdm.oclc.org/digital/co...
3 https://cdm20045.contentdm.oclc.org/digital/co...
4 https://cdm20045.contentdm.oclc.org/digital/co...
.. ...
749 https://cdm20045.contentdm.oclc.org/digital/co...
750 https://cdm20045.contentdm.oclc.org/digital/co...
751 https://cdm20045.contentdm.oclc.org/digital/co...
752 https://cdm20045.contentdm.oclc.org/digital/co...
753 https://cdm20045.contentdm.oclc.org/digital/co...
[754 rows x 1 columns]
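The hard-coded `range(1, 27)` follows from 754 results at 30 per page. A minimal sketch of the same pagination with the page count computed and an early stop on an empty page is below. The function names (`page_count`, `collect_links`, `fetch_page_live`) are hypothetical, and the fetch step is passed in as a parameter so the loop can be tried without touching the network; only `fetch_page_live` actually calls the site.

```python
import math

API_URL = (
    "https://cdm20045.contentdm.oclc.org/digital/api/search/collection/"
    "p20045coll17/searchterm/Integrated Data Infrastructure (IDI)/field/"
    "projeb/mode/exact/conn/and/page/{page}/maxRecords/30"
)


def page_count(total_records, per_page):
    """Number of pages needed to cover total_records."""
    return math.ceil(total_records / per_page)


def collect_links(fetch_page, total_records=754, per_page=30):
    """Walk the pages in order and gather every itemLink.

    fetch_page is any callable that takes a page number and returns
    the decoded JSON dict for that page.
    """
    links = []
    for page in range(1, page_count(total_records, per_page) + 1):
        items = fetch_page(page).get("items", [])
        if not items:
            # Stop early if the server returns fewer pages than expected
            break
        links.extend(item["itemLink"] for item in items)
    return links


def fetch_page_live(page):
    """Fetch one page from the real API (requires the requests package)."""
    import requests  # imported lazily so the helpers above work without it

    r = requests.get(API_URL.format(page=page), timeout=30)
    r.raise_for_status()  # fail loudly on HTTP errors instead of parsing bad JSON
    return r.json()
```

Calling `collect_links(fetch_page_live)` would return the raw `itemLink` values, which can then be converted to full URLs the same way as in the example above.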