Python beautifulsoup extract all urls from a website search results. New Python beginner


I am attempting to extract all the urls from the search results of this website. It has 754 search results across 26 pages. https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/search/searchterm/Integrated Data Infrastructure (IDI)/field/projeb/mode/exact/conn/and

This is the code I wrote, but it didn't return anything. Sorry, I am new to Python; can anyone give me a clue on how to get there? Many thanks.

import requests
from bs4 import BeautifulSoup

url = 'https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/search/searchterm/Integrated Data Infrastructure (IDI)/field/projeb/mode/exact/conn/and'
reqs = requests.get(url,verify=False)
soup = BeautifulSoup(reqs.text, 'html.parser')

urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))  # collect the href of every anchor tag
print(urls)

CodePudding user response:

There are 754 books; I show an example with 35. To get all the books, change 35 to 754 at the end of the URL.

import pandas as pd
import requests

url = 'https://cdm20045.contentdm.oclc.org/digital/api/search/collection/p20045coll17/searchterm/Integrated Data Infrastructure (IDI)/field/projeb/mode/exact/conn/and/maxRecords/35'
response = requests.get(url)
books = []
for book in response.json()['items']:
    # one dict per search result; metadataFields is read by position
    books.append({
        'link': ('https://cdm20045.contentdm.oclc.org' + book['itemLink']).replace('singleitem', 'digital'),
        'title': book['metadataFields'][0]['value'],
        'subjec': book['metadataFields'][1]['value'],
        'date': book['metadataFields'][2]['value'],
        'publis': book['metadataFields'][3]['value']
    })
df = pd.DataFrame(books)
print(df.to_string())


                                                                           link                                                                                                                                                                                                       title                                                                                              subjec     date                                                                                              publis
0   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1181                                                                                                                                  The Future of Work in New Zealand - An Empirical Examiniation (MAA2019-95)                                                                                 Business Practices;     2021                                                                  Auckland University of Technology;
1    https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/993                                                                                                                                           In Fairness to Our Schools: Better measures for better outcomes\n                                                                                           Education     2019                                                                          The New Zealand Initiative
2   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1029                                                              Dementia case-finding and prevalence estimation using routinely collected health data in the Integrated Data Infrastructure (IDI) (MAA2020-12)                                                                                             Health;     2020                                                                              University of Auckland
3   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1218                                                                                                                          Investigating linkage bias in the IDI using education and census data (MAA2020-69)                                                                           Meta-research; Education;  2020-12                                                                             University of Auckland;
4    https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/757                                                                                                                  Explaining Ethnic Differences in Student Success at University in New Zealand [MAA2018-09]                                                                              Education and training     2018                                                                   Auckland University of Technology
5   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1357                                                                                                                                          Health and social harms from alcohol: what does NZ's data tell us?                                                                                             Health;  2022-03                                                                                University of Otago;
6   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1268                                                                                                                 'What about the Menz?' Low employer attachment and ineligibility for partner parental leave                                                            Income and Work; People and Communities;  2021-08                                                          Auckland Council; Social Wellbeing Agency;
7    https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/936                                                                                                Migrant Networks, Brain Waste and its Economic Impacts: Evidence from Immigrants in New Zealand [MAA2019-31]                                                                  Employment; People and Communities     2019                                                                              University of Auckland
8   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1306                                                                                                                                                                      Pacific uptake of temporary work visas                                            Employment; Business financials; People and Communities;  2020-05  NZIER; Ministry of Business, Innovation & Employment, MBIE; Ministry of Foreign Affairs and Trade;
9    https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/123                                                                                                                                                             Firm Productivity Growth and Skill (MAA2012-16)                                                                 Income and work; Business practices     2015                                                            Motu Economic and Public Policy Research
10   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/380                                                                                                                                The Productivity Costs of Four Health Conditions in New Zealand [MAA2016-59]                                                                             Health; Income and work     2016                                                                                 University of Otago
11   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/732                                                                                               Differential labour supply effects of pension eligibility to beneficiaries and non-beneficiaries [MAA2018-50]                                                      Income and Work; Benefits and Social Services;     2018                                                                  Auckland University of Technology;
12   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/189                                                                                                                                                                    Intergenerational Analyses Using the IDI                                                                              People and communities     2017                                                                COMPASS (The University of Auckland)
13   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/198                                                                                                                         School to work: What matters? Education and Employment of Young People Born in 1991                                                                              Education and training     2016                                                                               Ministry of Education
14   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/975                                                                                                                                        Equity Index of socioeconomic disadvantage in education [MAA2019-85]                                            Education; Benefits and Social Services; Income and Work  2019-12                                                                               Ministry of Education
15  https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1225                                                                                                                                                                        Linkage bias in the IDI (MAA2020-58)                                                                                      Meta-research;  2020-11                                                                                University of Otago;
16   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/431                                                                                    The relationship between exposure to the natural environment and children's health at different life stages [MAA2017-11]                                                                                              Health     2017                                                                                   Massey University
17  https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1025                                                                                                                      Who are the 1M and 1X? Police engagement with citizens in mental distress (MAA2020-08)                                                                                    Justice; Health;     2020                                                                              University of Auckland
18  https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1073                                                                                                                                      Who received the Wage Subsidy and Wage Subsidy Extension? (MAA2018-48)                                                                       Benefits and Social Services;     2020                                                                Ministry of Social Development, MSD;
19   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/929                                                                                                                Factsheet IDI first stage - Exploring work-related claims difference by Maori and non-Maori                                  Business Practices; Health; Income and Work; People and Communities  2019-02                                                                                Worksafe New Zealand
20  https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1138                                                                                      Measuring Commute Patterns Over Time: Using administrative data to identify where employees live and work (MAA2018-55)                                                                              Transport; Employment;     2020                             Motu Economic and Public Policy Research; New Zealand Transport Agency;
21   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/835                                                                                                                                                          Evaluating the Family Start Programme [MAA2018-87]        People and Communities; Health; Education; Employment; Benefits and Social Services; Justice  2018-12                                                    Ministry for Vulnerable Children Oranga Tamariki
22   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/980  Access to primary health care services for people in Canterbury with poor access: improving our understanding of people who are unenrolled or tenuously enrolled with a general practice team [MAA2019-51]                                                                                              Health  2019-11                                                                 Pegasus Health (Charitable) Limited
23   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/996                                                                                                                                         Intergenerational Analyses Using the IDI:\nAn update\n [MAA2016-53]                                                                                          Population  2020-03                                                                    COMPASS, University of Auckland;
24  https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1351                                                                                                                                                                          Individualised Funding in Aotearoa                                                                       Benefits and Social Services;     2020                                                                               Nicholson Consulting;
25    https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/39                                                                                   Comparing the Household Economic Survey to administrative records: an analysis of income and benefit receipt (MAA2015-27)                                                                        Benefits and Social Services     2017                                                                                New Zealand Treasury
26  https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1320                                                                                                                 International students and graduates and their impact on the NZ housing market (MAA2017-31)                                                                    Housing; Education and Training;     2021                                                                           Universities New Zealand;
27  https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1002                                                                                                              The expression, experience and transcendence of low-skill in Aotearoa New Zealand (MAA2019-91)                                                                             Education and Training;     2019                                                                   Auckland University of Technology
28   https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/394                                                                                                                           Comparison of NZDep2013 with Index of Multiple Deprivation (IMD2013) [MAA2017-70]                                                           Health; Income and Work; Housing; Justice     2017                                                                                 University of Otago
29  https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1005                                                                                                                                    Accessibility of Disability Support Services funding in NZ (MAA2019-102)                                                                                             Health;     2019                                                                               Nicholson Consulting;
30  https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1135                                         Understanding Parental education and health of Pacific families: Background and study protocol: Parental Education and Pacific Health, study protocol  (MAA2018-47)                                                        Education; People and Communities; Children;     2020                                                                                University of Otago;
31    https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/47                                                                                                          Evaluation of the Impact of the Youth Service: Youth Payment and Young Parent Payment (MAA2013-16)                                                                        Benefits and Social Services     2017                                                                                New Zealand Treasury
32    https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/63                                                                                                                  Using IDI data to estimate fiscal impacts of better social sector performance (MAA2013-16)                                                                        Benefits and Social Services     2016                                                                                New Zealand Treasury
33  https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1040                                                                                                                                                                      Māori Student Transitions (MAA2020-35)                  Education and Training; Benefits and Social Services; Income and Work; Employment;     2020                                                                            Social Wellbeing Agency;
34    https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/64                                                                           Financial wellbeing of older workers following injury: research utilising Statistics New Zealand’s Integrated Data Infrastructure                                                                             Health; Income and work     2017                                                                                 University of Otago
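The link-building step in the loop above can be tried on its own, without a network call. This is only a sketch: the sample `itemLink` value is an assumption inferred from the links in the output (the raw API response is not shown in this answer), and `build_link` is a hypothetical helper name.

```python
BASE = 'https://cdm20045.contentdm.oclc.org'

def build_link(item_link):
    """Join the site root with an API itemLink and swap 'singleitem' for 'digital'."""
    return (BASE + item_link).replace('singleitem', 'digital')

# Assumed shape of an itemLink value; the real API response may differ.
sample = '/singleitem/collection/p20045coll17/id/1181'
print(build_link(sample))
```

If the API ever returns links in a different shape, only this helper would need to change.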

CodePudding user response:

@bmcculley has already stated that the required data is loaded dynamically via JS, and bs4 can't render JS. So you have two options: use Selenium (more complex) or use the API. Since the site uses an API, you can easily grab the required data from it with a GET request, as JSON, which is the more robust way.
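Comparing the search page URL from the question with the API URLs used in this thread suggests a simple rewrite rule between the two. The sketch below applies that rule; it is a pattern inferred from these two URLs only, not a documented API contract, and `to_api_url` is a hypothetical helper name.

```python
def to_api_url(search_url, max_records=None):
    """Rewrite a search page URL into the matching API URL (inferred pattern)."""
    # Page URL: <root>/digital/collection/<alias>/search/<query path>
    # API URL:  <root>/digital/api/search/collection/<alias>/<query path>
    root, sep, tail = search_url.partition('/digital/collection/')
    if not sep:
        raise ValueError('not a /digital/collection/ URL')
    alias, sep, query_path = tail.partition('/search/')
    if not sep:
        raise ValueError('no /search/ segment found')
    api_url = f'{root}/digital/api/search/collection/{alias}/{query_path}'
    if max_records is not None:
        api_url += f'/maxRecords/{max_records}'
    return api_url

page_url = ('https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/search/'
            'searchterm/Integrated Data Infrastructure (IDI)/field/projeb/mode/exact/conn/and')
print(to_api_url(page_url, max_records=35))
```

In practice you can also find the API URL by watching the network tab of the browser's developer tools while the search page loads.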

I've handled the pagination with a for loop and the range function.


import requests
import pandas as pd
api_url = 'https://cdm20045.contentdm.oclc.org/digital/api/search/collection/p20045coll17/searchterm/Integrated Data Infrastructure (IDI)/field/projeb/mode/exact/conn/and/page/{page}/maxRecords/30'

data = []
for page in range(1,27):
    r = requests.get(api_url.format(page=page))
    for link in r.json()['items']:
        url = 'https://cdm20045.contentdm.oclc.org' + link['itemLink'].replace('singleitem', 'digital')
        data.append(url)  # keep every result link across all pages

df = pd.DataFrame(data)
print(df)


0    https://cdm20045.contentdm.oclc.org/digital/co...
1    https://cdm20045.contentdm.oclc.org/digital/co...
2    https://cdm20045.contentdm.oclc.org/digital/co...
3    https://cdm20045.contentdm.oclc.org/digital/co...
4    https://cdm20045.contentdm.oclc.org/digital/co...
..                                                 ...
749  https://cdm20045.contentdm.oclc.org/digital/co...
750  https://cdm20045.contentdm.oclc.org/digital/co...
751  https://cdm20045.contentdm.oclc.org/digital/co...
752  https://cdm20045.contentdm.oclc.org/digital/co...
753  https://cdm20045.contentdm.oclc.org/digital/co...

[754 rows x 1 columns]
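The `range(1,27)` in the loop above follows from the numbers in the question: 754 results at 30 records per page need 26 pages. A minimal sketch of that arithmetic, with `page_numbers` as a hypothetical helper name and the total taken from the question rather than read from the API:

```python
import math

def page_numbers(total_records, per_page):
    """Return the 1-based page numbers needed to cover total_records."""
    pages = math.ceil(total_records / per_page)
    return range(1, pages + 1)

pages = page_numbers(754, 30)
print(len(pages), pages[0], pages[-1])  # 26 1 26
```

Computing the range this way avoids hard-coding 27 if the result count or page size changes.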