What is the best way to scrape multiple URLs and handle the pagination problem (load more button)?

The main link is (https://www.europarl.europa.eu/meps/en/197818/BILLY_KELLEHER/meetings/past#detailedcardmep)

My code only shows me the first page for each link, but I need to go through all the pages for all the links (I have more than 100 links).

from bs4 import BeautifulSoup
import requests

page=0
list=[]

isHaveNextPage=True
links = [
    f"https://www.europarl.europa.eu/meps/en/loadmore-meetings?meetingType=PAST&memberId=197506&termId=9&page={page}&pageSize=10",
    f"https://www.europarl.europa.eu/meps/en/loadmore-meetings?meetingType=PAST&memberId=124861&termId=9&page={page}&pageSize=10",
    f"https://www.europarl.europa.eu/meps/en/loadmore-meetings?meetingType=PAST&memberId=229519&termId=9&page={page}&pageSize=10",
    # ..... (remaining links elided)
]
while(isHaveNextPage):
    for url in links:
        r= requests.get(url).text
        soup =BeautifulSoup(r,"lxml")
        product = soup.find_all("div",class_="europarl-expandable-item")
    
        for data in product:
            title = data.find(class_="t-item").get_text()
            date = data.find(class_="erpl_document-subtitle-date").get_text()
            address = data.find(class_="erpl_document-subtitle-location").get_text()
            reporter = data.find(class_="erpl_document-subtitle-reporter").get_text()
            author = data.find(class_="erpl_document-subtitle-author").get_text()
        
            list.append([author.strip(), date.strip(), address.strip(), reporter.strip(), title.strip()])
        
        print("page---",page)         
        if soup.find("button",class_='btn btn-default europarl-expandable-async-loadmore') is None:
            isHaveNextPage=False
        page += 1

CodePudding user response：

The problem is that you may be incrementing the page number, but the f-string has already been evaluated. Updating page afterwards does not change the string at all. You have to rebuild the string with the new value each time.

Instead of this: f"https://...&page={page}..."
do this: "https://...&page=%i..."

Then do this:

for url in links:
    r= requests.get(url % page).text

Alternatively, you can do this: "https://...&page={}..."
and this: r= requests.get(url.format(page)).text

Both versions are just different ways to format a string after the string has already been created. The version of formatting you used only allows you to format the string during creation.
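
To see the difference concretely, here is a minimal sketch. The example.com URL is a made-up placeholder, not the real endpoint. The f-string captures the value of page once, while the %i and {} templates are filled in each time they are used:

```python
page = 0

# An f-string is evaluated immediately: the current value of `page` is baked in.
fixed = f"https://example.com/items?page={page}"

# Plain templates keep their placeholder until they are formatted.
percent_template = "https://example.com/items?page=%i"
brace_template = "https://example.com/items?page={}"

page = 3

print(fixed)                         # still ends in page=0
print(percent_template % page)       # ends in page=3
print(brace_template.format(page))   # ends in page=3
```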

CodePudding user response：

Here is one way of getting that data, handling pagination, and generally solving this issue in a decent manner:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from tqdm import tqdm ## if using Jupyter: from tqdm.notebook import tqdm 

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)
big_list = []
slightly_incompetent_people_ids = ['197818', '96829', '197530', '97968', '197691', '189065', '197636', '33997']
for p in tqdm(slightly_incompetent_people_ids):
    counter = 0
    while True:
        soup = bs(s.get(f'https://www.europarl.europa.eu/meps/en/loadmore-meetings?meetingType=PAST&memberId={p}&termId=9&page={counter}&pageSize=20').text, 'html.parser')
        # Class names below are assumed from the markup used in the question.
        # select_one returns None when the "load more" button is absent.
        has_more = soup.select_one('button.europarl-expandable-async-loadmore')

        meetings = soup.select('div.europarl-expandable-item')
        for m in meetings:
            title = m.select_one('h3').text.strip()
            date = m.select_one('span.erpl_document-subtitle-date').text.strip()
            place = m.select_one('span.erpl_document-subtitle-location').text.strip()
            big_list.append((p, title, date, place))
        if has_more is None:
            # No button means this was the last page for this MEP.
            break
        counter += 1
df = pd.DataFrame(big_list, columns = ['MEP', 'Title', 'Date', 'Place'])
print(df)

Result in terminal:

100%
8/8 [00:01<00:00, 5.61it/s]
MEP Title   Date    Place
0   197818  AIFMD   25-05-2022  Virtual meeting
1   197818  DORA    25-05-2022  Virtual meeting
2   197818  AIFMD   25-05-2022  Virtual meeting
3   197818  AIFMD   18-05-2022  Brussels
4   197818  AIFMD   17-05-2022  Virtual meeting
... ... ... ... ...
77  33997   Meeting with H.E. Aigul Kuspan, the Ambassador of the Republic of Kazakhstan to the Kingdom of Belgium and Head of Mission of the Republic of Kazakhstan to the European Union  08-01-2020  European Parliament
78  33997   Meeting with H.E. Daniel Ioniță, Ambassador Extraordinary and Plenipotentiary of Romania to the Republic of Moldova 09-12-2019  Embassy of Romania to the Republic of Moldova
79  33997   Meeting with Mihai Chirica, Mayor of Iași   07-12-2019  Iași, Romania
80  33997   Meeting with Laura Codruța Kövesi, the European Public Prosecut 06-11-2019  European Parliament
81  33997   Meeting with Tony Murphy, Member of the European Court of Auditors  24-09-2019  European Parliament
82 rows × 4 columns

You can extract more details from each meeting, and you can add more MEP IDs to that list.
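
Since adding MEP IDs only changes the outer loop, it can help to keep the pagination logic separate from the HTTP and parsing code. The following is a rough sketch under that assumption, not the code from the answer: scrape_all and the fetch_page callback are hypothetical names, and fetch_page(member_id, page) is assumed to return the parsed rows for one page plus a flag saying whether a "load more" button was present. The fake page table stands in for real requests so the loop can be tried offline.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, ...]
# Hypothetical callback: returns (rows on this page, whether more pages exist).
FetchPage = Callable[[str, int], Tuple[List[Row], bool]]


def scrape_all(member_ids: Iterable[str], fetch_page: FetchPage) -> List[Row]:
    """Collect rows for every member, following pagination until it runs out."""
    all_rows: List[Row] = []
    for member_id in member_ids:
        page = 0  # each member starts again from the first page
        while True:
            rows, has_more = fetch_page(member_id, page)
            # Prefix each row with the member ID so rows stay attributable.
            all_rows.extend((member_id, *row) for row in rows)
            if not has_more:
                break
            page += 1
    return all_rows


# Stand-in for the real request/parse step (made-up data, not real meetings).
FAKE_PAGES = {
    ("A", 0): ([("t1", "d1")], True),
    ("A", 1): ([("t2", "d2")], False),
    ("B", 0): ([("t3", "d3")], False),
}

result = scrape_all(["A", "B"], lambda m, p: FAKE_PAGES[(m, p)])
print(result)
```

In real use, fetch_page would wrap the s.get(...) call and the BeautifulSoup selectors, and the returned rows could be passed to pd.DataFrame as before.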
