The main link is https://www.europarl.europa.eu/meps/en/197818/BILLY_KELLEHER/meetings/past#detailedcardmep
My code only shows me the first page, but I need to go through all of the pages for every link (I have more than 100 links).
from bs4 import BeautifulSoup
import requests
page=0
list=[]
isHaveNextPage=True
links = [(f"https://www.europarl.europa.eu/meps/en/loadmore-meetings?meetingType=PAST&memberId=197506&termId=9&page={page}&pageSize=10"), (f"https://www.europarl.europa.eu/meps/en/loadmore-meetings?meetingType=PAST&memberId=124861&termId=9&page={page}&pageSize=10"), (f"https://www.europarl.europa.eu/meps/en/loadmore-meetings?meetingType=PAST&memberId=229519&termId=9&page={page}&pageSize=10"
.....),
while(isHaveNextPage):
    for url in links:
        r = requests.get(url).text
        soup = BeautifulSoup(r, "lxml")
        product = soup.find_all("div", class_="europarl-expandable-item")
        for data in product:
            title = data.find(class_="t-item").get_text()
            date = data.find(class_="erpl_document-subtitle-date").get_text()
            address = data.find(class_="erpl_document-subtitle-location").get_text()
            reporter = data.find(class_="erpl_document-subtitle-reporter").get_text()
            author = data.find(class_="erpl_document-subtitle-author").get_text()
            list.append([author.strip(), date.strip(), address.strip(), reporter.strip(), title.strip()])
        print("page---", page)
    if soup.find("button", class_='btn btn-default europarl-expandable-async-loadmore') is None:
        isHaveNextPage = False
    page += 1
CodePudding user response:
The problem is that, although you increment the page number, the f-strings were already evaluated when the list was built. Updating page afterwards
does not change those strings at all. You have to rebuild the URL with the new page number on every request.
Instead of this:
    f"https://...&page={page}..."
do this:
    "https://...&page=%i..."
Then do this:
    for url in links:
        r = requests.get(url % page).text
Alternately, you can do this:
    "https://...&page={}..."
and this:
    r = requests.get(url.format(page)).text
Both versions are just different ways of formatting a string after it has been created. An f-string, which is what you used, is formatted only once, at the moment it is created.
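To make the difference concrete, here is a minimal sketch that needs no network access. The names URL_TEMPLATE and build_url are made up for this illustration; the URL pattern is the one from the question. Named placeholders are used here instead of a bare {}, which is an assumption about what reads best, not a requirement.

```python
# Hypothetical names for illustration; the URL pattern comes from the question.
URL_TEMPLATE = (
    "https://www.europarl.europa.eu/meps/en/loadmore-meetings"
    "?meetingType=PAST&memberId={member_id}&termId=9&page={page}&pageSize=10"
)


def build_url(member_id, page):
    """Fill the template at call time, so each call uses the current page."""
    return URL_TEMPLATE.format(member_id=member_id, page=page)


page = 0
frozen = f"page={page}"  # the f-string is evaluated right here, once
page += 1                # changing page later does not touch `frozen`

first_url = build_url("197506", 0)
second_url = build_url("197506", page)

print(frozen)
print(second_url)
```

One thing to keep in mind with the %i variant: it only works as long as the URL contains no other percent signs, such as percent-encoded characters.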
CodePudding user response:
Here is one way of getting that data, handling pagination, and generally solving this issue in a decent manner:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from tqdm import tqdm ## if using Jupyter: from tqdm.notebook import tqdm
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
big_list = []
slightly_incompetent_people_ids = ['197818', '96829', '197530', '97968', '197691', '189065', '197636', '33997']
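for p in tqdm(slightly_incompetent_people_ids):
    counter = 0  # page index, restarts at 0 for every MEP
    while True:
        soup = bs(s.get(f'https://www.europarl.europa.eu/meps/en/loadmore-meetings?meetingType=PAST&memberId={p}&termId=9&page={counter}&pageSize=20').text, 'html.parser')
        # Class names below are the ones used in the question's code.
        # The "load more" button is only present while further pages exist.
        has_more = soup.select_one('button[class="btn btn-default europarl-expandable-async-loadmore"]')
        meetings = soup.select('div[class="europarl-expandable-item"]')
        for m in meetings:
            title = m.select_one('h3').text.strip()
            date = m.select_one('span[class="erpl_document-subtitle-date"]').text.strip()
            place = m.select_one('span[class="erpl_document-subtitle-location"]').text.strip()
            big_list.append((p, title, date, place))
        if has_more is None:
            break  # last page for this MEP
        counter += 1  # otherwise request the next page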
df = pd.DataFrame(big_list, columns = ['MEP', 'Title', 'Date', 'Place'])
print(df)
Result in terminal:
100%
8/8 [00:01<00:00, 5.61it/s]
MEP Title Date Place
0 197818 AIFMD 25-05-2022 Virtual meeting
1 197818 DORA 25-05-2022 Virtual meeting
2 197818 AIFMD 25-05-2022 Virtual meeting
3 197818 AIFMD 18-05-2022 Brussels
4 197818 AIFMD 17-05-2022 Virtual meeting
... ... ... ... ...
77 33997 Meeting with H.E. Aigul Kuspan, the Ambassador of the Republic of Kazakhstan to the Kingdom of Belgium and Head of Mission of the Republic of Kazakhstan to the European Union 08-01-2020 European Parliament
78 33997 Meeting with H.E. Daniel Ioniță, Ambassador Extraordinary and Plenipotentiary of Romania to the Republic of Moldova 09-12-2019 Embassy of Romania to the Republic of Moldova
79 33997 Meeting with Mihai Chirica, Mayor of Iași 07-12-2019 Iași, Romania
80 33997 Meeting with Laura Codruța Kövesi, the European Public Prosecut 06-11-2019 European Parliament
81 33997 Meeting with Tony Murphy, Member of the European Court of Auditors 24-09-2019 European Parliament
82 rows × 4 columns
You can extract more details from each meeting, and you can add more MEP IDs to that list.
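If you extend the script, it may help to keep the paging logic separate from the parsing. The following is only a sketch of that idea: iter_pages and fake_fetch are hypothetical names, and the fake pages stand in for the HTTP request and the check for the "load more" button, so the example runs without network access.

```python
from typing import Callable, Iterator, List, Tuple


def iter_pages(fetch_page: Callable[[int], Tuple[List[str], bool]]) -> Iterator[str]:
    """Yield items page by page until fetch_page reports no further page."""
    page = 0
    while True:
        items, has_more = fetch_page(page)
        yield from items
        if not has_more:
            break
        page += 1  # move on to the next page


# Made-up data standing in for three server responses.
FAKE_PAGES = [["AIFMD", "DORA"], ["AIFMD"], []]


def fake_fetch(page: int) -> Tuple[List[str], bool]:
    """Return the items of one fake page and whether another page follows."""
    return FAKE_PAGES[page], page < len(FAKE_PAGES) - 1


titles = list(iter_pages(fake_fetch))
print(titles)
```

In the real script, fetch_page would wrap the s.get(...) call for one MEP and return the parsed meetings together with a flag saying whether the button was found.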
Relevant documentation for packages used: