Pagination not iterating over pages

Time:02-10

I want to iterate over all pages of https://www.iata.org/en/about/members/airline-list/ and dump the results into a .csv file. Only the first page gets dumped. How can I get all of them?

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request

url = 'https://www.iata.org/en/about/members/airline-list/'
req = Request(url , headers = {
                            'accept':'*/*',
                            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'})

data = []

while True:
    print(url)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('table.datatable').prettify())[0])

    if soup.select_one('span.pagination-link.is-active   div a[href]'):
        url = soup.select_one('span.pagination-link.is-active   div a')['href']
    else:
        break
df = pd.concat(data)
df.to_csv('airline-list.csv',encoding='utf-8-sig',index=False)

CodePudding user response:

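Try this approach, which requests each page by number instead of following the pagination link. It replaces the while loop in your code:

data = []
for i in range(1, 30):  # requests pages 1 through 29
    url = f'https://www.iata.org/en/about/members/airline-list/?page={i}&search=&ordering=Alphabetical'
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('table.datatable').prettify())[0])
df = pd.concat(data)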
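
The range(1, 30) above hard-codes the page count. As a rough sketch of one alternative, the loop below keeps requesting pages until one comes back without a table. It assumes the site serves no table.datatable (or an empty one) past the last page, which has not been verified here; if the site repeats the last page instead, only the max_pages cap stops the loop. The names collect_pages, fetch_airline_table and max_pages are illustrative, not part of any library.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")

BASE_URL = "https://www.iata.org/en/about/members/airline-list/"


def collect_pages(fetch: Callable[[int], Optional[T]], max_pages: int = 200) -> List[T]:
    """Call fetch(1), fetch(2), ... until it returns None or max_pages is reached."""
    results: List[T] = []
    for page in range(1, max_pages + 1):
        item = fetch(page)
        if item is None:  # assumption: no table means we ran past the last page
            break
        results.append(item)
    return results


def fetch_airline_table(page: int):
    """Hypothetical fetcher: the page's table as a DataFrame, or None if absent."""
    # Imported here so collect_pages can be used without these libraries installed.
    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    params = {"page": page, "search": "", "ordering": "Alphabetical"}
    html = requests.get(BASE_URL, params=params, timeout=30)
    table = BeautifulSoup(html.text, "html.parser").select_one("table.datatable")
    if table is None:
        return None
    try:
        frames = pd.read_html(StringIO(str(table)))
    except ValueError:  # pandas found nothing parseable in the markup
        return None
    return None if frames[0].empty else frames[0]
```

With pandas imported as pd, the frames could then be combined with pd.concat(collect_pages(fetch_airline_table)) and written out with to_csv as in the question.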

CodePudding user response:

To read the number of pages from the site instead of hard-coding it, use:

import pandas as pd
import requests
import bs4

url = 'https://www.iata.org/en/about/members/airline-list/?page={page}&search=&ordering=Alphabetical'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

# Total number of pages, taken from the second-to-last pagination link
html = requests.get(url.format(page=1), headers=headers)
soup = bs4.BeautifulSoup(html.text, 'html.parser')
pages = int(soup.find_all('a', {'class': 'pagination-link'})[-2].text)

# Fetch every page and collect its table
data = []
for page in range(1, pages + 1):
    html = requests.get(url.format(page=page), headers=headers)
    data.append(pd.read_html(html.text)[0])
df = pd.concat(data)

Output:

>>> df
             Airline Name IATA Designator  3 digit code ICAO code Country / Territory
0                 ABX Air              GB           832       ABX       United States
1         Aegean Airlines              A3           390       AEE              Greece
2              Aer Lingus              EI            53       EIN             Ireland
3          Aero Republica              P5           845       RPB            Colombia
4                Aeroflot              SU           555       AFL  Russian Federation
..                    ...             ...           ...       ...                 ...
5   Aerolineas Argentinas              AR            44       ARG           Argentina
6                 Aeromar              VW           942       TAO              Mexico
7              Aeromexico              AM           139       AMX              Mexico
8   Africa World Airlines              AW           394       AFW               Ghana
9             Air Algérie              AH           124       DAH             Algeria

[290 rows x 5 columns]
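
The row labels in the output above run from 0 to 9 even though the frame has 290 rows, because pd.concat keeps each page's own index by default. A small sketch of the difference that ignore_index=True makes, using two made-up stand-in pages rather than scraped data:

```python
import pandas as pd

# Two stand-in "pages" (invented rows for illustration, not real airline data).
page1 = pd.DataFrame({"Airline Name": ["A", "B"], "ICAO code": ["AAA", "BBB"]})
page2 = pd.DataFrame({"Airline Name": ["C", "D"], "ICAO code": ["CCC", "DDD"]})

# Default: each frame keeps its own labels, so they repeat (0, 1, 0, 1).
repeated = pd.concat([page1, page2])

# ignore_index=True renumbers the combined rows (0, 1, 2, 3).
renumbered = pd.concat([page1, page2], ignore_index=True)
```

Applied to the answer's code, that would mean df = pd.concat(data, ignore_index=True). When the frame is written with to_csv(..., index=False) as in the question, the index is not saved either way, so this mainly matters for inspecting or slicing df.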