Scraping table with multiple pages throwing AttributeError

Time:09-06

I am trying to loop through multiple pages of a website that I am scraping with BeautifulSoup.

pg = soup.find('ul', 'pagination')
current_pg = pg.find('li', 'active')
next_url = current_pg.findNextSibling('li').a.get('href')

Any ideas on how to solve the AttributeError: 'NoneType' object has no attribute 'get'?

CodePudding user response:

To get to the next page, parse the href from the <a rel="next"> tag. If that <a> doesn't exist, exit the loop:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://nicn.gov.ng/judgement?page=1"

while True:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # parse the table and print some data:
    df = pd.read_html(str(soup))[0]
    print(url)
    print(df.tail())
    print("-" * 80)

    next_link = soup.select_one("a[rel=next]")
    if not next_link:
        break

    url = next_link["href"]

Prints all 20 pages:


...
--------------------------------------------------------------------------------
https://nicn.gov.ng/judgement?page=19
    S/N            Suit No                                                           Case Title                    Parties                                   Respondents                    Justice       Judgment        Date
95   96  NICN/ABJ/273/2014                       CHUKUEZI CHINEDU GOODWILL VS SIRAJ NIGERIA LTD  CHUKUEZI CHINEDU GOODWILL                             SIRAJ NIGERIA LTD  HON. JUSTICE E.D. E ISELE  view judgment  1970-01-01
96   97  NICN/ABJ/302/2012      BASIL ONYEBUCHI OKORO VS NIGERIA NATIONAL PETROLEUM CORPORATION      BASIL ONYEBUCHI OKORO        NIGERIA NATIONAL PETROLEUM CORPORATION  HON. JUSTICE E.D. E ISELE  view judgment  1970-01-01
97   98  NICN/ABJ/340/2013                            CHINEKWU NNENNA UDOKWU VS ZENITH BANK PLC     CHINEKWU NNENNA UDOKWU                               ZENITH BANK PLC  HON. JUSTICE E.D. E ISELE  view judgment  1970-01-01
98   99  NICN/ABJ/202/2013                IBRAHIM MUSLIM AYOADE VS NIGERIA BOTTLING COMPANY LTD      IBRAHIM MUSLIM AYOADE                  NIGERIA BOTTLING COMPANY LTD  HON. JUSTICE E.D. E ISELE  view judgment  1970-01-01
99  100  NICN/ABJ/246/2013  MR. MAHA ISIAKA ABU VS SKYE BANK (FORMERLY MAINSTREET BANK LIMITED)        MR. MAHA ISIAKA ABU  SKYE BANK (FORMERLY MAINSTREET BANK LIMITED)  HON. JUSTICE E.D. E ISELE  view judgment  1970-01-01
--------------------------------------------------------------------------------
https://nicn.gov.ng/judgement?page=20
    S/N           Suit No                                                                                                     Case Title                   Parties                                                                              Respondents                                                                         Justice       Judgment        Date
26   27  NICN/CA/141/2013                                                 ENGR. PATRICK EDET OQUA VS ATTORNEY-GENERAL, CROSS RIVER STATE   ENGR. PATRICK EDET OQUA                                                      ATTORNEY-GENERAL, CROSS RIVER STATE                                               HONOURABLE JUSTICE E. N. AGBAKOBA  view judgment  1970-01-01
27   28  NICN/LA/243/2013                                                                       Emmanuel Fagbamila V University of Lagos        Emmanuel Fagbamila                                                                      University of Lagos                                                      Hon. Justice P.O Lifu (JP)  view judgment  0000-00-00
28   29    NIC/LA/03/2011  Sunday Olufelo VERSUS Schlumberger Anadrill Nigeria Ltd. . Schlumberger Support Nigeria Ltd.Schlumberger Ltd.            Sunday Olufelo  Schlumberger Anadrill Nigeria Ltd. . Schlumberger Support Nigeria Ltd.Schlumberger Ltd.  Hon. Justice B. B. Kanyip - Presiding Judge Hon. Justice O. A. Obaseki-Osaghae  view judgment  0000-00-00
29   30  NICN/LA/291/2012                                                       CAPTAIN SOLOMON J. GAMRA V CHANCHANGI AIRLINES (NIG) LTD  Captain Solomon J. Gamra                                                            Chanchangi Airlines (Nig) Ltd                                               HON. JUSTICE O.A. OBASEKI-OSAGHAE  view judgment  0000-00-00
30   31   NICN/CA/75/2013                                                                                        MR. MATTHEW EBONG UDO V     MR. MATTHEW EBONG UDO                                                     NATIONAL EXAMINATIONS COUNCIL (NECO)                                                 HON.JUSTICE O.A OBASEKI-OSAGHAE  view judgment  0000-00-00
--------------------------------------------------------------------------------
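The loop above assigns next_link["href"] straight to url, which works when the href is an absolute URL. As a hedged sketch, assuming a site might instead return a relative href such as "?page=2", the value can be resolved against the current URL with urljoin from the standard library. The helper name resolve_next is hypothetical and only used for illustration:

```python
from urllib.parse import urljoin


def resolve_next(current_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(current_url, href)


# A query-only href keeps the path of the current page.
print(resolve_next("https://nicn.gov.ng/judgement?page=1", "?page=2"))

# An absolute href is returned unchanged.
print(resolve_next("https://nicn.gov.ng/judgement?page=1",
                   "https://nicn.gov.ng/judgement?page=3"))
```

In the loop, this would mean writing url = resolve_next(url, next_link["href"]) instead of assigning the href directly.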

CodePudding user response:

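You are getting that error because part of your lookup returns None instead of a tag. In your snippet, .a is None when the <li> after the active one contains no <a> (which can happen on the last page), so calling .get('href') on it fails.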
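A minimal sketch of guarding each step of that lookup, so a missing element yields None instead of raising. The two HTML samples are hypothetical and are not copied from the real site; next_href is an illustrative helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup for a middle page and a last page.
SAMPLE_MIDDLE_PAGE = """
<ul class="pagination">
  <li class="active"><span>1</span></li>
  <li><a href="?page=2">2</a></li>
</ul>
"""

SAMPLE_LAST_PAGE = """
<ul class="pagination">
  <li><a href="?page=1">1</a></li>
  <li class="active"><span>2</span></li>
  <li class="disabled"><span>&raquo;</span></li>
</ul>
"""


def next_href(html):
    """Return the href of the page after the active one, or None."""
    soup = BeautifulSoup(html, "html.parser")
    pg = soup.find("ul", "pagination")
    if pg is None:
        return None
    current = pg.find("li", "active")
    if current is None:
        return None
    sibling = current.find_next_sibling("li")
    # The sibling may be missing, or may hold a <span> instead of an <a>.
    if sibling is None or sibling.a is None:
        return None
    return sibling.a.get("href")


print(next_href(SAMPLE_MIDDLE_PAGE))
print(next_href(SAMPLE_LAST_PAGE))
```

A caller can then stop paginating as soon as next_href returns None.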

The website has 20 pages, so you can simply change the page number in the URL for each request.

The code below visits every page and prints the links found in each table.

from bs4 import BeautifulSoup
import requests

for page in range(1, 21):
    url = 'https://nicn.gov.ng/judgement?page={}'.format(page)
    web_info = requests.get(url).text
    soup = BeautifulSoup(web_info, 'lxml')

    # find the table body on the page
    table = soup.find('tbody')
    if table is None:
        continue  # no table on this page, skip it

    # collect all the rows
    rows = table.find_all('tr')

    # examine each row for a link; skip rows without one
    for row in rows:
        anchor = row.find('a')
        if anchor is not None and anchor.has_attr('href'):
            print(anchor.attrs['href'])

CodePudding user response:

I don't think you even need bs4 here, or even requests; I only used requests to persist the same session.

import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0'
}


def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        params = {
            'page': 1
        }
        allin = []
        while params['page']:
            r = req.get(url, params=params)
            df = pd.read_html(r.content, attrs={'id': 'mytable'})[0]
            if 'next' in r.text:
                # a "next" link is present, so move on to the following page
                params['page'] += 1
                allin.append(df)
                continue
            params['page'] = False
        final = pd.concat(allin, ignore_index=True)
        print(final)


if __name__ == "__main__":
    main('https://nicn.gov.ng/judgement')

Output:

      S/N                     Suit No  ...       Judgment        Date
0       1            NICN/ABJ/67/2021  ...  view judgment  2021-10-14
1       2            NICN/ABJ/62/2021  ...  view judgment  2021-10-07
2       3          NICN/ABJ/304M/2020  ...  view judgment  2021-10-05
3       4           NICN/ABJ/240/2018  ...  view judgment  2021-07-28
4       5  SUIT NO. NICN/ABJ/185/2018  ...  view judgment  2021-07-14
...   ...                         ...  ...            ...         ...
1895   96           NICN/ABJ/273/2014  ...  view judgment  1970-01-01
1896   97           NICN/ABJ/302/2012  ...  view judgment  1970-01-01
1897   98           NICN/ABJ/340/2013  ...  view judgment  1970-01-01
1898   99           NICN/ABJ/202/2013  ...  view judgment  1970-01-01
1899  100           NICN/ABJ/246/2013  ...  view judgment  1970-01-01

[1900 rows x 8 columns]
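Note that the loop above appends a page only when it finds a "next" marker, so the final page appears to be left out (1900 rows would correspond to 19 full pages of 100 rows). A hedged sketch of the same loop shape that stores each page before deciding whether to stop; fetch_page and fake_fetch are hypothetical stand-ins for the real request and parsing step:

```python
def collect_pages(fetch_page):
    """Collect rows from every page, including the last one.

    fetch_page(n) is assumed to return a tuple (rows, has_next).
    """
    all_rows = []
    page = 1
    while True:
        rows, has_next = fetch_page(page)
        all_rows.extend(rows)  # store the page before checking for "next"
        if not has_next:
            break
        page += 1
    return all_rows


def fake_fetch(page):
    """Hypothetical three-page data source used only for this example."""
    data = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
    return data[page], page < 3


print(collect_pages(fake_fetch))
```

In the answer's code, the equivalent change would be to call allin.append(df) before the 'next' check rather than inside it.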