I am trying to loop through multiple pages of a website I am scraping with BeautifulSoup.
pg = soup.find('ul', 'pagination')
current_pg = pg.find('li', 'active')
next_url = current_pg.findNextSibling('li').a.get('href')
Any ideas on how to solve the AttributeError: 'NoneType' object has no attribute 'get'?
CodePudding user response:
To get to the next page, parse the href= attribute of the <a rel="next"> tag. If that <a> doesn't exist, exit the loop:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://nicn.gov.ng/judgement?page=1"

while True:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # parse the table and print some data:
    df = pd.read_html(str(soup))[0]
    print(url)
    print(df.tail())
    print("-" * 80)

    # the rel="next" link carries the URL of the following page
    next_link = soup.select_one("a[rel=next]")
    if not next_link:
        break
    url = next_link["href"]
This prints all 20 pages:
...
--------------------------------------------------------------------------------
https://nicn.gov.ng/judgement?page=19
S/N Suit No Case Title Parties Respondents Justice Judgment Date
95 96 NICN/ABJ/273/2014 CHUKUEZI CHINEDU GOODWILL VS SIRAJ NIGERIA LTD CHUKUEZI CHINEDU GOODWILL SIRAJ NIGERIA LTD HON. JUSTICE E.D. E ISELE view judgment 1970-01-01
96 97 NICN/ABJ/302/2012 BASIL ONYEBUCHI OKORO VS NIGERIA NATIONAL PETROLEUM CORPORATION BASIL ONYEBUCHI OKORO NIGERIA NATIONAL PETROLEUM CORPORATION HON. JUSTICE E.D. E ISELE view judgment 1970-01-01
97 98 NICN/ABJ/340/2013 CHINEKWU NNENNA UDOKWU VS ZENITH BANK PLC CHINEKWU NNENNA UDOKWU ZENITH BANK PLC HON. JUSTICE E.D. E ISELE view judgment 1970-01-01
98 99 NICN/ABJ/202/2013 IBRAHIM MUSLIM AYOADE VS NIGERIA BOTTLING COMPANY LTD IBRAHIM MUSLIM AYOADE NIGERIA BOTTLING COMPANY LTD HON. JUSTICE E.D. E ISELE view judgment 1970-01-01
99 100 NICN/ABJ/246/2013 MR. MAHA ISIAKA ABU VS SKYE BANK (FORMERLY MAINSTREET BANK LIMITED) MR. MAHA ISIAKA ABU SKYE BANK (FORMERLY MAINSTREET BANK LIMITED) HON. JUSTICE E.D. E ISELE view judgment 1970-01-01
--------------------------------------------------------------------------------
https://nicn.gov.ng/judgement?page=20
S/N Suit No Case Title Parties Respondents Justice Judgment Date
26 27 NICN/CA/141/2013 ENGR. PATRICK EDET OQUA VS ATTORNEY-GENERAL, CROSS RIVER STATE ENGR. PATRICK EDET OQUA ATTORNEY-GENERAL, CROSS RIVER STATE HONOURABLE JUSTICE E. N. AGBAKOBA view judgment 1970-01-01
27 28 NICN/LA/243/2013 Emmanuel Fagbamila V University of Lagos Emmanuel Fagbamila University of Lagos Hon. Justice P.O Lifu (JP) view judgment 0000-00-00
28 29 NIC/LA/03/2011 Sunday Olufelo VERSUS Schlumberger Anadrill Nigeria Ltd. . Schlumberger Support Nigeria Ltd.Schlumberger Ltd. Sunday Olufelo Schlumberger Anadrill Nigeria Ltd. . Schlumberger Support Nigeria Ltd.Schlumberger Ltd. Hon. Justice B. B. Kanyip - Presiding Judge Hon. Justice O. A. Obaseki-Osaghae view judgment 0000-00-00
29 30 NICN/LA/291/2012 CAPTAIN SOLOMON J. GAMRA V CHANCHANGI AIRLINES (NIG) LTD Captain Solomon J. Gamra Chanchangi Airlines (Nig) Ltd HON. JUSTICE O.A. OBASEKI-OSAGHAE view judgment 0000-00-00
30 31 NICN/CA/75/2013 MR. MATTHEW EBONG UDO V MR. MATTHEW EBONG UDO NATIONAL EXAMINATIONS COUNCIL (NECO) HON.JUSTICE O.A OBASEKI-OSAGHAE view judgment 0000-00-00
--------------------------------------------------------------------------------
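The rel=next check above also explains the error in the question: on the last page the active <li> has no following <li> with an <a>, so .a evaluates to None and .get raises. A minimal guard, sketched with the question's variable names (the markup details are assumed from the question, not verified against the live page):
pg = soup.find('ul', 'pagination')
current_pg = pg.find('li', 'active') if pg else None
next_li = current_pg.find_next_sibling('li') if current_pg else None

# on the last page there is no next <li>/<a>: leave next_url as None so the loop can stop
next_url = next_li.a.get('href') if next_li and next_li.a else None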
CodePudding user response:
You are getting that error because your selector does not match anything on the page you are trying to scrape.
The website you are trying to scrape has 20 pages, so you can simply edit the page number in the link for each request.
The code below visits every page on the website and collects all the links from the tables.
from bs4 import BeautifulSoup
import requests

for x in range(1, 21):
    link = 'https://nicn.gov.ng/judgement?page={}'.format(x)
    web_info = requests.get(link).text
    soup = BeautifulSoup(web_info, 'lxml')

    # finding the table body on the page
    table = soup.find('tbody')

    # collecting all the rows
    rows = table.find_all('tr')

    # now you can examine each row for links
    for row in rows:
        link = row.find('a').attrs['href']
        print(link)
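If a row ever lacks an <a> tag (an assumption about the markup, not something the question confirms), row.find('a') returns None and .attrs raises the same AttributeError as in the question. A defensive version of the inner loop:
for row in rows:
    a = row.find('a')  # may be None for rows without a link
    if a and a.get('href'):
        print(a['href'])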
CodePudding user response:
I don't think you even need bs4 here, or even requests, but the reason I used requests is to persist the same session.
import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0'
}

def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        params = {
            'page': 1
        }
        allin = []
        while params['page']:
            r = req.get(url, params=params)
            df = pd.read_html(r.content, attrs={'id': 'mytable'})[0]
            if 'next' in r.text:
                # a "next" link exists, so queue up the following page
                params['page'] += 1
                allin.append(df)
                continue
            # no "next" link: stop the loop
            params['page'] = False
        final = pd.concat(allin, ignore_index=True)
        print(final)

if __name__ == "__main__":
    main('https://nicn.gov.ng/judgement')
Output:
S/N Suit No ... Judgment Date
0 1 NICN/ABJ/67/2021 ... view judgment 2021-10-14
1 2 NICN/ABJ/62/2021 ... view judgment 2021-10-07
2 3 NICN/ABJ/304M/2020 ... view judgment 2021-10-05
3 4 NICN/ABJ/240/2018 ... view judgment 2021-07-28
4 5 SUIT NO. NICN/ABJ/185/2018 ... view judgment 2021-07-14
... ... ... ... ... ...
1895 96 NICN/ABJ/273/2014 ... view judgment 1970-01-01
1896 97 NICN/ABJ/302/2012 ... view judgment 1970-01-01
1897 98 NICN/ABJ/340/2013 ... view judgment 1970-01-01
1898 99 NICN/ABJ/202/2013 ... view judgment 1970-01-01
1899 100 NICN/ABJ/246/2013 ... view judgment 1970-01-01
[1900 rows x 8 columns]
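If you want to keep the combined table instead of just printing it, a standard pandas call such as final.to_csv works; the file name here is only an example:
final.to_csv('judgements.csv', index=False)  # write every collected page to a single CSV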