Scraping the next page: the next page's URL keeps returning the same page

I start from this page https://www.cnrtl.fr/portailindex/LEXI/TLFI/A and want to scrape all the following pages until the last one is reached.

For each letter A to Z, the URLs of the following pages (as shown in the browser) are https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/<index>, where the index increases by 80 each time. For instance, the first following page is https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80. My first idea was to build the URLs from this rule and fetch them with urllib. However, when I implement this in Python,

res = urllib.request.urlopen(url)
soup = BeautifulSoup(res, "lxml")

it seems that I always stay on the first page https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/.
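
The URL rule described above (a per-letter base URL followed by an index that grows by 80) can be sketched as a small generator. This is only an illustration of the rule as stated in the question; the names `BASE`, `PAGE_SIZE` and `page_urls` are made up for the example, and nothing here fetches a page.

```python
from itertools import count, islice

BASE = "https://www.cnrtl.fr/portailindex/LEXI/TLFI"
PAGE_SIZE = 80  # per the question, the index grows by 80 on each page


def page_urls(letter):
    """Yield the first page for a letter, then .../80, .../160, and so on."""
    yield f"{BASE}/{letter}"
    for index in count(PAGE_SIZE, PAGE_SIZE):
        yield f"{BASE}/{letter}/{index}"


# The generator never ends by itself, so only take the first three URLs here.
first_three = list(islice(page_urls("A"), 3))
print(first_three)
```

Because the generator is endless, the caller still has to decide when a letter is exhausted, for example when a fetched page contains no entries.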

A second idea is to get the next page from the "next page" button. An example of such a button is

<a href="/portailindex/LEXI/TLFI/B/480"><img src="/images/portail/right.gif" title="Page suivante" \
           border="0" width="32" height="32" alt="" />

but all I get from it is the relative path /portailindex/LEXI/TLFI/B/480, and calling urllib.request on this does not move to the next page either.
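
The href on the button is a relative path, so it has to be combined with the site address before it can be fetched; passed as is, `urllib.request.urlopen` rejects it with a `ValueError` because it has no scheme. A minimal sketch using the standard library's `urljoin` follows. The value of `current_page` is an assumed example, not something taken from the site.

```python
from urllib.parse import urljoin

# Assumed example: the page on which the button was found.
current_page = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/B/400"
# The href exactly as it appears on the "Page suivante" button.
href = "/portailindex/LEXI/TLFI/B/480"

# urljoin keeps the scheme and host of the current page and swaps in the new path.
next_url = urljoin(current_page, href)
print(next_url)  # https://www.cnrtl.fr/portailindex/LEXI/TLFI/B/480
```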

So, why does https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80 work in the browser, while urllib.request brings me back to https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/ ?

Is there an elegant way to go from one page to the next until the last page is reached?

CodePudding user response：

Not very sure what's going on, but something like the following worked well for me recently:

Python 3.10.2 on Windows 10. The following code is from the context of a larger program.

from bs4 import BeautifulSoup as Soup
from urllib import request

START = 1
END = 82

BASE_URL = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/*"

def pull(url: str) -> Soup:
    my_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}

    my_request = request.Request(url, headers=my_headers)
    html_text = request.urlopen(my_request).read()

    return Soup(html_text, 'html.parser')

def main():
    for i in range(START, END + 1):
        print(f"\nStarting page {i}...")
        url = BASE_URL.replace("*", str(i))

        soup = pull(url)

Could it be that you need to send headers?

CodePudding user response：

The following seems to do it:

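import re
import string
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

# Assumption: the original snippet never defines `regex`. This pattern guesses
# that entry links point to /definition/<word>; adjust it to the real markup.
regex = re.compile(r"^/definition/")

dictionary = []

def get_words_in_page(url):
    """Append every word linked from the page to `dictionary`; return how many were found."""
    res = urllib.request.urlopen(url)
    soup = BeautifulSoup(res, "lxml")
    words = [w.string for w in soup.find_all("a", {"href": regex})]
    dictionary.extend(words)
    return len(words)

base_url = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/"

for l in string.ascii_lowercase:
    # Build the letter URL from base_url each time instead of appending to base_url
    letter_url = base_url + l.upper()
    get_words_in_page(letter_url)
    next_index = 0
    while True:
        next_index += 80
        url = letter_url + "/" + str(next_index)
        try:
            found = get_words_in_page(url)
        except (ValueError, urllib.error.URLError):
            break
        # Assumption: a page past the end of a letter lists no words
        if not found:
            break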

CodePudding user response：

Just iterate over the letter links and, for each one, follow the href of the <a> that holds the next-page arrow to walk through all of its sub-pages.

In my opinion, this is more generic than the approach that counts up the index numbers.

Example

import time

from bs4 import BeautifulSoup
import requests

baseUrl = 'https://www.cnrtl.fr'
response = requests.get('https://www.cnrtl.fr/portailindex/LEXI/TLFI/A')
soup = BeautifulSoup(response.content, 'html.parser')

data = []

for url in soup.select('table.letterHeader a'):

    while True:
        response = requests.get(baseUrl + url['href'])
        soup = BeautifulSoup(response.content, 'html.parser')

        data.extend([x.text for x in soup.select('table.hometab a')])

        if (a := soup.select_one('a:has(img[title="Page suivante"])')):
            url = a
        else:
            break

        time.sleep(2)

Output

['à', 'à-plat', 'abaissement', 'abas', 'a', 'a-raciste', 'abaisser', 'abasie', 'a b c', 'à-venir', 'abaisseur', 'abasourdir', 'à contre-lumière', 'aalénien', 'abajoue', 'abasourdissant', "à l'envers", 'aaronide', 'abalober', 'abasourdissement', 'à la bonne franquette', 'ab hoc et ab hac', 'abalone', 'abat', 'à muche-pot', 'ab intestat', 'abalourdir', 'abat-chauvée', 'à musse-pot', 'ab irato', 'abalourdissement', 'abat-faim', 'à pic', 'ab ovo', 'abandon', 'abat-feuille', 'à posteriori', 'aba', 'abandonnataire', 'abat-flanc', 'à priori', 'abaca', 'abandonné', 'abat-foin', 'à tire-larigot', 'abaddir', 'abandonnée', 'abat-joue', 'à vau', 'abadie', 'abandonnement', 'abat-jour', 'à vau-de-route', 'abadis', 'abandonnément', 'abat-relui', 'à vau-le-feu', 'abaissable', 'abandonner', 'abat-reluit', 'à-bas', 'abaissant', 'abandonneur', 'abat-son', 'à-compte', 'abaisse', 'abandonneuse', 'abat-vent', 'a-humain', 'abaissé', 'abaque', 'abat-voix', 'a-mi-la', 'abaisse-langue', 'abarticulaire', 'abatage', 'à-pic', 'abaissée', 'abarticulation', 'abâtardi', 'abâtardir', 'abbatial', 'abdominal', 'abécé', 'abâtardissement', 'abbatiale', 'abdominale', 'abécédaire', 'abatée', 'abbatiat', 'abdominien', 'abécédé', 'abatis', 'abbattre', 'abdominienne', 'abéchement', 'abatre', 'abbaye', 'abdomino-coraco-huméral', 'abécher', 'abattable', 'abbé', 'abdomino-coraco-humérale', 'abecquage', 'abattage', 'abbesse', 'abdomino-génital', 'abecquement', 'abattant', 'abbevillien', 'abdomino-génitale', 'abecquer', 'abattée', 'abbevillienne', 'abdomino-guttural', 'abecqueuse', 'abattement', 'abbevillois', 'abdomino-gutturale', 'abée', 'abatteur', 'abbevilloise', 'abdomino-huméral', 'abeillage', 'abatteuse', 'abcéder', 'abdomino-humérale', 'abeille', 'abattis', 'abcès', 'abdomino-périnéal', 'abeillé', 'abattoir', 'abdalas', 'abdomino-scrotal', 'abeiller', 'abattre', 'abdéritain', 'abdomino-thoracique', 'abeillier', 'abattu', 'abdéritaine', 'abdomino-utérotomie', 'abeillon', 'abattue', 'abdicataire', 
'abdominoscopie', 'abélien', 'abatture', 'abdication', 'abdominoscopique', 'abéquage', 'abax', 'abdiquer', 'abducteur', 'abéquer', 'abbadie', 'abdomen', 'abduction', 'abéqueuse', 'aber', 'abiétine', 'abjurer', 'aboi', 'aberrance', 'abiétiné', 'ablatif', 'aboiement', 'aberrant', 'abiétinée', 'ablation', 'aboilage', 'aberration', 'abiétique', 'ablativo', 'abolir', 'aberrer', 'abigaïl', 'able', 'abolissable', 'aberrographe', 'abigéat', 'ablégat', 'abolissement', 'aberroscope', 'abigotir', 'ablégation', 'abolitif', 'abessif', 'abîme', 'abléphare', 'abolition', 'abêtifier', 'abîmé', 'ablépharie', 'abolitionnisme', 'abêtir', 'abîmement', 'ablépharoplastique', 'abolitionniste', 'abêtissant', 'abîmer', 'ableret', 'aboma', 'abêtissement', 'abiogenèse', 'ablet', 'abominable', 'abêtissoir', 'abiose', 'ablette', 'abominablement', 'abhorrable', 'abiotique', 'ablier', 'abomination', 'abhorré', 'abject', 'abluant', 'abominer', 'abhorrer', 'abjectement', 'abluante', 'abondamment', 'abicher', 'abjection', 'abluer', 'abondance', 'abies', 'abjurateur', 'ablution', 'abondant', 'abiétacée', 'abjuration', 'ablutionner', 'abonder', 'abiétin', 'abjuratoire', 'abnégation', 'abonnable',...]
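
To try the "follow the arrow" lookup without hitting the site, the same idea can be sketched with only the standard library on a saved piece of HTML. This swaps the BeautifulSoup selector for a small `html.parser` subclass; `NextLinkFinder`, `find_next_url` and the sample page URL are hypothetical names and values chosen for the example, and the sample markup is the button quoted in the question with a closing </a> added.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remember the href of the <a> that wraps the 'Page suivante' image."""

    def __init__(self):
        super().__init__()
        self._open_href = None  # href of the <a> currently being parsed
        self.next_href = None   # stays None when the page has no arrow

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._open_href = attrs.get("href")
        elif tag == "img" and attrs.get("title") == "Page suivante":
            self.next_href = self._open_href

    def handle_endtag(self, tag):
        if tag == "a":
            self._open_href = None


def find_next_url(html, page_url):
    """Return the absolute URL of the next page, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    return urljoin(page_url, finder.next_href)


sample = (
    '<a href="/portailindex/LEXI/TLFI/B/480">'
    '<img src="/images/portail/right.gif" title="Page suivante" '
    'border="0" width="32" height="32" alt="" /></a>'
)
page = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/B/400"  # assumed current page
print(find_next_url(sample, page))
print(find_next_url("<p>no arrow here</p>", page))
```

Returning `None` when no arrow is present gives the crawl loop a natural stopping condition, which is the role the `else: break` branch plays in the example above.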
