Scraping words from online dictionary : while/loop issue

Time:11-18

I'm facing an issue while trying to scrape all the words from an online dictionary in order to later get their definitions. I'm scraping with BeautifulSoup, and I think there is an issue in my while and for loops.

As you can see in my code below, I have two variables in the url I scrape: one for the letters of the alphabet, and a second one for the number of pages, needed to get all the words for one letter.

import requests
from bs4 import BeautifulSoup

def get_data():
    page = 1
    letters = ['A', 'B', 'C']
    all_words = []

    for letter in letters:
        while page != 100:
            url = f"https://dictionnaire.lerobert.com/explore/def/{letter}/{page}"
            soup = BeautifulSoup(requests.get(url=url).text, 'html.parser')
            data = soup.find(class_='l-l')
            for word in data.find_all('a'):
                all_words.append(word['href'])
            page = page + 1

    print(all_words)
    print(len(all_words))

With this code it only takes the letter A into consideration. So I tried putting the while before the for loop, and I do get a mix of A, B and C words, but only a few dozen more words in total, so the count doesn't match up at all.

Do you guys have an idea about this? I'm surely missing something in the while and for loop logic, but I don't know what (I'm a bit new to coding, to be honest).

Thanks a lot, Btv-

CodePudding user response:

You are not resetting page to 1.

After getting the words for letter A, the value of page is 100. On the next iteration, when the letter is B, page is still 100, so the body of the while loop never executes.

for letter in letters:
    page = 1   # Resetting the page to 1 for each letter
    while page != 100:
        url = f"https://dictionnaire.lerobert.com/explore/def/{letter}/{page}"
        soup = BeautifulSoup(requests.get(url=url).text, 'html.parser')
        data = soup.find(class_='l-l')
        for word in data.find_all('a'):
            all_words.append(word['href'])
        page = page + 1
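An alternative that avoids the reset bug entirely is to use a for loop over range() for the page counter, since it restarts from 1 on every letter automatically. Here is a minimal sketch of that restructuring that only builds the page URLs (the requests/BeautifulSoup fetching is omitted so the loop logic stands on its own); `build_urls` is a hypothetical helper name, and the 99-page range mirrors the original `while page != 100`:

```python
def build_urls(letters, max_page=100):
    """Build every page URL for each letter.

    `for page in range(1, max_page)` replaces the while loop: the
    counter restarts at 1 for each letter, so no manual reset is needed.
    """
    urls = []
    for letter in letters:
        for page in range(1, max_page):  # pages 1..99, like `while page != 100`
            urls.append(f"https://dictionnaire.lerobert.com/explore/def/{letter}/{page}")
    return urls

urls = build_urls(['A', 'B', 'C'])
print(len(urls))  # 3 letters x 99 pages = 297
```

Each URL would then be fetched and parsed inside the inner loop exactly as in the corrected code above.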