beautifulsoup: how to scrape multiple urls that end differently

Time:01-04

I want to scrape this dictionary for its different verbs. Each verb's page lives at 'https://www.spanishdict.com/conjugate/' plus the verb. For example, for the verb 'hacer' we have: https://www.spanishdict.com/conjugate/hacer

I would like to collect all the links that contain the conjugation of each verb and return them as a list of strings. So I did the following:

import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/' 

for i in url:
    reqs = requests.get(url + str())
    soup = BeautifulSoup(reqs.text, 'html.parser')

    urls = []
    for link in soup.find_all('a'):
        urls.append(link.get('href'))

    print(urls)

But I only get a few empty lists when I print urls.

expected output sample:

['https://www.spanishdict.com/conjugate/hacer', 'https://www.spanishdict.com/conjugate/tener',...etc]

CodePudding user response:

You are iterating through a string when you loop through `url`. Look at this code:

url = 'https://www.spanishdict.com/conjugate/' 

for i in url:
    print(i)

This produces every letter of the URL:

h
t
t
p
s
:
/
/
w
w
w
<truncated>
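
For contrast, looping over a list yields one whole string per iteration, which is presumably what the loop in the question was meant to do. A minimal sketch, assuming a hand-written `verbs` list (hypothetical, not taken from the site):

```python
from urllib.parse import quote

BASE_URL = 'https://www.spanishdict.com/conjugate/'

# Hypothetical hand-picked verbs; nothing is fetched from the site here.
verbs = ['hacer', 'tener', 'reír']

# Each iteration sees a whole verb, not a single character.
# quote() percent-encodes non-ASCII characters such as the accented í.
urls = [BASE_URL + quote(verb) for verb in verbs]
print(urls)
```

This only builds the URL strings; whether a given page exists still has to be checked with a request.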

You are also doing something wrong here:

reqs = requests.get(url + str())

I am not sure what you are trying to do, but `url + str()` is just the URL plus an empty string, which is the URL.

If you remove the for loop and the unnecessary empty string, you get what I think you are trying to get:

import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/' 

reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))

print(urls)

This produces:

['/', '/learn', '/translation', '/conjugation', '/vocabulary', '#', '/translation', '/conjugation', '/vocabulary', '/guide', '/pronunciation', '/wordoftheday', '/learn', '/guide/spanish-present-tense-forms', '/guide/spanish-present-progressive-forms', '/guide/spanish-preterite-tense-forms', '/guide/spanish-imperfect-tense-forms', '/guide/simple-future-regular-forms-and-tenses', '/guide/spanish-present-subjunctive', '/guide/commands', '/guide/spanish-imperfect-subjunctive', '/guide', '/drill?drill_start_source=conjugation hubpage', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_campaign=adhesion', '/wordoftheday', '/translate/patinar', '/', 'https://www.ingles.com/verbos', 'https://www.curiositymedia.com/', 'https://help.spanishdict.com/', '/company/privacy', '/company/tos', '/sitemap', '/', 'https://www.ingles.com/verbos', '/translation', '/conjugation', '/vocabulary', '/learn', '/guide', '/wordoftheday', 'https://www.curiositymedia.com/', '/company/privacy', '/company/tos', '/sitemap', 'https://help.spanishdict.com/', 'https://help.spanishdict.com/contact', 'https://www.facebook.com/pages/SpanishDict/92805940179', 'https://twitter.com/spanishdict', 'https://www.instagram.com/spanishdict/', 'https://itunes.apple.com/us/app/spanishdict/id332510494', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_source=sd-footer']

Is this list of links what you were aiming for?
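
Note that most of these hrefs are relative paths. If absolute URLs are needed, one option is `urljoin` from the standard library. A small sketch, using a few hrefs copied from the output above:

```python
from urllib.parse import urljoin

base = 'https://www.spanishdict.com/conjugate/'

# A few hrefs copied from the printed list.
hrefs = ['/learn', '/translate/patinar', 'https://help.spanishdict.com/']

# urljoin resolves each href against the page URL;
# hrefs that are already absolute pass through unchanged.
absolute = [urljoin(base, href) for href in hrefs]
print(absolute)
```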

CodePudding user response:

To get your expected output you need a list of verbs. Since your question does not provide a source for one, a good starting point is the list verbs-top-500, combined with a list comprehension.

For every <a> whose href contains translate, it concatenates your URL with the verb, which is the text of the first <div> inside that <a>:

['https://www.spanishdict.com/conjugate/' + a.div.text for a in soup.select('a[href*="translate"]')]
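
To see what the selector and `a.div.text` do in isolation, here is a minimal sketch on inline HTML. The markup is hypothetical and only assumed to resemble the list page; the real HTML may differ. It also skips links that have no <div>, since `a.div` is `None` in that case:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble the list page.
html = '''
<a href="/translate/hacer"><div>hacer</div></a>
<a href="/translate/tener"><div>tener</div></a>
<a href="/translate/ser">ser</a>
<a href="/guide">Guide</a>
'''

soup = BeautifulSoup(html, 'html.parser')
base = 'https://www.spanishdict.com/conjugate/'

# href*="translate/" matches any <a> whose href contains that substring.
# a.div is None when the <a> contains no <div>, so those anchors are skipped.
urls = [base + a.div.text for a in soup.select('a[href*="translate/"]') if a.div]
print(urls)
```

The `if a.div` guard avoids an `AttributeError` on links that match the selector but carry their text directly.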

Example

import requests
from bs4 import BeautifulSoup

url = 'https://www.spanishdict.com/lists/1690101/verbs-top-500'
# Send a browser-like User-Agent with the request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')  # the 'lxml' parser requires the lxml package

# Build one conjugation URL per verb link found on the list page.
urls = ['https://www.spanishdict.com/conjugate/' + a.div.text for a in soup.select('a[href*="translate/"]')]

Output

['https://www.spanishdict.com/conjugate/procurar', 'https://www.spanishdict.com/conjugate/podar', 'https://www.spanishdict.com/conjugate/pillar', 'https://www.spanishdict.com/conjugate/perrear', 'https://www.spanishdict.com/conjugate/perfeccionar', 'https://www.spanishdict.com/conjugate/perdonar', 'https://www.spanishdict.com/conjugate/pegar', 'https://www.spanishdict.com/conjugate/pasear', 'https://www.spanishdict.com/conjugate/ordenar', 'https://www.spanishdict.com/conjugate/ondear', 'https://www.spanishdict.com/conjugate/ojalar', 'https://www.spanishdict.com/conjugate/ocultar', 'https://www.spanishdict.com/conjugate/nombrar',...]