How to get final links from find_all('a') as a list?

Time:10-23

import requests
import re
from bs4 import BeautifulSoup
respond = requests.get("http://www.kulugyminiszterium.hu/dtwebe/Irodak.aspx")
print(respond)

soup = BeautifulSoup(respond.text, 'html.parser')

for link in soup.find_all('a'):
    links = link.get('href')

    linki_bloc = ('http://www.kulugyminiszterium.hu/dtwebe/' + links).replace(' ', ' ')
    print(linki_bloc)

value = linki_bloc
print(value.split())

I am trying to collect the results of find_all('a') into a list, but the only thing I end up with is the last link.

I suspect the problem is that the links are separated by newline characters (\n). I tried many ways to get rid of the newline character but failed. Saving to a file (e.g. .txt) fails in the same way: only the last link is saved.

CodePudding user response:

You are close to your goal, but you overwrite the result with each iteration. Collect the manipulated links in a list instead, either directly with a list comprehension:

['http://www.kulugyminiszterium.hu/dtwebe/' + link.get('href').replace(' ', ' ') for link in soup.find_all('a')]

or as in your example:

links = []
for link in soup.find_all('a'):
    links.append('http://www.kulugyminiszterium.hu/dtwebe/' + link.get('href').replace(' ', ' '))

Example

import requests
from bs4 import BeautifulSoup
respond = requests.get("http://www.kulugyminiszterium.hu/dtwebe/Irodak.aspx")

soup = BeautifulSoup(respond.text, 'html.parser')

links = []
for link in soup.find_all('a'):
    links.append('http://www.kulugyminiszterium.hu/dtwebe/' + link.get('href').replace(' ', ' '))

print(links)
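
To make the difference between overwriting and appending concrete, here is a minimal offline sketch. It does not touch the network: the `sample_hrefs` list is a made-up stand-in for what `link.get('href')` would return for each anchor, and the helper names `last_only` and `collect_all` are hypothetical. The `None` entry models an anchor without an href attribute, for which `link.get('href')` returns `None`.

```python
BASE = 'http://www.kulugyminiszterium.hu/dtwebe/'

# Hypothetical href values standing in for link.get('href') results.
# None models an <a> tag that has no href attribute.
sample_hrefs = [
    'reszletes.aspx?Orszag=Barbados',
    None,
    'reszletes.aspx?Orszag=Canada',
]


def last_only(hrefs):
    """Mimic the question's loop: the same name is rebound on every pass."""
    result = None
    for href in hrefs:
        if href is None:
            continue
        result = BASE + href  # overwrites the previous value
    return result


def collect_all(hrefs):
    """Append each built URL to a list so nothing is lost."""
    links = []
    for href in hrefs:
        if href is None:  # skip anchors without an href
            continue
        links.append(BASE + href)
    return links


print(last_only(sample_hrefs))
print(collect_all(sample_hrefs))
```

The first call yields a single string (the last link), while the second yields a list with every link. With BeautifulSoup, passing `href=True` to `find_all('a', href=True)` is one way to skip anchors that have no href, so the `None` check is not needed.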

CodePudding user response:

Assuming you just want a list of the href values:

import requests
from bs4 import BeautifulSoup as BS
from urllib.parse import urljoin

BASE = 'http://www.kulugyminiszterium.hu/dtwebe/'

(r := requests.get(urljoin(BASE, 'Irodak.aspx'))).raise_for_status()

soup = BS(r.text, 'lxml')
hrefs = []

for a in soup.find_all('a'):
    hrefs.append(urljoin(BASE, a['href']).replace(' ', ' '))
    
print(*hrefs, sep='\n')

(partial) Output:

http://www.kulugyminiszterium.hu/dtwebe/reszletes.aspx?Orszag=Barbados
http://www.kulugyminiszterium.hu/dtwebe/reszletes.aspx?Orszag=Bolivarian Republic of Venezuela
http://www.kulugyminiszterium.hu/dtwebe/reszletes.aspx?Orszag=Bosnia and Herzegovina
http://www.kulugyminiszterium.hu/dtwebe/reszletes.aspx?Orszag=Canada
http://www.kulugyminiszterium.hu/dtwebe/reszletes.aspx?Orszag=Commonwealth of Australia
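
The question also mentions that saving to a file kept only the last link. Once the links are in a list, they can be written in one go, one per line. The following is a sketch under assumptions rather than part of either answer: the file name `links.txt` and the helpers `save_links` and `load_links` are hypothetical, and the two hrefs are taken from the partial output above instead of being scraped.

```python
import tempfile
from pathlib import Path
from urllib.parse import urljoin

BASE = 'http://www.kulugyminiszterium.hu/dtwebe/'

# Two hrefs from the partial output, standing in for the scraped list.
hrefs = [urljoin(BASE, h) for h in ('reszletes.aspx?Orszag=Barbados',
                                    'reszletes.aspx?Orszag=Canada')]


def save_links(links, path):
    """Write every link on its own line in a single call."""
    Path(path).write_text('\n'.join(links) + '\n', encoding='utf-8')


def load_links(path):
    """Read the file back into a list, dropping the line endings."""
    return Path(path).read_text(encoding='utf-8').splitlines()


# links.txt is a hypothetical file name; a temporary directory keeps the demo tidy.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'links.txt'
    save_links(hrefs, target)
    restored = load_links(target)

print(restored == hrefs)
```

Writing the whole list after the loop avoids the trap of reopening the file in write mode on every iteration, which would leave only the last link in the file.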