Scraping issue with id_tag

Time:10-03

I'm trying to extract data from a website with BeautifulSoup. I'm currently stuck on this markup:
"Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a>"
I want to get the names of the translators, but the href of each link contains the translator's id.

My code is:

translater = soup.find_all("a", href="/searchinternet/advanced?all_authors_id=")

I tried using str.startswith, but it doesn't work. Can someone help me, please?

CodePudding user response:

Provided your HTML is correct and static (that is, not loaded by JavaScript after the initial page load), this is one way to select that link or those links:

from bs4 import BeautifulSoup as bs

html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''

soup = bs(html, 'html.parser')
a = soup.select('a[href^="/searchinternet/advanced?all_authors_id="]')
print(a[0])
print(a[0].get_text(strip=True))
print(a[0].get('href'))

Result in terminal:

<a href="/searchinternet/advanced?all_authors_id=35534&amp;SearchAction=1">Camille Fabien </a>
Camille Fabien
/searchinternet/advanced?all_authors_id=35534&SearchAction=1
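
The question mentions trying str.startswith with find_all. A plain string passed as href= is compared against the whole attribute value, which is why the original call returns nothing. As a minimal sketch (reusing the HTML snippet above; the variable names prefix, by_function and by_regex are only illustrative), find_all also accepts a function or a compiled regular expression as the attribute filter:

```python
import re
from bs4 import BeautifulSoup

html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''
soup = BeautifulSoup(html, "html.parser")
prefix = "/searchinternet/advanced?all_authors_id="

# A callable filter receives each tag's href value (None when the attribute is missing).
by_function = soup.find_all(
    "a", href=lambda value: value is not None and value.startswith(prefix)
)

# A compiled pattern is applied with search(), so it is anchored with ^ here.
by_regex = soup.find_all("a", href=re.compile("^" + re.escape(prefix)))

names = [a.get_text(strip=True) for a in by_function]
print(names)
```

Both filters should select the same link as the CSS selector a[href^="..."] used above; which one to use is mostly a matter of taste.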

EDIT: Who doesn't like a challenge? Based on further comments from the OP, here is a way of obtaining the titles, authors, translators and illustrators from that page, allowing for one or more translators and one or more illustrators per book:

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
    }

url = 'https://www.gallimard.fr/searchinternet/advanced/(editor_brand_id)/1/(fserie)/FOLIO-JUNIOR LIVRE HEROS::Folio Junior - Un Livre dont Vous êtes le Héros @ DEFIS FANTASTIQ::Série Défis Fantastiques/(limit)/3?date[from]=1980-01-01&date[to]=1995-01-01&SearchAction=OK'

big_list = []
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
items = soup.select('div[] > table div[]')
print()
for i in items:
    title = i.select_one('div[] h3')
    author = i.select_one('div[] a')
    history = i.select_one('p[]')
    translators = [[y.get_text() for y in x.find_previous_siblings('a')] for x in history.contents  if "Illustrations" in x]
    illustrators = [[y.get_text() for y in x.find_next_siblings('a')] for x in history.contents  if "Illustrations" in x]
    big_list.append((title.text.strip(), author.text.strip(), ', '.join([x for y in translators for x in y]), ', '.join([x for y in illustrators for x in y])))
df = pd.DataFrame(big_list, columns = ['Title', 'Author', 'Translator(s)', 'Illustrator(s)'])
print(df)

Result in terminal:

   | Title | Author | Translator(s) | Illustrator(s)
0 | Le Sépulcre des Ombres | Jonathan Green | Noël Chassériau | Alan Langford
1 | La Légende de Zagor | Ian Livingstone | Pascale Houssin | Martin McKenna
2 | Les Mages de Solani | Keith Martin | Noël Chassériau | Russ Nicholson
3 | Le Siège de Sardath | Keith P. Phillips | Yannick Surcouf | Pete Knifton
4 | Retour à la Montagne de Feu | Ian Livingstone | Yannick Surcouf | Martin McKenna
5 | Les Mondes de l'Aleph | Peter Darvill-Evans | Yannick Surcouf | Tony Hough
6 | Les Mercenaires du Levant | Paul Mason | Mona de Pracontal | Terry Oakes
7 | L'Arpenteur de la Lune | Stephen Hand | Pierre de Laubier | Martin McKenna, Terry Oakes
8 | La Tour de la Destruction | Keith Martin | Mona de Pracontal | Pete Knifton
9 | La Légende des Guerriers Fantômes | Stephen Hand | Alexis Galmot | Martin McKenna
10 | Le Repaire des Morts-Vivants | Dave Morris | Nicolas Grenier | David Gallagher
11 | L'Ancienne Prophétie | Paul Mason | Mona de Pracontal | Terry Oakes
12 | La Vengeance des Démons | Jim Bambra | Mona de Pracontal | Martin McKenna
13 | Le Sceptre Noir | Keith Martin | Camille Fabien | David Gallagher
14 | La Nuit des Mutants | Peter Darvill-Evans | Anne Collas | Alan Langford
15 | L'Élu des Six Clans | Luke Sharp | Noël Chassériau | Martin Mac Kenna, Martin McKenna
16 | Le Volcan de Zamarra | Luke Sharp | Olivier Meyer | David Gallagher
17 | Les Sombres Cohortes | Ian Livingstone | Noël Chassériau | Nik William
18 | Le Vampire du Château Noir | Keith Martin | Mona de Pracontal | Martin McKenna
19 | Le Voleur d'Âmes | Keith Martin | Mona de Pracontal | Russ Nicholson
20 | Le Justicier de l'Univers | Martin Allen | Mona de Pracontal | Tim Sell
21 | Les Esclaves de l'Eternité | Paul Mason | Sylvie Bonnet | Bob Harvey
22 | La Créature venue du Chaos | Steve Jackson | Noël Chassériau | Alan Langford
23 | Les Rôdeurs de la Nuit | Graeme Davis | Nicolas Grenier | John Sibbick
24 | L'Empire des Hommes-Lézards | Marc Gascoigne | Jean Lacroix | David Gallagher
25 | Les Gouffres de la Cruauté | Luke Sharp | Sylvie Bonnet | Russ Nicholson
26 | Les Spectres de l'Angoisse | Robin Waterfield | Mona de Pracontal | Ian Miller
27 | Le Chasseur des Étoiles | Luke Sharp | Arnaud Dupin de Beyssat | Cary Mayes, Gary Mayes
28 | Les Sceaux de la Destruction | Robin Waterfield | Sylvie Bonnet | Russ Nicholson
29 | La Crypte du Sorcier | Ian Livingstone | Noël Chassériau | John Sibbick
30 | La Forteresse du Cauchemar | Peter Darvill-Evans | Mona de Pracontal | Dave Carson
31 | La Grande Menace des Robots | Steve Jackson | Danielle Plociennik | Gary Mayes
32 | L'Épée du Samouraï | Mark Smith | Pascale Jusforgues | Alan Langford
33 | L'Épreuve des Champions | Ian Livingstone | Alain Vaulont, Pascale Jusforgues | Brian Williams
34 | Défis Sanglants sur l'Océan | Andrew Chapman | Jean Walter | Bob Harvey
35 | Les Démons des Profondeurs | Steve Jackson | Noël Chassériau | Bob Harvey
36 | Rendez-vous avec la M.O.R.T. | Steve Jackson | Arnaud Dupin de Beyssat | Declan Considine
37 | La Planète Rebelle | Robin Waterfield | C. Degolf | Gary Mayes
38 | Les Trafiquants de Kelter | Andrew Chapman | Anne Blanchet | Nik Spender
39 | Le Combattant de l'Autoroute | Ian Livingstone | Alain Vaulont, Pascale Jusforgues | Kevin Bulmer
40 | Le Mercenaire de l'Espace | Andrew Chapman | Jean Walthers | Geoffroy Senior
41 | Le Temple de la Terreur | Ian Livingstone | Denise May | Bill Houston
42 | Le Manoir de l'Enfer | Steve Jackson | |
43 | Le Marais aux Scorpions | Steve Jackson | Camille Fabien | Duncan Smith
44 | Le Talisman de la Mort | Steve Jackson | Camille Fabien | Bob Harvey
45 | La Sorcière des Neiges | Ian Livingstone | Michel Zénon | Edward Crosby, Gary Ward
46 | La Citadelle du Chaos | Steve Jackson | Marie-Raymond Farré | Russ Nicholson
47 | La Galaxie Tragique | Steve Jackson | Camille Fabien | Peter Jones
48 | La Forêt de la Malédiction | Ian Livingstone | Camille Fabien | Malcolm Barter
49 | La Cité des Voleurs | Ian Livingstone | Henri Robillot | Iain McCaig
50 | Le Labyrinthe de la Mort | Ian Livingstone | Patricia Marais | Iain McCaig
51 | L'Île du Roi Lézard | Ian Livingstone | Fabienne Vimereu | Alan Langford
52 | Le Sorcier de la Montagne de Feu | Steve Jackson | Camille Fabien | Russ Nicholson
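
The translator/illustrator split in the code above hinges on the text node that contains "Illustrations": links before it are treated as translators, links after it as illustrators. The sketch below isolates that idea on a hypothetical paragraph modelled on the snippet from the question (the real page markup may differ, and the names html, pivot, translators and illustrators are only illustrative). Note that find_previous_siblings walks backwards, so it returns the nearest link first.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup; the real page may be structured differently.
html = (
    "<p>Trad. de l'anglais par <a href='#'>Alain Vaulont</a>, "
    "<a href='#'>Pascale Jusforgues</a>. "
    "Illustrations de <a href='#'>Brian Williams</a></p>"
)
p = BeautifulSoup(html, "html.parser").p

# The pivot is the bare text node that mentions "Illustrations".
pivot = next(
    node for node in p.contents
    if isinstance(node, NavigableString) and "Illustrations" in node
)

# Links before the pivot (nearest first) and links after it (document order).
translators = [a.get_text(strip=True) for a in pivot.find_previous_siblings("a")]
illustrators = [a.get_text(strip=True) for a in pivot.find_next_siblings("a")]

print(translators)
print(illustrators)
```

If the translators should appear in page order, wrapping the first result in reversed() (or sorting it) is one option.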

Bear in mind that this method fails for Le Manoir de l'Enfer, because the word 'Illustrations' is not found in that entry's text, so both the translator and illustrator columns come out empty. Handling that case is left to the OP.
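
One possible way to cope with entries that lack the 'Illustrations' label is to walk the paragraph's children in order and assign each link to whichever label was seen last, instead of pivoting on a single keyword. This is only a sketch under assumptions: the helper split_credits, the "Trad." label check and the sample markup are hypothetical, and the actual markup of that entry has not been checked here.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def split_credits(paragraph):
    """Group the <a> texts in `paragraph` by the label text that precedes them."""
    credits = {"translators": [], "illustrators": []}
    role = None  # no label seen yet
    for node in paragraph.contents:
        if isinstance(node, NavigableString):
            # A text node may switch the current role.
            if "Trad." in node:
                role = "translators"
            elif "Illustrations" in node:
                role = "illustrators"
        elif isinstance(node, Tag) and node.name == "a" and role is not None:
            credits[role].append(node.get_text(strip=True))
    return credits


# Hypothetical markup: one entry with both labels, one with a translator only.
with_both = BeautifulSoup(
    """<p>Trad. de l'anglais par <a href="#">Camille Fabien</a>. """
    """Illustrations de <a href="#">Russ Nicholson</a></p>""",
    "html.parser",
).p
only_translator = BeautifulSoup(
    """<p>Trad. de l'anglais par <a href="#">Camille Fabien</a></p>""",
    "html.parser",
).p

print(split_credits(with_both))
print(split_credits(only_translator))
```

Because each role is tracked separately, a missing label only leaves its own list empty rather than emptying both.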

BeautifulSoup documentation can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html

Also, Pandas docs can be found here: https://pandas.pydata.org/pandas-docs/stable/index.html

CodePudding user response:

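from bs4 import BeautifulSoup

# Parse a local copy of the page; this returns a BeautifulSoup object, not a list
with open("./test.html", "r") as f:
    soup = BeautifulSoup(f, 'html.parser')

# Collect the text of every <a> element
names = []

for elem in soup.find_all("a"):
    names.append(elem.get_text(strip=True))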