I'm trying to extract data from a website with BeautifulSoup.
I'm currently stuck on this piece of markup:
<a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a>, which appears right after the text "Trad. de l'anglais par".
I want to get the translators' names, but the href of each link contains the translator's ID, so its value is different for every translator.
My code is:
translater = soup.find_all("a", href="/searchinternet/advanced?all_authors_id=")
I also tried using str.startswith, but it doesn't work. Can someone help me, please?
CodePudding user response:
Provided your HTML is correct and static (that is, not loaded with JavaScript after the initial page load), this is one way to select that link, or those links:
from bs4 import BeautifulSoup as bs
html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''
soup = bs(html, 'html.parser')
a = soup.select('a[href^="/searchinternet/advanced?all_authors_id="]')
print(a[0])
print(a[0].get_text(strip=True))
print(a[0].get('href'))
Result in terminal:
<a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a>
Camille Fabien
/searchinternet/advanced?all_authors_id=35534&SearchAction=1
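The attempt in the question finds nothing because a plain string passed as `href` to `find_all` has to match the whole attribute value, not just its beginning. As a rough sketch of how the `str.startswith` idea could still be made to work, `find_all` also accepts a function or a compiled regular expression as the attribute filter. The names `prefix` and `is_author_link` are made up for this illustration:

```python
import re
from bs4 import BeautifulSoup

html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''
soup = BeautifulSoup(html, "html.parser")

# Hypothetical name for the common start of every translator link
prefix = "/searchinternet/advanced?all_authors_id="

def is_author_link(href):
    # href is None for <a> tags that have no href attribute, so guard against it
    return href is not None and href.startswith(prefix)

# Option 1: a function as the attribute filter
by_function = soup.find_all("a", href=is_author_link)

# Option 2: a regular expression anchored at the start of the value
by_regex = soup.find_all("a", href=re.compile("^" + re.escape(prefix)))

names = [a.get_text(strip=True) for a in by_function]
print(names)
```

Both options should select the same links as the CSS selector `a[href^="..."]` used above; which one to use is mostly a matter of taste.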
EDIT: Who doesn't like a challenge? Based on further comments made by the OP, here is a way of obtaining titles, authors, translators and illustrators from that page, allowing for one or more translators and one or more illustrators per book:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
url = 'https://www.gallimard.fr/searchinternet/advanced/(editor_brand_id)/1/(fserie)/FOLIO-JUNIOR LIVRE HEROS::Folio Junior - Un Livre dont Vous êtes le Héros @ DEFIS FANTASTIQ::Série Défis Fantastiques/(limit)/3?date[from]=1980-01-01&date[to]=1995-01-01&SearchAction=OK'
big_list = []
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
# NOTE: the attribute parts of the selectors below (inside the []) are missing
# from this post; they must be filled in with the page's real attributes before running.
items = soup.select('div[] > table div[]')
print()
for i in items:
    title = i.select_one('div[] h3')
    author = i.select_one('div[] a')
    history = i.select_one('p[]')
    # Links before the text node containing "Illustrations" are translators,
    # links after it are illustrators
    translators = [[y.get_text() for y in x.find_previous_siblings('a')] for x in history.contents if "Illustrations" in x]
    illustrators = [[y.get_text() for y in x.find_next_siblings('a')] for x in history.contents if "Illustrations" in x]
    # Flatten the nested lists and join the names with commas
    big_list.append((title.text.strip(), author.text.strip(), ', '.join([x for y in translators for x in y]), ', '.join([x for y in illustrators for x in y])))
df = pd.DataFrame(big_list, columns = ['Title', 'Author', 'Translator(s)', 'Illustrator(s)'])
print(df)
Result in terminal:
| | Title | Author | Translator(s) | Illustrator(s) |
---|---|---|---|---|
0 | Le Sépulcre des Ombres | Jonathan Green | Noël Chassériau | Alan Langford |
1 | La Légende de Zagor | Ian Livingstone | Pascale Houssin | Martin McKenna |
2 | Les Mages de Solani | Keith Martin | Noël Chassériau | Russ Nicholson |
3 | Le Siège de Sardath | Keith P. Phillips | Yannick Surcouf | Pete Knifton |
4 | Retour à la Montagne de Feu | Ian Livingstone | Yannick Surcouf | Martin McKenna |
5 | Les Mondes de l'Aleph | Peter Darvill-Evans | Yannick Surcouf | Tony Hough |
6 | Les Mercenaires du Levant | Paul Mason | Mona de Pracontal | Terry Oakes |
7 | L'Arpenteur de la Lune | Stephen Hand | Pierre de Laubier | Martin McKenna, Terry Oakes |
8 | La Tour de la Destruction | Keith Martin | Mona de Pracontal | Pete Knifton |
9 | La Légende des Guerriers Fantômes | Stephen Hand | Alexis Galmot | Martin McKenna |
10 | Le Repaire des Morts-Vivants | Dave Morris | Nicolas Grenier | David Gallagher |
11 | L'Ancienne Prophétie | Paul Mason | Mona de Pracontal | Terry Oakes |
12 | La Vengeance des Démons | Jim Bambra | Mona de Pracontal | Martin McKenna |
13 | Le Sceptre Noir | Keith Martin | Camille Fabien | David Gallagher |
14 | La Nuit des Mutants | Peter Darvill-Evans | Anne Collas | Alan Langford |
15 | L'Élu des Six Clans | Luke Sharp | Noël Chassériau | Martin Mac Kenna, Martin McKenna |
16 | Le Volcan de Zamarra | Luke Sharp | Olivier Meyer | David Gallagher |
17 | Les Sombres Cohortes | Ian Livingstone | Noël Chassériau | Nik William |
18 | Le Vampire du Château Noir | Keith Martin | Mona de Pracontal | Martin McKenna |
19 | Le Voleur d'Âmes | Keith Martin | Mona de Pracontal | Russ Nicholson |
20 | Le Justicier de l'Univers | Martin Allen | Mona de Pracontal | Tim Sell |
21 | Les Esclaves de l'Eternité | Paul Mason | Sylvie Bonnet | Bob Harvey |
22 | La Créature venue du Chaos | Steve Jackson | Noël Chassériau | Alan Langford |
23 | Les Rôdeurs de la Nuit | Graeme Davis | Nicolas Grenier | John Sibbick |
24 | L'Empire des Hommes-Lézards | Marc Gascoigne | Jean Lacroix | David Gallagher |
25 | Les Gouffres de la Cruauté | Luke Sharp | Sylvie Bonnet | Russ Nicholson |
26 | Les Spectres de l'Angoisse | Robin Waterfield | Mona de Pracontal | Ian Miller |
27 | Le Chasseur des Étoiles | Luke Sharp | Arnaud Dupin de Beyssat | Cary Mayes, Gary Mayes |
28 | Les Sceaux de la Destruction | Robin Waterfield | Sylvie Bonnet | Russ Nicholson |
29 | La Crypte du Sorcier | Ian Livingstone | Noël Chassériau | John Sibbick |
30 | La Forteresse du Cauchemar | Peter Darvill-Evans | Mona de Pracontal | Dave Carson |
31 | La Grande Menace des Robots | Steve Jackson | Danielle Plociennik | Gary Mayes |
32 | L'Épée du Samouraï | Mark Smith | Pascale Jusforgues | Alan Langford |
33 | L'Épreuve des Champions | Ian Livingstone | Alain Vaulont, Pascale Jusforgues | Brian Williams |
34 | Défis Sanglants sur l'Océan | Andrew Chapman | Jean Walter | Bob Harvey |
35 | Les Démons des Profondeurs | Steve Jackson | Noël Chassériau | Bob Harvey |
36 | Rendez-vous avec la M.O.R.T. | Steve Jackson | Arnaud Dupin de Beyssat | Declan Considine |
37 | La Planète Rebelle | Robin Waterfield | C. Degolf | Gary Mayes |
38 | Les Trafiquants de Kelter | Andrew Chapman | Anne Blanchet | Nik Spender |
39 | Le Combattant de l'Autoroute | Ian Livingstone | Alain Vaulont, Pascale Jusforgues | Kevin Bulmer |
40 | Le Mercenaire de l'Espace | Andrew Chapman | Jean Walthers | Geoffroy Senior |
41 | Le Temple de la Terreur | Ian Livingstone | Denise May | Bill Houston |
42 | Le Manoir de l'Enfer | Steve Jackson | | |
43 | Le Marais aux Scorpions | Steve Jackson | Camille Fabien | Duncan Smith |
44 | Le Talisman de la Mort | Steve Jackson | Camille Fabien | Bob Harvey |
45 | La Sorcière des Neiges | Ian Livingstone | Michel Zénon | Edward Crosby, Gary Ward |
46 | La Citadelle du Chaos | Steve Jackson | Marie-Raymond Farré | Russ Nicholson |
47 | La Galaxie Tragique | Steve Jackson | Camille Fabien | Peter Jones |
48 | La Forêt de la Malédiction | Ian Livingstone | Camille Fabien | Malcolm Barter |
49 | La Cité des Voleurs | Ian Livingstone | Henri Robillot | Iain McCaig |
50 | Le Labyrinthe de la Mort | Ian Livingstone | Patricia Marais | Iain McCaig |
51 | L'Île du Roi Lézard | Ian Livingstone | Fabienne Vimereu | Alan Langford |
52 | Le Sorcier de la Montagne de Feu | Steve Jackson | Camille Fabien | Russ Nicholson |
Bear in mind that this method fails for Le Manoir de l'Enfer, because the word 'Illustrations' does not appear in its text. It's up to the OP to find a solution for that one.
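One possible way to handle entries like Le Manoir de l'Enfer is to walk through the paragraph's children in order and assign each link to whichever role marker was seen most recently, instead of relying on 'Illustrations' being present. This is only a sketch: the function name `split_roles`, the use of "Trad." as the translator marker, and both HTML snippets are assumptions made for illustration, and the real markup for that book may well differ.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def split_roles(paragraph):
    """Return (translators, illustrators) found in a credits paragraph."""
    roles = {"translator": [], "illustrator": []}
    current = None  # role announced by the most recent text marker
    for node in paragraph.children:
        if isinstance(node, NavigableString):
            # Assumed markers: "Trad." introduces translators,
            # "Illustrations" introduces illustrators.
            if "Illustrations" in node:
                current = "illustrator"
            elif "Trad." in node:
                current = "translator"
        elif isinstance(node, Tag) and node.name == "a" and current:
            roles[current].append(node.get_text(strip=True))
    return roles["translator"], roles["illustrator"]

# Hypothetical markup, modelled on the snippet in the question
with_both = BeautifulSoup(
    '<p>Trad. de l\'anglais par <a href="#">Camille Fabien</a>. '
    'Illustrations de <a href="#">Russ Nicholson</a></p>', "html.parser").p
without_illustrations = BeautifulSoup(
    '<p>Trad. de l\'anglais par <a href="#">Camille Fabien</a></p>', "html.parser").p

print(split_roles(with_both))
print(split_roles(without_illustrations))
```

In the scraping loop, something like `split_roles(history)` could then stand in for the two list comprehensions, provided the page's credits really follow this pattern.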
BeautifulSoup documentation can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Also, Pandas docs can be found here: https://pandas.pydata.org/pandas-docs/stable/index.html
CodePudding user response:
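from bs4 import BeautifulSoup
# Parse a local copy of the page; this gives a parsed document, not a list
with open("./test.html", "r") as f:
    soup = BeautifulSoup(f, 'html.parser')
names = []
# Collect the text of every link in the document
for elem in soup.find_all("a"):
    names.append(elem.text)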