I am using both Selenium and BeautifulSoup to do some web scraping. So far I have the following code:
from selenium.webdriver import Chrome
from bs4 import BeautifulSoup
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
driver = Chrome()  # start a Chrome session
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
The output soup produces has the following structure:
<html>
<head>
</head>
<body>
<rf-list-detail line-color="245,150,40" line-number="C2" line-text="Línea C2"
list="[{... ;direction":"Place1"}
,... ,
;direction":"Place2"}...
Note that both the text and the output layout have been modified for readability. I attach an image of the actual output in case it is more convenient.
Does anyone know how I could collect every PlaceN (in the image, Moixent would be Place1) in a list? Something like
places = [Place1,...,PlaceN]
I have tried parsing it, but since the values have no tags of their own (or at least my HTML knowledge, which is close to none, says so) I get nothing. I have also tried using a regular expression, which I have only just found out were a thing, but I am not sure how to do it properly.
Any thoughts?
Thank you in advance!!
Answer:
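This URL does not return a regular HTML page: the data sits as HTML-escaped JSON inside the list attribute of the rf-list-detail element. So you do not need an HTML parser like BeautifulSoup or lxml for this task.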
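To make the format concrete, here is a minimal sketch of the decoding step on its own. The attribute value below is a hypothetical, shortened sample (the real one carries more keys per item), and it assumes the quotes are escaped as entities such as &quot;.

```python
import html
import json

# Hypothetical, shortened attribute value; not the real response.
raw_attribute = (
    "[{&quot;direction&quot;:&quot;Place1&quot;},"
    "{&quot;direction&quot;:&quot;Place2&quot;}]"
)

decoded = html.unescape(raw_attribute)  # &quot; becomes "
items = json.loads(decoded)             # JSON text becomes a list of dicts
places = [item["direction"] for item in items]
print(places)  # ['Place1', 'Place2']
```

Once the entities are decoded, the value is ordinary JSON, which is why the standard json module is enough.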
Here is an example using the requests library, which you can install like this:
pip install requests
import requests
import html
import json
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
response = requests.get(url)
data = response.text  # raw body returned by the site
raw_list = data.split("'")[1]  # value of the rf-list-detail list attribute (first single-quoted string)
json_list = html.unescape(raw_list)  # decode HTML entities
parsed_list = json.loads(json_list)  # parse the JSON into a list of dicts
print(parsed_list)  # full parsed result
directions = []
for item in parsed_list:
    directions.append(item["direction"])  # keep only the direction of each item
print(directions)
# ['Moixent', 'Vallada', 'Montesa', "L'Alcudia de Crespins", 'Xàtiva', "L'Enova-Manuel", 'La Pobla Llarga', 'Carcaixent', 'Alzira', 'Algemesí', 'Benifaió-Almussafes', 'Silla', 'Catarroja', 'Massanassa', 'Alfafar-Benetússer', 'València Nord']
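Since the question mentions regular expressions, here is a rough sketch of the same extraction with re instead of split("'")[1], which depends on the list attribute being the first single-quoted string in the response. The names LIST_ATTR and extract_directions and the sample markup are made up for illustration. The sketch assumes the attribute value contains no unescaped quote of the kind that delimits it.

```python
import html
import json
import re

# Matches list='...' or list="..." and captures the value between the quotes.
# The lookbehind avoids matching attributes such as data-list.
LIST_ATTR = re.compile(r"""(?<![\w-])list=(['"])(.*?)\1""", re.DOTALL)


def extract_directions(markup):
    """Return the "direction" value of every item in the list attribute."""
    match = LIST_ATTR.search(markup)
    if match is None:
        return []  # attribute not found
    items = json.loads(html.unescape(match.group(2)))
    return [item["direction"] for item in items if "direction" in item]


# Hypothetical, shortened stand-in for response.text
sample = (
    "<rf-list-detail line-color='245,150,40' line-number='C2' "
    "list='[{&quot;direction&quot;:&quot;Moixent&quot;},"
    "{&quot;direction&quot;:&quot;Vallada&quot;}]'></rf-list-detail>"
)
print(extract_directions(sample))  # ['Moixent', 'Vallada']
```

With the real page you would call extract_directions(response.text), or extract_directions(driver.page_source) if you stay with Selenium.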