Extracting info from HTML that has no tags

Time:12-26

I am using both Selenium and BeautifulSoup to do some web scraping. So far I have the following code:

from selenium.webdriver import Chrome 
from bs4 import BeautifulSoup

url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
driver = Chrome()  # start a browser session (assumes a Chrome driver is available)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')

The output soup produces has the following structure:

<html>
<head>
</head>
<body>
<rf-list-detail line-color="245,150,40" line-number="C2" line-text="Línea C2" 
list="[{... ;direction&quot;:&quot;Place1&quot;}
,... , 
;direction&quot;:&quot;Place2&quot;}...

Note that both the text and the output formatting have been modified for readability. I attach an image of the actual output in case it is more convenient.

Does anyone know how I could obtain every PlaceN (in the image, Moixent would be Place1) in a list? Something like:

places = [Place1,...,PlaceN]

I have tried parsing it, but as it has no tags (or at least my HTML knowledge, which is next to none, says so) I obtain nothing. I have also tried using a regular expression, which I have just found out is a thing, but I am not sure how to do it properly.

Any thoughts?

Thank you in advance!!

[Image: output of soup]

CodePudding user response:

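This URL does not return a regular HTML page. The data is stored as HTML-escaped JSON in the list attribute of the rf-list-detail element, so you do not need an HTML parser like BeautifulSoup or lxml for this task.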

Here is an example using the requests library. You can install it like this:

pip install requests

Then the script itself:

import requests
import html
import json

url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
response = requests.get(url)
data = response.text  # get data from site

raw_list = data.split("'")[1]  # extract the rf-list-detail list attribute (assumes it is the first single-quoted value)
json_list = html.unescape(raw_list)  # decode html symbols
parsed_list = json.loads(json_list)  # parse json 

print(parsed_list)  # printing result

# extract the "direction" field of every entry
directions = [item["direction"] for item in parsed_list]
print(directions)

# ['Moixent', 'Vallada', 'Montesa', "L'Alcudia de Crespins", 'Xàtiva', "L'Enova-Manuel", 'La Pobla Llarga', 'Carcaixent', 'Alzira', 'Algemesí', 'Benifaió-Almussafes', 'Silla', 'Catarroja', 'Massanassa', 'Alfafar-Benetússer', 'València Nord']
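
Since the question mentions regular expressions, here is a minimal sketch of the same idea with re instead of split. It is not the only way to do it. SAMPLE is a made-up snippet that only mimics the structure shown in the question, and extract_directions and LIST_ATTR are hypothetical names chosen for illustration. The pattern assumes the list attribute value contains no unescaped quote of the kind that delimits it; otherwise the match would stop early.

```python
import html
import json
import re

# Hypothetical sample mimicking the structure shown in the question;
# the real response is assumed to look similar but is not reproduced here.
SAMPLE = (
    "<rf-list-detail line-number=\"C2\" "
    "list='[{&quot;direction&quot;:&quot;Place1&quot;},"
    "{&quot;direction&quot;:&quot;Place2&quot;}]'>"
    "</rf-list-detail>"
)

# Match list='...' or list="...": the opening quote is captured and reused
# as the closing quote. The lookbehind skips attributes whose name merely
# ends in "list".
LIST_ATTR = re.compile(r"""(?<![\w-])list\s*=\s*(["'])(.*?)\1""", re.DOTALL)


def extract_directions(page_source):
    """Return the "direction" value of every entry in the list attribute."""
    match = LIST_ATTR.search(page_source)
    if match is None:
        return []  # attribute not found
    # Decode HTML entities (&quot; -> "), then parse the resulting JSON.
    entries = json.loads(html.unescape(match.group(2)))
    return [entry["direction"] for entry in entries]


print(extract_directions(SAMPLE))  # ['Place1', 'Place2']
```

With the code from the question, the same function could be called as extract_directions(driver.page_source), or as extract_directions(response.text) with requests.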

