I am trying to build a list of the different URLs that appear (partially) in the HTML source of this webpage:
https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas
I have tried a couple of different approaches, but neither of them really works.
First attempt
from bs4 import BeautifulSoup
import requests
import html
import urllib
import json
import re
url = 'https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
links = soup.find_all('div', class_ = "rftabdetailline accordion aem-GridColumn aem-GridColumn--default--12")
links contains the following:
[<div >
<!-- rf-tab-detail-line en resto de modos -->
<rf-tab-detail-line content='[{"color":"120,180,225","name":"C1","active":"true","stations":"València Nord \u2013 Gandía","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html"},{"color":"245,150,40","name":"C2","active":"false","stations":"València Nord \u2013 Xàtiva \u2013 Moixent","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html"},{"color":"125,37,130","name":"C3","active":"false","stations":"València Sant Isidre \u2013 Buñol \u2013 Utiel","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014184209.html"},{"color":"215,0,30","name":"C4","active":"false","stations":"València Sant Isidre \u2013 Xirivella L\u2019Alter","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014185974.html"},{"color":"0,139,41","name":"C5","active":"false","stations":"València Nord \u2013 Caudiel","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014187588.html"},{"color":"15,50,135","name":"C6","active":"false","stations":"València Nord \u2013 Castelló","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014189921.html"},{"color":"150,100,40","name":"ER02","active":"false","stations":"Castelló - Vinaròs","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1598612779629.html"}]' title-text="Seleccione una línea:">
</rf-tab-detail-line>
</div>]
In it, you can see the pieces that I want, for example "url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html". I would like to collect all the different /content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_WHATEVER.html paths in a list. To do so, I have tried using extract and regular expressions, but I have not been successful.
Second Attempt
Following the steps shown in the answer to the question "Extractinf info form HTML that has no tags", I ended up with the following piece of code:
import requests
import html
import json
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas'
response = requests.get(url)
data = response.text # get data from site
raw_list = data.split("'")[8] # extract attributes
json_list = html.unescape(raw_list) # decode html symbols
parsed_list = json.loads(json_list) # parse json
I thought it would work because the output it produces looks similar, but when parsed_list is defined the following error is raised:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Does anyone have any thoughts? Thank you all in advance!
CodePudding user response:
This way:
import html
import json
import re
import requests
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas'
response = requests.get(url)
page_text = response.text # get data from site
regex = r"<rf-tab-detail-line title-text=\"Seleccione una línea:\" content=\"([^\"]+)"
encoded_content = re.findall(regex, page_text)
if len(encoded_content) == 0:
    print("Nothing found, possibly page structure changed.")
    exit()
encoded_content = html.unescape(encoded_content[0])
json_content = json.loads(encoded_content)
for item in json_content:
    print(item["url"])
Output:
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014184209.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014185974.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014187588.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014189921.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1598612779629.html
Hope this is what you needed.
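For anyone who wants to try the decode steps without hitting the site, here is a minimal offline sketch of the same idea. The sample_html string, the shortened paths in it, and the extract_urls helper are made up for illustration; it assumes the raw attribute value is HTML-entity-encoded (here with &quot;), which is why html.unescape has to run before json.loads. The regex is also slightly loosened so that it does not depend on title-text coming directly before content.

```python
import html
import json
import re

# Hypothetical, shortened stand-in for the raw page source (not the real response).
sample_html = (
    '<rf-tab-detail-line title-text="Seleccione una línea:" '
    'content="[{&quot;name&quot;:&quot;C1&quot;,&quot;url&quot;:&quot;/lineas/item_1.html&quot;},'
    '{&quot;name&quot;:&quot;C2&quot;,&quot;url&quot;:&quot;/lineas/item_2.html&quot;}]">'
    '</rf-tab-detail-line>'
)


def extract_urls(page_text):
    """Return the "url" values from the content attribute, or [] if it is missing."""
    # Capture the double-quoted content attribute of the rf-tab-detail-line tag.
    match = re.search(r'<rf-tab-detail-line\b[^>]*?\bcontent="([^"]+)"', page_text)
    if match is None:
        return []
    decoded = html.unescape(match.group(1))  # turn &quot; back into "
    return [item["url"] for item in json.loads(decoded)]


urls = extract_urls(sample_html)
print(urls)
```

Returning an empty list instead of calling exit() is just a design choice for this sketch, so the helper can be reused inside a larger script.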
CodePudding user response:
I would instead use a CSS attribute = value selector to target the single element housing that data, as it is more intuitive to read. Then you simply need to extract the content attribute and handle it with the json library, filtering for the url key-value pairs.
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
data = json.loads(soup.select_one('[title-text="Seleccione una línea:"]')['content'])
links = [i['url'] for i in data]
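If you would rather not depend on bs4, the same "find the element by its title-text attribute, then read content" idea can be sketched with the standard library's html.parser. This is an alternative sketch, not the code above: ContentAttrParser, links_from_html and the sample string (with a shortened, made-up path) are hypothetical names and data used only for illustration.

```python
import json
from html.parser import HTMLParser


class ContentAttrParser(HTMLParser):
    """Remember the content attribute of the first tag whose title-text matches."""

    def __init__(self, title_text):
        super().__init__()
        self.title_text = title_text
        self.content = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive already unescaped.
        attr_map = dict(attrs)
        if self.content is None and attr_map.get("title-text") == self.title_text:
            self.content = attr_map.get("content")


def links_from_html(page_text, title_text="Seleccione una línea:"):
    """Return the "url" values of the matching element, or [] if none is found."""
    parser = ContentAttrParser(title_text)
    parser.feed(page_text)
    if parser.content is None:
        return []
    return [item["url"] for item in json.loads(parser.content)]


# Hypothetical sample shaped like the element printed in the question.
sample = (
    "<div><rf-tab-detail-line "
    "content='[{\"name\":\"C1\",\"url\":\"/lineas/item_1.html\"}]' "
    'title-text="Seleccione una línea:"></rf-tab-detail-line></div>'
)
print(links_from_html(sample))
```

The None check matters for the bs4 version too: select_one returns None when nothing matches, so indexing ['content'] on the result would raise a TypeError if the page structure changes.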