Extracting URL from HTML in python

Time:12-27

I am trying to obtain a list of the URLs that appear (as partial paths) in the HTML source of this webpage:

https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas

I have tried a couple of different approaches, but neither of them works.

First attempt

from bs4 import BeautifulSoup
import requests
import html
import urllib
import json
import re

url = 'https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
links = soup.find_all('div', class_ = "rftabdetailline accordion aem-GridColumn aem-GridColumn--default--12")

links contains the following:

 [<div >
 <!-- rf-tab-detail-line en resto de modos -->
 <rf-tab-detail-line content='[{"color":"120,180,225","name":"C1","active":"true","stations":"València Nord \u2013 Gandía","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html"},{"color":"245,150,40","name":"C2","active":"false","stations":"València Nord \u2013 Xàtiva \u2013 Moixent","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html"},{"color":"125,37,130","name":"C3","active":"false","stations":"València Sant Isidre \u2013 Buñol \u2013 Utiel","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014184209.html"},{"color":"215,0,30","name":"C4","active":"false","stations":"València Sant Isidre \u2013 Xirivella L\u2019Alter","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014185974.html"},{"color":"0,139,41","name":"C5","active":"false","stations":"València Nord \u2013 Caudiel","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014187588.html"},{"color":"15,50,135","name":"C6","active":"false","stations":"València Nord \u2013 Castelló","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014189921.html"},{"color":"150,100,40","name":"ER02","active":"false","stations":"Castelló - Vinaròs","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1598612779629.html"}]' title-text="Seleccione una línea:">
 </rf-tab-detail-line>
 </div>]

In it, you can see the pieces that I want, for example "url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html". I would like to collect all the different /content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_WHATEVER.html paths in a list. To do so, I have tried extracting them with regular expressions, but I have not been successful.

Second Attempt

Following the steps shown in the answer to the question "Extracting info from HTML that has no tags", I arrived at the following code:

import requests
import html
import json

url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas'
response = requests.get(url)
data = response.text  # get data from site
raw_list = data.split("'")[8]  # extract attributes
json_list = html.unescape(raw_list)  # decode html symbols
parsed_list = json.loads(json_list)  # parse json 

I thought that it would work because the output it produces looks similar, but the line that defines parsed_list raises the following error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Does anyone have any thoughts? Thank you all in advance!

CodePudding user response:

This way:

import html
import json
import re
import requests

url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas'
response = requests.get(url)
page_text = response.text  # get data from site

regex = r"<rf-tab-detail-line title-text=\"Seleccione una línea:\" content=\"([^\"]+)"
encoded_content = re.findall(regex, page_text)

if len(encoded_content) == 0:
    print("Nothing found, possibly page structure changed.")
    exit()

encoded_content = html.unescape(encoded_content[0])
json_content = json.loads(encoded_content)

for item in json_content:
    print(item["url"])

Output:

/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014184209.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014185974.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014187588.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014189921.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1598612779629.html

Hope this is what you needed.
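
The same steps can be tried offline. The sketch below is only an illustration: the sample markup and its shortened /lineas/item_N.html paths are made up, and it assumes the raw page source HTML-escapes the JSON inside the content attribute (quotes appear as &quot;). It also shows that json.loads reports the error from the question whenever the string it receives does not start with a JSON value, which is probably what happened with data.split("'")[8], since that index depends on the exact page source.

```python
import html
import json
import re

# Hypothetical, shortened sample of the raw markup (paths are placeholders).
sample = (
    '<rf-tab-detail-line title-text="Seleccione una línea:" '
    'content="[{&quot;name&quot;:&quot;C1&quot;,&quot;url&quot;:&quot;/lineas/item_1.html&quot;},'
    '{&quot;name&quot;:&quot;C2&quot;,&quot;url&quot;:&quot;/lineas/item_2.html&quot;}]">'
    '</rf-tab-detail-line>'
)

# Passing markup (or any non-JSON fragment) to json.loads fails at char 0.
try:
    json.loads(sample)
except json.JSONDecodeError as err:
    error_message = str(err)
print(error_message)

# Capture the attribute value, decode the HTML entities, then parse the JSON.
match = re.search(r'content="([^"]+)"', sample)
decoded = html.unescape(match.group(1))
urls = [item["url"] for item in json.loads(decoded)]
print(urls)
```

Printing the string right before json.loads is a quick way to check that the slice or capture group really starts with "[".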

CodePudding user response:

I would instead use a CSS attribute = value selector to target the single element that holds the data, since it is more intuitive to read. Then you only need to extract the content attribute, parse it with the json library, and pick out the value of each url key.

import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
data = json.loads(soup.select_one('[title-text="Seleccione una línea:"]')['content'])
links = [i['url'] for i in data]
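
If BeautifulSoup is not available, a similar result can be sketched with the standard library alone. This is a swapped-in technique, not the method used above: the ContentAttrParser class is a hypothetical helper, the sample markup is a one-item excerpt of the output shown in the question, and joining the paths onto the page URL assumes they are relative to the site root.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class ContentAttrParser(HTMLParser):
    """Collects the content attribute of every <rf-tab-detail-line> tag."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names and unescapes attribute values.
        if tag == "rf-tab-detail-line":
            value = dict(attrs).get("content")
            if value is not None:
                self.contents.append(value)


# One-item excerpt of the markup from the question.
sample = (
    "<div><rf-tab-detail-line content='[{\"name\":\"C1\",\"url\":"
    "\"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content"
    "/root/responsivegrid/rftabdetailline/item_1.html\"}]' "
    "title-text=\"Seleccione una línea:\"></rf-tab-detail-line></div>"
)

parser = ContentAttrParser()
parser.feed(sample)

base = "https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas"
links = [item["url"] for item in json.loads(parser.contents[0])]
# Assumption: the paths are site-root relative, so urljoin keeps only the host.
absolute_links = [urljoin(base, link) for link in links]
print(absolute_links)
```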