How to fetch very special data from an HTML file

Time:02-19

I am trying to scrape data from an HTML file that has a React props DIV in it, like this:

<html>

<div data-react-
    data-react-props="{
        &quot;targetUser
            &quot;:{
                &quot;targetUserLogin&quot;:&quot;user&quot;,
                &quot;targetUserDuration&quot;:&quot;11 months, 27 days&quot;,&quot;""
            }
        }

The thing I am looking for is the duration, like 11 months, 27 days, so I can add the parts up to get an exact number of days.

I have no idea how to get this data accurately, since another person's duration can be exactly 2 years, with no days in the text. I need both the years and the days so I can calculate. I wrote this to find the part of the file that I need, but I don't know how to approach the rest:

with open("data.html", 'r') as fpIn:
    for line in fpIn:
        line = line.rstrip()   # Strip trailing spaces and newline
        if "targetUserDuration" in line:
            print("Found")

CodePudding user response:

Use regular expressions to find it.

import re

html = '...&quot;targetUserDuration&quot;:&quot;11 months, 27 days&quot;,&quot;""...'

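# \d+ accepts any count; fixed digit alternations would miss values such as 10 to 19 days.
# [^&]*? keeps each match inside the duration string, which ends at the next &quot;.
years_re = re.compile(r'UserDuration&quot;:&quot;[^&]*?(\d+) year')
months_re = re.compile(r'UserDuration&quot;:&quot;[^&]*?(\d+) month')
days_re = re.compile(r'UserDuration&quot;:&quot;[^&]*?(\d+) day')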

year_found = years_re.search(html)
months_found = months_re.search(html)
days_found = days_re.search(html)

years, months, days = 0, 0, 0
if year_found:
    years = int(year_found.group(1))
if months_found:
    months = int(months_found.group(1))
if days_found:
    days = int(days_found.group(1))

print('years: ', years)
print('months: ', months)
print('days: ', days)

Result:

years:  0
months:  11
days:  27
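
To get the single number of days the question asks for, the parts still have to be added up. A minimal sketch, assuming 365 days per year and 30 days per month (an approximation, since the text carries no calendar dates); `duration_to_days` and `DAYS_PER_UNIT` are hypothetical names, not part of the original answer:

```python
import re

# Approximate conversion factors (assumptions: the text has no calendar dates).
DAYS_PER_UNIT = {"year": 365, "month": 30, "day": 1}


def duration_to_days(text):
    """Convert e.g. '11 months, 27 days' or '2 years' to an approximate day count."""
    total = 0
    # Each match is a (number, unit) pair; units missing from the text add nothing.
    for number, unit in re.findall(r"(\d+)\s+(year|month|day)s?", text):
        total += int(number) * DAYS_PER_UNIT[unit]
    return total


print(duration_to_days("11 months, 27 days"))  # 357
print(duration_to_days("2 years"))  # 730
```

Because missing units contribute zero, a value such as "2 years" needs no special handling.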

CodePudding user response:

I would probably start by looking at BeautifulSoup, which I think unescapes attribute values automatically. It means loading more libraries, but I would use html.unescape() and json.loads(), since they naturally fit the way the data is provided, rather than try to parse it myself. Hand parsing seems unnecessarily brittle here.

from html import unescape
from json import loads
text = """
{
    &quot;targetUser&quot;:{
        &quot;targetUserLogin&quot;:&quot;user&quot;,
        &quot;targetUserDuration&quot;:&quot;11 months, 27 days&quot;
    }
}
"""
print(loads(unescape(text))["targetUser"]["targetUserDuration"])

Gives you:

11 months, 27 days
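
The same idea can be applied to the tag itself instead of a pasted string. A rough sketch using only the standard library's `html.parser`, which decodes entities in attribute values before handing them over; the `ReactPropsParser` class and the `sample` markup are hypothetical, and the sample is a completed, single-line stand-in for the truncated snippet in the question:

```python
import json
from html.parser import HTMLParser


class ReactPropsParser(HTMLParser):
    """Collect the decoded data-react-props value of every tag that has one."""

    def __init__(self):
        super().__init__()
        self.props = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-react-props" and value:
                # HTMLParser has already turned &quot; into " at this point.
                self.props.append(json.loads(value))


# Hypothetical sample; a real page would carry more keys.
sample = (
    '<div data-react-props="{&quot;targetUser&quot;:{'
    '&quot;targetUserLogin&quot;:&quot;user&quot;,'
    '&quot;targetUserDuration&quot;:&quot;11 months, 27 days&quot;}}"></div>'
)

parser = ReactPropsParser()
parser.feed(sample)
duration = parser.props[0]["targetUser"]["targetUserDuration"]
print(duration)
```

For a real file, the contents of data.html would be passed to `feed()` in place of `sample`.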