How to scrape website with mouseover events and no class on most elements-CodePudding

I am trying to scrape a table with data off a website with mouse-over color changing events and on each row only the first two columns have a class: Elements

Additionally, when I do try to scrape those first two rows I get the output as such: Output

This is the code I am running.

lists = soup.find_all('table', class_= "report")

for list in lists:
    date = list.find_all('td', class_= "time")
    flag = list.find_all('td', class_= "flag")
    info = [date, flag]
    print(info)

I was expecting to receive only the numerical values so that I can export them and work with them.
I tried to use the .replace() function but it didn't remove anything. I was unable to use .text even after converting date and flag to strings.

CodePudding user response：

Notes

It might be better to not use list as a variable name since it already means something in python....

Also, I can't test any of the suggestions below without having access to your HTML. It would be helpful if you include how you fetched the HTML to parse with BeautifulSoup. (Did you use some form of requests? Or something like selenium? Or do you just have the HTML file?) With requests or even selenium, sometimes the HTML fetched is not what you expected, so that's another issue...

The Reason behind the Issue

You can't apply .text/.get_text to the ResultSets (lists), which are what .find_all and .select return. You can apply .text/.get_text to Tags like the ones returned by .find or .select_one (but you should check first None was not returned - which happens when nothing is found).

So date[0].get_text() might have returned something, but since you probably want all the dates and flags, that's not the solution.

Solution Part 1 Option A: Getting the Rows with `.find...` chain

Instead of iterating through tables directly, you need to get the rows first (tr tags) before trying to get at the cells (td tags); if you have only one table with class= "report", you could just do something like:

rows = soup.find('table', class_= "report").find_all('tr',  {'onmouseover': True})

But it's risky to chain multiple .find...s like that, because an error will be raised if any of them return None before reaching the last one.

Solution Part 1 Option B: Getting the Rows More Safely with `.find...`

It would be safer to do something like

table = soup.find('table', class_= "report")
rows = table.find_all('tr',  {'onmouseover': True}) if table else []

or, if there might be more than one table with class= "report",

rows = []
for table in soup.find_all('table', class_= "report"): 
    rows  = table.find_all('tr',  {'onmouseover': True})

Solution Part 1 Option C: Getting the Rows with `.select`

However, I think the most convenient way is to use .select with CSS selectors

# if there might be other tables with report AND other classes that you don't want:
# rows = soup.select('table[] tr[onmouseover]') 

rows = soup.select('table.report tr[onmouseover]')

this method is only unsuitable if there might be more than one table with class= "report", but you only want rows from the first one; in that case, you might prefer the table.find_....if table else [] approach.

Solution Part 2 Option A: Iterating Over `rows` to Print Cell Contents

Once you have rows, you can iterate over them to print the date and flag cell contents:

for r in rows:
    date, flag = r.find('td', class_= "time"), r.find('td', class_= "flag")
    info = [i.get_text() if i else i for i in [date, flag]]
    if any([i is not None for i in info]): print(info) 
    # [ only prints if there are any non-null values ]

Solution Part 2 Option B: Iterating Over `rows` with a Fucntion

Btw, if you're going to be extracting multiple tag attributes/texts repeatedly, you might find my selectForList function useful - it could have been used like

for r in rows: 
    info = selectForList(r, ['td.time', 'td.flag'], printList=True)
    if any([i is not None for i in info]): print(info)

or, to get a list of dictionaries like [{'time': time_1, 'flag': flag_1}, {'time': time_2, 'flag': flag_2}, ...],

infList = [selectForList(r, {
    'time': 'td.time', 'flag': 'td.flag', ## add selectors for any info you want 
}) for r in soup.select('table.report tr[onmouseover]')]
infList = [d for d in infoList if [v for v in d.values() if v is not None]]

Added EDIT:

To get all displayed cell contents:

for r in rows: 
    info = [(td.get_text(),''.join(td.get('style','').split())) for td in r.select('td')]
    info = [txt.strip() for txt, styl in r.select('td') if 'display:none' not in styl]
    # info = [i if i else 0 for i in info] # fill blanks with zeroes intead of ''
    if any(info):  print(info) ## or write to CSV

Notes

The Reason behind the Issue

Solution Part 1 Option A: Getting the Rows with .find... chain

Solution Part 1 Option B: Getting the Rows More Safely with .find...

Solution Part 1 Option C: Getting the Rows with .select