Saving all the columns from an HTML table using Beautiful Soup in Python

I have two kinds of rows I am trying to convert to a table from a website.

The first one looks like this,

<tr id="eventRowId_750"> <td >All Day</td> <td ><span  data-img_key="France" title="France"> </span></td> <td ><span >Holiday</span></td> <td  colspan="6">French - Flower Festival</td> </tr>

The second kind of row looks like this,

<tr  data-event-datetime="2022/02/02 01:00:00" event_attr_id="114" id="eventRowId_444333"> <td  title="">01:00</td> <td ><span  data-img_key="Australia" title="Australia"> </span> AUS</td> <td  data-img_key="bull1" title="Low Impact"><i ></i><i ></i><i ></i></td> <td  title="Click to view more info on Australian Budget"><a href="australian-budget-114" target="_blank">      Australian Budget  (Dec)</a> </td> <td  id="eventActual_444333" title="">-5M</td> <td  id="eventForecast_444333"> </td> <td  id="eventPrevious_444333"><span title="Revised From -3M">-2M</span></td> <td  data-event-id="114" data-name="Australian Budget" data-status-enabled="0"> <span  data-tooltip="Create Alert" data-tooltip-alt="Alert is active"></span> </td> </tr>

I am trying to convert them into rows using Python and Beautiful Soup. I use the following code:

for items in soup.select("tr"):
    data = [item.get_text(strip=True) for item in items.select("th,td")]
    print(data)

But my output looks like this,

['All Day', '', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', '', 'Australian Budget  (Dec)', '-5M', '', '-2M', '']

How can I get the "Low Impact" text into the third column of the second row (the column where "Holiday" appears in the first row), and save the name "France" into the second column of the first row, so that the output looks like this?

['All Day', 'France', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', 'Low Impact', 'Australian Budget  (Dec)', '-5M', '', '-2M', '']

This part is not really important, but is it possible to save the span title, if one exists, by adding it to the end of the list? I mean the part that says "Revised From -3M". The output could then look like this:

['All Day', '', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', 'Low Impact', 'Australian Budget  (Dec)', '-5M', '', '-2M', '', "Revised From -3M"]

CodePudding user response:

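I don't think there is a single clean pattern that covers every cell, so here we go. When a cell has no visible text, I search the cell's HTML for a title attribute with a regex; I couldn't think of another way to get the title, because it's not tied to one particular tag (sometimes it sits on the td, sometimes on a span inside it).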

from bs4 import BeautifulSoup
import re

with open("example.html") as html_doc:
    soup = BeautifulSoup(html_doc, "html.parser")

for items in soup.select("tr"):
    row = []
    for item in items.select("th,td"):
        text = item.get_text(strip=True)
        if not text:
            title = re.search(r"title=\"(.*?)\"", str(item))
            if title:
                text = title.group(1)
        row.append(text)
    print(row)

# output
['All Day', 'France', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', 'Low Impact', 'Australian Budget  (Dec)', '-5M', '', '-2M', '']
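
As an alternative to the regex, the same fallback can probably be written with Beautiful Soup's own attribute lookups. This is only a sketch under some assumptions: the HTML below is a trimmed copy of the two rows from the question, and `cell_text` is a helper name made up for this example. It checks the cell's own title first, then the first descendant tag that has a title.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two sample rows from the question (assumed input)
html = """
<table>
<tr id="eventRowId_750"><td>All Day</td><td><span data-img_key="France" title="France"> </span></td><td><span>Holiday</span></td><td colspan="6">French - Flower Festival</td></tr>
<tr id="eventRowId_444333"><td title="">01:00</td><td><span data-img_key="Australia" title="Australia"> </span> AUS</td><td data-img_key="bull1" title="Low Impact"><i></i><i></i><i></i></td><td title="Click to view more info on Australian Budget"><a href="australian-budget-114" target="_blank">      Australian Budget  (Dec)</a> </td><td id="eventActual_444333" title="">-5M</td><td id="eventForecast_444333"> </td><td id="eventPrevious_444333"><span title="Revised From -3M">-2M</span></td><td data-event-id="114"> <span data-tooltip="Create Alert"></span> </td></tr>
</table>
"""


def cell_text(cell):
    """Return the cell's visible text, or a title attribute if the text is empty."""
    text = cell.get_text(strip=True)
    if text:
        return text
    # Hypothetical fallback order: the cell's own title first,
    # then the first descendant tag that carries a title attribute.
    title = cell.get("title")
    if not title:
        tagged = cell.find(attrs={"title": True})
        title = tagged["title"] if tagged else ""
    return title


soup = BeautifulSoup(html, "html.parser")
rows = [[cell_text(cell) for cell in tr.select("th,td")] for tr in soup.select("tr")]
for row in rows:
    print(row)
```

Compared with the regex, this only looks at real title attributes of parsed tags, so it should not be confused by similarly named attributes or by text that happens to contain `title="..."`.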

CodePudding user response:

I believe the closest you can get (assuming the pattern holds across all your rows) is something like this:

for items in soup.select("tr"):
    # Cell texts first, then the title of every span in the row that has one
    row = [item.text.strip() for item in items.select('td')] + \
          [item['title'] for item in items.select('span[title]')]
    print(row)

Output:

['All Day', '', 'Holiday', 'French - Flower Festival', 'France']
['01:00', 'AUS', '', 'Australian Budget  (Dec)', '-5M', '', '-2M', '', 'Australia', 'Revised From -3M']

Obviously, you will need to manipulate rows to exclude unwanted elements. For example, to remove empty elements, you can change the last line to read:

print([element for element in row if element.strip()])

which will change the output to:

['All Day', 'Holiday', 'French - Flower Festival', 'France']
['01:00', 'AUS', 'Australian Budget  (Dec)', '-5M', '-2M', 'Australia', 'Revised From -3M']
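
If the columns need to stay aligned and only notes like "Revised From -3M" should be appended (the optional part of the question), one possible variation is to keep the empty cells and append a span's title only when that span also shows text of its own. This is a sketch, not a tested solution for the real page: the HTML is a shortened version of the question's rows, and the rule "a span with visible text carries a note" is an assumption based on the two sample rows.

```python
from bs4 import BeautifulSoup

# Shortened versions of the two sample rows from the question (assumed input)
html = """
<table>
<tr><td>All Day</td><td><span title="France"> </span></td><td><span>Holiday</span></td><td colspan="6">French - Flower Festival</td></tr>
<tr><td title="">01:00</td><td><span title="Australia"> </span> AUS</td><td title="Low Impact"><i></i></td><td id="eventPrevious_444333"><span title="Revised From -3M">-2M</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("tr"):
    # Keep every cell, including empty ones, so column positions do not shift
    cells = [td.get_text(strip=True) for td in tr.select("th,td")]
    # Assumption: only spans with visible text of their own carry a note worth appending
    notes = [span["title"] for span in tr.select("span[title]")
             if span.get_text(strip=True)]
    rows.append(cells + notes)

for row in rows:
    print(row)
```

The flag spans ("France", "Australia") contain only whitespace, so their titles are skipped here, while the span around "-2M" contributes its title at the end of the row. Empty cells such as the "Low Impact" column are left empty by this snippet.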