How to scrape web table that has rows within rows?

I'm trying to scrape a table, and whoever set this up put a bunch of information into a one-column table, where each row contains many rows of its own.

I'd like to grab each row from inside the outer row and turn it into a row of a data frame. I'd also like to put the information located in <strong> </strong> in a column alongside each of those rows.

Is there a way to do this with Python? I've been working with Selenium and pandas read_html, but I think I've run into a wall here. Ultimately, I would like to concatenate all of this information together into a data frame.

The HTML looks like this:

<td>
    <strong>    Important Information1  </strong>
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
</td>
<td>
    <strong>    Important Information 2 </strong>
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2  
</td>
<td>
    <strong>    Important Information 3 </strong>
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3  
</td>

Expected Outcome:

           Important Header Some Information Header
0   Important Information1         Some information
1   Important Information1         Some information
2   Important Information1         Some information
3   Important Information1         Some information
4   Important Information1         Some information
5   Important Information1         Some information
6    Important Information2      Some information 2
7    Important Information2      Some information 2
8    Important Information2      Some information 2
9    Important Information2      Some information 2
10   Important Information2      Some information 2
11   Important Information2      Some information 2
12   Important Information2      Some information 2
13   Important Information2      Some information 2
14   Important Information2      Some information 2
15   Important Information2      Some information 2
16   Important Information3      Some information 3
17   Important Information3      Some information 3
18   Important Information3      Some information 3
19   Important Information3      Some information 3

CodePudding user response:

If I understand you correctly, you want to convert the HTML document into a 3-column pandas DataFrame:

import pandas as pd
from bs4 import BeautifulSoup

html_doc = """
<td>
    <strong>    Important Information1  </strong>
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
</td>
<td>
    <strong>    Important Information 2 </strong>
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2  
</td>
<td>
    <strong>    Important Information 3 </strong>
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3  
</td>
"""

soup = BeautifulSoup(html_doc, "html.parser")

cols = []
for td in soup.select("td"):
    # The first "|"-separated token is the <strong> title; the rest are
    # the <br>-separated information lines of that cell.
    col_name, *data = td.get_text(strip=True, separator="|").split("|")
    cols.append(pd.Series(data, name=col_name))

print(pd.concat(cols, axis=1))

Prints:

  Important Information1 Important Information 2 Important Information 3
0       Some information      Some information 2      Some information 3
1       Some information      Some information 2      Some information 3
2       Some information      Some information 2      Some information 3
3       Some information      Some information 2      Some information 3
4       Some information      Some information 2                     NaN
5       Some information      Some information 2                     NaN
6                    NaN      Some information 2                     NaN
7                    NaN      Some information 2                     NaN
8                    NaN      Some information 2                     NaN
9                    NaN      Some information 2                     NaN
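If you need the two-column long format from your expected outcome instead, you can reshape the wide frame with melt (a minimal sketch; the column names are just the headers from your example):

df = pd.concat(cols, axis=1)

# Unpivot: each (title, value) pair becomes one row; dropna removes the
# NaN padding that comes from the unequal column lengths
long_df = (
    df.melt(var_name="Important Header", value_name="Some Information Header")
      .dropna()
      .reset_index(drop=True)
)
print(long_df)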

CodePudding user response:

It's hard to tell exactly what would work best for you without an example of how you plan to scrape those elements, but if I assume you are starting from scratch, I would suggest getting an element and then getting that element's children.

It will probably need error handling to be robust. Many people prefer using CSS selectors as identifiers, but I personally like XPaths.

It might look like:

from selenium.webdriver.common.by import By

parents = driver.find_elements(By.XPATH, 'xpath to parent')
for parent in parents:
    for child in parent.find_elements(By.XPATH, './*'):
        ...  # do something with each child element

Some logic will need to select each of the parent elements, but that will really depend on the particular page you want to scrape.

This is shown in greater detail in this Stack Overflow post: Get all child elements
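For the markup in the question, a minimal sketch of that approach might look like this (assuming each <td> holds one <strong> title followed by <br>-separated lines, and that driver is already on the page):

import pandas as pd
from selenium.webdriver.common.by import By

rows = []
for td in driver.find_elements(By.TAG_NAME, "td"):
    title = td.find_element(By.TAG_NAME, "strong").text.strip()
    # <br> has no text of its own, so split the cell's rendered text into
    # lines; the <strong> title renders as the first line, so skip it
    lines = [ln.strip() for ln in td.text.splitlines() if ln.strip()]
    for value in lines[1:]:
        rows.append({"Important Header": title,
                     "Some Information Header": value})

df = pd.DataFrame(rows)
print(df)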

CodePudding user response:

  • Make sure to import the packages and set up the environment:

    # >> Get ready: import the packages you are using
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # >> Set up the Chrome browser (adjust the chromedriver path to your system)
    chromedriver = r"C:\Program Files\Python39\Scripts\chromedriver"
    driver = webdriver.Chrome(service=Service(chromedriver))

  • Code snippet for scraping:

    # - Scraping: collect the <strong> title and the <br>-separated
    #   values of each <td>
    from selenium.webdriver.common.by import By

    element_list = driver.find_elements(By.TAG_NAME, 'td')
    Data = []
    for item in element_list:
        Title = item.find_element(By.TAG_NAME, 'strong').text.strip()
        # <br> elements carry no text of their own, so split the cell's
        # rendered text into lines and drop the title line instead of
        # querying the <br> tags directly
        Values = [line.strip() for line in item.text.splitlines() if line.strip()]
        Data.append([Title] + Values[1:])

    # - Show results:
    print('- Data[1] = ', Data[0])
    print('- Data[2] = ', Data[1])
    print('- Data[3] = ', Data[2])

  • Update: code to export the results to CSV

import pandas as pd

def pad(data):
    # Pad every list to the length of the longest one so the columns line up
    max_n = max(len(x) for x in data.values())
    for field in data:
        data[field] += [''] * (max_n - len(data[field]))
    return data

def merge_dicts(*dict_args):
    """
    Given any number of dictionaries, shallow copy and merge into a new dict,
    precedence goes to key-value pairs in latter dictionaries.
    """
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

Data_1 = Data[0]
Data_2 = Data[1]
Data_3 = Data[2]

sdata_1 = {"Data_1":Data_1, "Data_2":Data_2}
sdata_2 = { "Data_3":Data_3}
data = merge_dicts(sdata_1, sdata_2)
print(data)

df = pd.DataFrame(pad(data))
df.to_csv("output.csv", index=False)

print('>> Finish export to CSV')
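As an aside, pandas can pad unequal-length columns by itself, so the pad helper above is optional; an equivalent sketch:

# from_dict(orient='index') accepts lists of different lengths and pads the
# short ones with NaN; transposing gives one column per key again
df = pd.DataFrame.from_dict(data, orient='index').transpose()
df.to_csv("output.csv", index=False)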
