I'm trying to scrape a table, and whoever set this up put a bunch of information in a one-column table where each cell contains many lines of data.
I'd like to grab each line from inside a cell and make it a row of a data frame. I'd also like to set the information located in <strong> </strong>
as a column for the entire data frame.
Is there a way to do this with Python? I've been working with Selenium and pandas read_html, but I think I've run into a wall here. Ultimately, I would like to concat all of this information together into one data frame.
The HTML looks like this.
<td>
<strong> Important Information1 </strong>
<br> Some information
<br> Some information
<br> Some information
<br> Some information
<br> Some information
<br> Some information
</td>
<td>
<strong> Important Information 2 </strong>
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
</td>
<td>
<strong> Important Information 3 </strong>
<br> Some information 3
<br> Some information 3
<br> Some information 3
<br> Some information 3
</td>
Expected Outcome:
    Important Header        Some Information Header
0   Important Information1  Some information
1   Important Information1  Some information
2   Important Information1  Some information
3   Important Information1  Some information
4   Important Information1  Some information
5   Important Information1  Some information
6   Important Information2  Some information 2
7   Important Information2  Some information 2
8   Important Information2  Some information 2
9   Important Information2  Some information 2
10  Important Information2  Some information 2
11  Important Information2  Some information 2
12  Important Information2  Some information 2
13  Important Information2  Some information 2
14  Important Information2  Some information 2
15  Important Information2  Some information 2
16  Important Information3  Some information 3
17  Important Information3  Some information 3
18  Important Information3  Some information 3
19  Important Information3  Some information 3
CodePudding user response:
If I understand you correctly, you want to convert the HTML document into a 3-column pandas DataFrame:
import pandas as pd
from bs4 import BeautifulSoup
html_doc = """
<td>
<strong> Important Information1 </strong>
<br> Some information
<br> Some information
<br> Some information
<br> Some information
<br> Some information
<br> Some information
</td>
<td>
<strong> Important Information 2 </strong>
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
<br> Some information 2
</td>
<td>
<strong> Important Information 3 </strong>
<br> Some information 3
<br> Some information 3
<br> Some information 3
<br> Some information 3
</td>
"""
soup = BeautifulSoup(html_doc, "html.parser")

cols = []
for td in soup.select("td"):
    # First "|"-separated token is the <strong> title, the rest are the <br> lines
    col_name, *data = td.get_text(strip=True, separator="|").split("|")
    cols.append(pd.Series(data, name=col_name))

print(pd.concat(cols, axis=1))
Prints:
  Important Information1  Important Information 2  Important Information 3
0       Some information       Some information 2       Some information 3
1       Some information       Some information 2       Some information 3
2       Some information       Some information 2       Some information 3
3       Some information       Some information 2       Some information 3
4       Some information       Some information 2                      NaN
5       Some information       Some information 2                      NaN
6                    NaN       Some information 2                      NaN
7                    NaN       Some information 2                      NaN
8                    NaN       Some information 2                      NaN
9                    NaN       Some information 2                      NaN
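If you'd rather have the two-column long format from your expected outcome, the same parsing loop can emit one row per line instead. A minimal sketch reusing the soup object above (the column names are taken from your example):
rows = []
for td in soup.select("td"):
    # Same split as before: first token is the title, the rest are the lines
    header, *data = td.get_text(strip=True, separator="|").split("|")
    for value in data:
        rows.append({"Important Header": header, "Some Information Header": value})

print(pd.DataFrame(rows))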
CodePudding user response:
It's hard to tell exactly what would work best for you without an example of how you plan to scrape those elements, but assuming you're starting from scratch, I would suggest getting an element and then getting that element's children.
It will probably need error handling to be robust. Many people prefer CSS selectors as identifiers, but I personally like XPaths.
It might look like:
elements_you_want = driver.find_elements_by_xpath('xpath to parent')
for element in elements_you_want:
    for child in element.find_elements_by_xpath('./*'):  # direct children only
        # do something with child
        pass
Some logic will need to select each of the parent elements, but that will really depend on the particular page you want to scrape.
This is shown in greater detail in this Stack Overflow post: Get all child elements
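Applied to a table like yours, that pattern might look like the sketch below. This assumes a Selenium 3 style driver (the newer API spells these calls find_elements(By.XPATH, ...)), and the URL is a placeholder:
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://example.com/your-page")  # placeholder URL

# Select each parent cell, then walk its direct children.
for parent in driver.find_elements_by_xpath('//td'):
    title = parent.find_element_by_xpath('./strong').text.strip()
    for child in parent.find_elements_by_xpath('./*'):
        # do something with each child element
        print(title, child.tag_name)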
CodePudding user response:
First, import the packages and set up the browser environment:
# >> Get ready: import the packages you are using
import os
from selenium import webdriver

# >> Set up the Chrome browser
chromedriver = r"C:\Program Files\Python39\Scripts\chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
Code snippet for scraping:
# - Programming: scraping
element_list = driver.find_elements_by_tag_name('td')
_i_ = 0
Data = [[]]
for _item_ in element_list:
    _i_ += 1
    Title = _item_.find_element_by_xpath('//td[' + str(_i_) + ']/strong').text.strip()
    Data.append([_i_, Title])
    for _element_ in _item_.find_elements_by_xpath('//td[' + str(_i_) + ']/br'):
        Value = _element_.text.strip()
        # Data[0] is the initial empty list, so Data[_i_] is this cell's row;
        # append keeps each line as one string (extend would split it into characters)
        Data[_i_].append(Value)

# - Show results:
print('- Data[1] = ', Data[1])
print('- Data[2] = ', Data[2])
print('- Data[3] = ', Data[3])
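One caveat: .text on a <br> element is usually empty, because the visible strings are text nodes that follow each <br> rather than children of it. If the loop above comes back empty, a sketch that splits each cell's rendered text instead (reusing the same driver; the variable names are illustrative):
# Alternative: split each cell's visible text on line breaks.
rows = []
for cell in driver.find_elements_by_tag_name('td'):
    lines = [line.strip() for line in cell.text.split('\n') if line.strip()]
    title, values = lines[0], lines[1:]  # the <strong> line renders first
    for value in values:
        rows.append([title, value])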
Update: code to export CSV:
import csv
def pad(data):
    """Pad every column with empty strings so all columns have equal length."""
    max_n = max(len(x) for x in data.values())
    for field in data:
        data[field] = data[field] + [''] * (max_n - len(data[field]))
    return data

def merge_dicts(*dict_args):
    """
    Given any number of dictionaries, shallow copy and merge into a new dict,
    precedence goes to key-value pairs in latter dictionaries.
    """
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result
Data_1 = Data[1]
Data_2 = Data[2]
Data_3 = Data[3]
sdata_1 = {"Data_1":Data_1, "Data_2":Data_2}
sdata_2 = { "Data_3":Data_3}
data = merge_dicts(sdata_1, sdata_2)
print(data)
import pandas as pd
df = pd.DataFrame(pad(data))
df.to_csv("output.csv", index=False)
print('>> Finish export to CSV')