I'm new to bs4 and am trying to scrape data from a table. Here is an HTML example:
<div >
<table class="ids_table"><tbody>
<tr>
<td ><b>First string</b></td>
<td ><b>Second string</b></td>
<td ><b>Third string</b></td>
<td ><b>Fourth string</b></td>
</tr>
<tr>
<td >d</td>
<td > <b>LLLM2001</b></td>
<td > <font color="#00875a"><b>12-July-2022</b></font> </td>
<td > </td>
</tr>
<tr>
<td >e</td>
<td > <b>MLLL0056</b></td>
<td > <font color="#00875a"><b>11-June-2022</b></font></td>
<td > </td>
</tr>
<tr>
<td >f</td>
<td > <del>AMMK0001</del><br>
<font color="#00875a"><b>MMKA0001</b></font></td>
<td ><font color="#00875a"> <b>12 July 2022</b></font></td>
<td > </td>
</tr>
<tr>
<td >i</td>
<td > <font color="#00875a"><b>ANJK1111</b></font></td>
<td > <font color="#00875a"><b>11-June-2022</b></font></td>
<td > </td>
</tr>
<tr>
<td >j</td>
<td > <font color="#00875a"><b>YMLC3939</b></font></td>
<td > <font color="#00875a"><b>11-June-2022</b></font></td>
<td > </td>
</tr>
</tbody></table>
</div>
I want to:
- Scrape all font values from the table.
- Always scrape "First string"..."Fourth string" from the table header (they are also in td, but always have the same position and values).
- Ignore del in td (not necessary).
- Leave blank only the IDs that are not in font (by IDs I mean LLLM2001, MLLL0056, etc.).
Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table", {"class": "ids_table"})
data = [[x for x in v.find_all('td')[1:-1]] for v in table]
data = [[x.text.strip() if x.find('font') else '' for x in c] for c in data]
This gives the following data:
[['',
'',
'',
'',
'',
'12-July-2022',
'',
'',
'',
'11-June-2022',
'',
'',
'AMMK0001\n \xa0MMKA0001',
'12 July 2022',
'',
'',
'ANJK1111',
'11-June-2022',
'',
'',
'YMLC3939',
'11-June-2022']]
As a result I want to get:
[['First string',
'Second string',
'Third string',
'Fourth string',
'd',
'',
'12-July-2022',
'e',
'',
'11-June-2022',
'f',
'MMKA0001',
'12 July 2022',
'i',
'ANJK1111',
'11-June-2022',
'j',
'YMLC3939',
'11-June-2022']]
Thank you in advance
Answer:
I don't really understand your 4th point, nonetheless:
from bs4 import BeautifulSoup
import pandas as pd
html = '''<div >
<table class="ids_table"><tbody>
<tr>
<td ><b>First string</b></td>
<td ><b>Second string</b></td>
<td ><b>Third string</b></td>
<td ><b>Fourth string</b></td>
</tr>
<tr>
<td >d</td>
<td > <b>LLLM2001</b></td>
<td > <font color="#00875a"><b>12-July-2022</b></font> </td>
<td > </td>
</tr>
<tr>
<td >e</td>
<td > <b>MLLL0056</b></td>
<td > <font color="#00875a"><b>11-June-2022</b></font></td>
<td > </td>
</tr>
<tr>
<td >f</td>
<td > <del>AMMK0001</del><br> <font color="#00875a"><b>MMKA0001</b></font></td>
<td ><font color="#00875a"> <b>12 July 2022</b></font></td>
<td > </td>
</tr>
<tr>
<td >i</td>
<td > <font color="#00875a"><b>ANJK1111</b></font></td>
<td > <font color="#00875a"><b>11-June-2022</b></font></td>
<td > </td>
</tr>
<tr>
<td >j</td>
<td > <font color="#00875a"><b>YMLC3939</b></font></td>
<td > <font color="#00875a"><b>11-June-2022</b></font></td>
<td > </td>
</tr>
</tbody></table>
</div>
'''
from io import StringIO  # recent pandas versions deprecate passing a raw HTML string to read_html
soup = BeautifulSoup(html, "html.parser")
# remove every <del> tag, not just the first one
for deleted in soup.select('del'):
    deleted.decompose()
# remove the values in the 'Second string' column that are not wrapped in a <font> tag;
# the first four <b> tags belong to the header row, so skip them
bolds = soup.select('b')
for b in bolds[4:]:
    if b.parent.name == 'td':
        b.decompose()
table = soup.select_one("table.ids_table")
# this is how you get the first row (the headers)
table_headers = [x.text.strip() for x in table.select('tr')[0].select('td')]
# this is how you get the fonts
fonts = [x.text.strip() for x in table.select('font')]
# this is how you display the data in an intelligible way
df = pd.read_html(StringIO(str(soup)))[0]
new_header = df.iloc[0]  # promote the first row to column names
df = df[1:]
df.columns = new_header
print(df)
This returns:
   First string  Second string  Third string  Fourth string
1             d            NaN  12-July-2022            NaN
2             e            NaN  11-June-2022            NaN
3             f       MMKA0001  12 July 2022            NaN
4             i       ANJK1111  11-June-2022            NaN
5             j       YMLC3939  11-June-2022            NaN
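If you would rather have the flat list from the question than a DataFrame, something along these lines might work with bs4 alone. This is only a sketch: `scrape_flat` and `font_text` are names I made up, the sample is a shortened copy of your table, and I'm assuming the table really has class="ids_table" and that the ID and date are always the second and third cells.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question; class="ids_table" is assumed.
SAMPLE_HTML = """
<table class="ids_table"><tbody>
<tr>
<td><b>First string</b></td>
<td><b>Second string</b></td>
<td><b>Third string</b></td>
<td><b>Fourth string</b></td>
</tr>
<tr>
<td>d</td>
<td> <b>LLLM2001</b></td>
<td> <font color="#00875a"><b>12-July-2022</b></font> </td>
<td> </td>
</tr>
<tr>
<td>f</td>
<td> <del>AMMK0001</del><br>
<font color="#00875a"><b>MMKA0001</b></font></td>
<td><font color="#00875a"> <b>12 July 2022</b></font></td>
<td> </td>
</tr>
</tbody></table>
"""


def font_text(cell):
    """Return the text of the first <font> in a cell, or '' if there is none."""
    font = cell.find("font")
    return font.get_text(strip=True) if font else ""


def scrape_flat(html):
    """Hypothetical helper: flatten the table into the list shape asked for."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="ids_table")
    rows = table.find_all("tr")
    # Header row: keep the text of every cell.
    flat = [td.get_text(strip=True) for td in rows[0].find_all("td")]
    for row in rows[1:]:
        cells = row.find_all("td")
        # The first cell is a plain label; the next two count only when inside <font>.
        flat.append(cells[0].get_text(strip=True))
        flat.extend(font_text(cell) for cell in cells[1:3])
    return flat


result = scrape_flat(SAMPLE_HTML)
print(result)
```

Because only the text inside font is read, the del value and the IDs that are not in font drop out without modifying the soup. Wrap the result in another list if you need the nested [[...]] shape.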