bs4 scraping data from HTML page

Time:07-19

I'm new to bs4 and trying to scrape data from a table. Here is an HTML example:

<div >
<table class="ids_table"><tbody>
<tr>
<td ><b>First string</b></td>
<td ><b>Second string</b></td>
<td ><b>Third string</b></td>
<td ><b>Fourth string</b></td>
</tr>
<tr>
<td >d</td>
<td >&nbsp;<b>LLLM2001</b></td>
<td >&nbsp;<font color="#00875a"><b>12-July-2022</b></font>&nbsp;</td>
<td >&nbsp;</td>
</tr>
<tr>
<td >e</td>
<td >&nbsp;<b>MLLL0056</b></td>
<td >&nbsp;<font color="#00875a"><b>11-June-2022</b></font></td>
<td >&nbsp;</td>
</tr>
<tr>
<td >f</td>
<td >&nbsp;<del>AMMK0001</del><br>
 &nbsp;<font color="#00875a"><b>MMKA0001</b></font></td>
<td ><font color="#00875a">&nbsp;<b>12 July 2022</b></font></td>
<td >&nbsp;</td>
</tr>
<tr>
<td >i</td>
<td >&nbsp;<font color="#00875a"><b>ANJK1111</b></font></td>
<td >&nbsp;<font color="#00875a"><b>11-June-2022</b></font></td>
<td >&nbsp;</td>
</tr>
<tr>
<td >j</td>
<td >&nbsp;<font color="#00875a"><b>YMLC3939</b></font></td>
<td >&nbsp;<font color="#00875a"><b>11-June-2022</b></font></td>
<td >&nbsp;</td>
</tr>
</tbody></table>
</div>

I want to:

  1. Scrape all font values from the table.
  2. Always scrape "First string"..."Fourth string" from the table header (they are also in td cells, but always have the same position and values).
  3. Ignore del in td (not strictly necessary).
  4. Leave the ID blank when it is not inside a font tag (by IDs I mean LLLM2001, MLLL0056, etc.).

Code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser") 
table = soup.find_all("table", {"class": "ids_table"})
data = [[x for x in v.find_all('td')[1:-1]] for v in table]
data = [[x.text.strip() if x.find('font') else '' for x in c] for c in data]

data:

[['',
  '',
  '',
  '',
  '',
  '12-July-2022',
  '',
  '',
  '',
  '11-June-2022',
  '',
  '',
  'AMMK0001\n \xa0MMKA0001',
  '12 July 2022',
  '',
  '',
  'ANJK1111',
  '11-June-2022',
  '',
  '',
  'YMLC3939',
  '11-June-2022']]

As a result I want to get:

[['First string',
  'Second string',
  'Third string',
  'Fourth string',
  'd',
  '',
  '12-July-2022',
  'e',
  '',
  '11-June-2022',
  'f',
  'MMKA0001',
  '12 July 2022',
  'i',
  'ANJK1111',
  '11-June-2022',
  'j',
  'YMLC3939',
  '11-June-2022']]

Thank you in advance

CodePudding user response:

I don't really understand your 4th point, nonetheless:

from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

html = '''<div >
<table class="ids_table"><tbody>
<tr>
<td ><b>First string</b></td>
<td ><b>Second string</b></td>
<td ><b>Third string</b></td>
<td ><b>Fourth string</b></td>
</tr>
<tr>
<td >d</td>
<td >&nbsp;<b>LLLM2001</b></td>
<td >&nbsp;<font color="#00875a"><b>12-July-2022</b></font>&nbsp;</td>
<td >&nbsp;</td>
</tr>
<tr>
<td >e</td>
<td >&nbsp;<b>MLLL0056</b></td>
<td >&nbsp;<font color="#00875a"><b>11-June-2022</b></font></td>
<td >&nbsp;</td>
</tr>
<tr>
<td >f</td>
<td >&nbsp;<del>AMMK0001</del><br>&nbsp;<font color="#00875a"><b>MMKA0001</b></font></td>
<td ><font color="#00875a">&nbsp;<b>12 July 2022</b></font></td>
<td >&nbsp;</td>
</tr>
<tr>
<td >i</td>
<td >&nbsp;<font color="#00875a"><b>ANJK1111</b></font></td>
<td >&nbsp;<font color="#00875a"><b>11-June-2022</b></font></td>
<td >&nbsp;</td>
</tr>
<tr>
<td >j</td>
<td >&nbsp;<font color="#00875a"><b>YMLC3939</b></font></td>
<td >&nbsp;<font color="#00875a"><b>11-June-2022</b></font></td>
<td >&nbsp;</td>
</tr>
</tbody></table>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# remove every <del> tag together with its text
for del_tag in soup.select('del'):
    del_tag.decompose()
# remove the values from the 'Second string' column which are not wrapped in a <font> tag;
# the first four <b> tags are the header cells, so they are skipped
bold_tags = soup.select('b')
for x in bold_tags[4:]:
    if x.parent.name == 'td':
        x.decompose()
# this is how you get the first row (the headers)
table = soup.select_one("table.ids_table")
table_headers = [x.text.strip() for x in table.select('tr')[0].select('td')]
# this is how you get the text of every <font> tag
fonts = [x.text.strip() for x in table.select('font')]
# this is how you display the data in an intelligible way;
# recent pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(str(soup)))[0]
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
df

This returns:

    First string   Second string   Third string   Fourth string
1   d              NaN             12-July-2022   NaN
2   e              NaN             11-June-2022   NaN
3   f              MMKA0001        12 July 2022   NaN
4   i              ANJK1111        11-June-2022   NaN
5   j              YMLC3939        11-June-2022   NaN
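
If you want the flat list from the question rather than a DataFrame, one possible reading of the 4th point is "keep a middle cell's text only when it sits inside a font tag, otherwise store an empty string". Below is a minimal sketch under that assumption. The helper name scrape_rows and the shortened three-row HTML are made up for illustration, and the table is assumed to carry class="ids_table" as in the question's code.

```python
from bs4 import BeautifulSoup

# Shortened sample (header plus rows d and f), assumed to mirror the question's table.
html = """
<table class="ids_table"><tbody>
<tr><td><b>First string</b></td><td><b>Second string</b></td><td><b>Third string</b></td><td><b>Fourth string</b></td></tr>
<tr><td>d</td><td>&nbsp;<b>LLLM2001</b></td><td>&nbsp;<font color="#00875a"><b>12-July-2022</b></font>&nbsp;</td><td>&nbsp;</td></tr>
<tr><td>f</td><td>&nbsp;<del>AMMK0001</del><br>&nbsp;<font color="#00875a"><b>MMKA0001</b></font></td><td><font color="#00875a">&nbsp;<b>12 July 2022</b></font></td><td>&nbsp;</td></tr>
</tbody></table>
"""


def scrape_rows(html):
    """Hypothetical helper: return header texts followed by label/ID/date per row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="ids_table")
    rows = table.find_all("tr")

    # Header row: take the text of every cell as it is.
    result = [td.get_text(strip=True) for td in rows[0].find_all("td")]

    for row in rows[1:]:
        cells = row.find_all("td")
        # The first cell is the row label (d, e, f, ...).
        result.append(cells[0].get_text(strip=True))
        # Middle cells: keep only text inside <font>, otherwise leave blank.
        # Text in <del> sits outside <font>, so it is skipped automatically.
        for cell in cells[1:-1]:
            font = cell.find("font")
            result.append(font.get_text(strip=True) if font else "")
    return result


flat = scrape_rows(html)
print(flat)
# ['First string', 'Second string', 'Third string', 'Fourth string',
#  'd', '', '12-July-2022', 'f', 'MMKA0001', '12 July 2022']
```

Wrap the result in another list ([flat]) if you need the nested shape shown in the question. Nothing is removed from the soup here, so the parsed document stays unchanged.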