Extracting data with Beautiful soup when div doesn't exist

Time:01-05

I'm trying to extract table data from a couple thousand HTML files of saved site data, but the tables aren't wrapped in divs that would make this easy, and I'm pretty new to Beautiful Soup. Right now I'm manually converting each HTML file to CSV, editing it, and loading it into my database to create tables, but I'd rather just scrape the files I already have.

<body style="margin-top:140px;">    
<div id="container">
 <!-- Left div -->
 <div>
  &nbsp;
 </div>
 <!-- Center div -->
 <div>
  <!-- Image Link -->
  <a href="http://www.website.com"><img src="http://website.com/wp-content/uploads/2016/12/Blue-Transparent.png" style = "max-width:100%; max-height:120px;" alt="Center Banner"></a>
 </div>
 <!-- Right div -->
 <div>
  &nbsp;
 </div>
</div>
<A Name = "Top"></A>
<H1>5k Run</H1>
<H1>Overall Finish List</H1>
<H2>September 24, 2022</H2>
<HR noshade>
<B><I> </I></B>
<HR noshade>
<table border=0 cellpadding=0 cellspacing=0 >
  <tr>
    <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td>
  </tr>
  <tr>
    <td class=h11>Place</td>
    <td class=h12>Name</td>
    <td class=h12>City</td>
    <td class=h11>Bib No</td>
    <td class=h11>Age</td>
    <td class=h11>Gender</td>
    <td class=h11>Age Group</td>
    <td class=h11>Total Time</td>
    <td class=h11>Pace</td>
  </tr>
  <tr>
    <td class=d01>1</td>
    <td class=d02>Runner 1</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>390</td>
    <td class=d01>52</td>
    <td class=d01>M</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   18:43.93</td>
    <td class=d01>6:03/M</td>
  </tr>
  <tr>
    <td class=d01>2</td>
    <td class=d02>Runner 2</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>380</td>
    <td class=d01>33</td>
    <td class=d01>M</td>
    <td class=d01>1:19-39</td>
    <td class=d01>   19:31.27</td>
    <td class=d01>6:18/M</td>
  </tr>
  <tr>
    <td class=d01>3</td>
    <td class=d02>Runner 3</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>389</td>
    <td class=d01>65</td>
    <td class=d01>F</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   45:45.20</td>
    <td class=d01>14:46/M</td>
  </tr>
  <tr>
    <td class=d01>4</td>
    <td class=d02>Runner 4</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>381</td>
    <td class=d01>18</td>
    <td class=d01>F</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   53:28.84</td>
    <td class=d01>17:15/M</td>
  </tr>
  <tr>
    <td class=d01>5</td>
    <td class=d02>Runner 5</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>382</td>
    <td class=d01>41</td>
    <td class=d01>F</td>
    <td class=d01>1:40-59</td>
    <td class=d01>   53:30.48</td>
    <td class=d01>17:16/M</td>
  </tr>
  <tr>
    <td class=d01>6</td>
    <td class=d02>Runner 6</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>384</td>
    <td class=d01>14</td>
    <td class=d01>M</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   57:38.66</td>
    <td class=d01>18:36/M</td>
  </tr>
  <tr>
    <td class=d01>7</td>
    <td class=d02>Runner 7</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>385</td>
    <td class=d01>72</td>
    <td class=d01>F</td>
    <td class=d01>1:60-99</td>
    <td class=d01>   57:40.11</td>
    <td class=d01>18:36/M</td>
  </tr>
</table>
 
<HR noshade>
<p>
<!-- 0c17  22.0 2e9 -->
</BODY>
</HTML>

I've tried adding divs, but haven't had much success.

CodePudding user response:

For this problem, pandas will probably be more helpful than scraping by hand (assuming you are only interested in parsing a table out of a web page).

Here's some simple code to solve your problem:

from io import StringIO

import pandas as pd  # pip install pandas (read_html also needs lxml, or bs4 with html5lib)

# Full HTML page
html = """
<body style="margin-top:140px;">    
<div id="container">
 <!-- Left div -->
......
</table>
<HR noshade>
<p>
<!-- 0c17  22.0 2e9 -->
</BODY>
</HTML>
"""

# Wrap the string in StringIO: recent pandas versions deprecate passing literal HTML
df = pd.read_html(StringIO(html))[0]  # [0] takes the first table in the HTML

# The first row is the "1st Alarm 5k" title row, not data
df = df.iloc[1:, :]  # Remove the unnecessary title row
new_header = df.iloc[0]  # Grab the next row (the column names) for the header
df = df[1:]  # Take the data below the header row
df.columns = new_header  # Set the header row as the df header

print(df)

# Save this dataframe to .csv
df.to_csv("[].csv", index=False)  # The table already has its own index column, "Place"

For more information, see the pandas documentation.

Some helpful links for the code above:

  1. pandas.read_html()
  2. pandas.Series.iloc[]
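
Since the question mentions a couple thousand files, here is a rough sketch of how the same steps might be applied to a whole folder. It assumes every file has the same layout as the sample (one title row, then one row of column names) and that the files end in .html; the function names and the extra source_file column are made up for illustration.

```python
from io import StringIO
from pathlib import Path

import pandas as pd  # pip install pandas lxml


def results_table(html: str) -> pd.DataFrame:
    """Return the first table in `html` with the column-name row as the header."""
    df = pd.read_html(StringIO(html))[0]
    df = df.iloc[1:]                      # drop the title row ("1st Alarm 5k")
    df.columns = df.iloc[0].tolist()      # "Place", "Name", ... become the column names
    return df.iloc[1:].reset_index(drop=True)


def collect_results(folder: str) -> pd.DataFrame:
    """Parse every *.html file in `folder` and stack the tables into one frame."""
    frames = []
    for path in sorted(Path(folder).glob("*.html")):
        df = results_table(path.read_text(encoding="utf-8", errors="replace"))
        df["source_file"] = path.name     # hypothetical column: which file each row came from
        frames.append(df)
    # pd.concat raises ValueError when the folder holds no matching files
    return pd.concat(frames, ignore_index=True)
```

Calling collect_results("some_folder").to_csv("all_results.csv", index=False) would then give a single CSV for the database import. Files whose layout differs from the sample would need their own handling.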

CodePudding user response:

BeautifulSoup lets you search for elements other than div: any tag name, attribute, or CSS class can be used to locate them.
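
As a quick illustration, here are three ways to locate elements in a cut-down copy of the table from the question (only two columns are kept, to keep the sketch short). The class names h12 and d02 come from the posted HTML; the variable names are just for the example.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<table border=0 cellpadding=0 cellspacing=0>
  <tr><td class=h11>Place</td><td class=h12>Name</td></tr>
  <tr><td class=d01>1</td><td class=d02>Runner 1</td></tr>
  <tr><td class=d01>2</td><td class=d02>Runner 2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")                          # by tag name alone
header_cells = table.find_all("td", class_="h12")   # by tag name plus CSS class
name_cells = soup.select("table td.d02")            # by CSS selector

names = [cell.get_text(strip=True) for cell in name_cells]
print(names)  # ['Runner 1', 'Runner 2']
```

In the full table the d02 class is shared by the Name and City cells, so selecting by class alone would return both; going row by row is usually easier there.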

Say you want to retrieve the runners from the HTML you showed; you could do something like this:

from bs4 import BeautifulSoup

file_path = 'scrap.html'

# Simulate the response of an HTTP request by reading a saved .html file
with open(file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table')  # The results table has no class or id, so match on the tag name alone
rows_table = table.find_all('tr')[1:]  # All rows except the first one (the "1st Alarm 5k" title row)

columns_name = [
    cell.get_text().strip() for cell in rows_table[0].find_all('td')
]  # The name of each column, in order

runners = []
for row in rows_table[1:]:  # Loop over every row except the first, which holds the column names
    data = [
        elem.get_text().strip() for elem in row.find_all('td')
    ]
    runner = {
        "place": data[columns_name.index("Place")],
        "name": data[columns_name.index("Name")],
        "city": data[columns_name.index("City")],
        "bib_no": data[columns_name.index("Bib No")],
        "age": data[columns_name.index("Age")],
        "gender": data[columns_name.index("Gender")],
        "age_group": data[columns_name.index("Age Group")],
        "total_time": data[columns_name.index("Total Time")],
        "pace": data[columns_name.index("Pace")]
    }
    print(runner)
    runners.append(runner)

The output of the print calls looks like this:

{'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN  PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
{'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN  PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
{'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN  PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
{'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN  PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
{'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN  PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
{'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN  PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
{'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN  PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}
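
Since the end goal in the question is CSV for a database, the list of runner dictionaries could be written out with the standard library's csv module. This is only a sketch: the function name and the results.csv file name are placeholders, and the two sample rows are copied from the output above.

```python
import csv

# Column order for the CSV; these keys match the runner dictionaries above
FIELDS = ["place", "name", "city", "bib_no", "age", "gender",
          "age_group", "total_time", "pace"]


def write_runners_csv(runners, csv_path):
    """Write a list of runner dicts to csv_path, preceded by one header row."""
    # newline="" lets the csv module control line endings on every platform
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(runners)


if __name__ == "__main__":
    sample = [
        {"place": "1", "name": "Runner 1", "city": "ANYTOWN  PA", "bib_no": "390",
         "age": "52", "gender": "M", "age_group": "1:Overall",
         "total_time": "18:43.93", "pace": "6:03/M"},
        {"place": "2", "name": "Runner 2", "city": "ANYTOWN  PA", "bib_no": "380",
         "age": "33", "gender": "M", "age_group": "1:19-39",
         "total_time": "19:31.27", "pace": "6:18/M"},
    ]
    write_runners_csv(sample, "results.csv")  # placeholder file name
```

DictWriter raises ValueError if a dictionary contains a key that is not listed in FIELDS, which helps catch a file whose columns differ from the sample.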