Trouble scraping data rows - beautifulSoup

Time:08-22

I'm a beginner working with Python and Beautiful Soup, attempting to scrape election results from a state elections page. I learned the basics from the book 'Learning to Code with Baseball', including the fifth chapter, which covers scraping.

I am working on scraping one table from the site, which looks like this:

Candidate Total Votes Pct
Abraham Lincoln 53990 42.1%
George Washington 37326 29.1%

After using BeautifulSoup to read the entire page and identify the tables, I was able to isolate this table from the others and identify the header row using:

gov_table = tables[3]
rows = gov_table.find_all('tr')
header_row = rows[0]

The trouble I ran into was with the data rows. I cannot seem to pick up the candidates' names, only their 'total votes' and 'pct'.

I try:

first_data_row = rows[1]
first_data_row.find_all('td')

which gives the HTML:

[<td  data-title="Candidate" scope="row">ABRAHAM LINCOLN <span >(DEM)</span> </td>,
 <td  width="25%">
 <ul >
 <li>Polling place: 51771</li>
 <li>Mail ballots: 2219</li>
 </ul>
 </td>,
 <td  data-title="Total votes">53990</td>,
 <td  data-title="Pct">42.1%</td>]

I then run a comprehension over all the td tags to collect their contents in a list, which I will use as a row of a DataFrame. But I cannot seem to pick up the candidate's name:

In [82]: [str(x.string) for x in first_data_row.find_all('td')]
Out[82]: ['None', 'None', '53990', '42.1%']

I'm really stumped by the 'None' strings, as they don't appear anywhere in the table rows themselves. I have tried narrowing it down further using

In [83]: [str(x.string) for x in first_data_row.find_all('td', {'scope': 'row'})]
Out[83]: ['None']

or

In [87]: first_candidate_name = first_data_row.find_all('td')[0]
    ...: first_candidate_name
    ...: str(first_candidate_name.string)
Out[87]: 'None'

With similar results.

I am sure I am missing something relatively minor, but my beginner's eyes can't spot it. Please help!!

CodePudding user response:

You're using .string to access the content of each cell, but some of these cells have more than one child node (text plus a nested <span> or <ul>). In that case .string returns None, and str() then turns it into the string 'None'.

.get_text(), on the other hand, returns the strings of all the children concatenated into one string:

> [str(x.string) for x in first_data_row.find_all('td')]
> ['None', 'None', '53990', '42.1%']
> [str(x.get_text()) for x in first_data_row.find_all('td')]
> ['ABRAHAM LINCOLN (DEM) ', '\n\nPolling\xa0place:\xa051771\nMail\xa0ballots:\xa02219\n\n', '53990', '42.1%']
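
If the stray whitespace in that output is a nuisance, one possible cleanup is sketched below. It is only a minimal example: the html string is a hand-copied stand-in for the row shown in the question (not the live page), and the variable names are made up for illustration. It uses get_text with a separator and strip=True, plus stripped_strings to pull out just the candidate's name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the row in the question, copied by hand.
html = """
<table><tr>
<td data-title="Candidate" scope="row">ABRAHAM LINCOLN <span>(DEM)</span> </td>
<td width="25%">
<ul>
<li>Polling place: 51771</li>
<li>Mail ballots: 2219</li>
</ul>
</td>
<td data-title="Total votes">53990</td>
<td data-title="Pct">42.1%</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find("tr").find_all("td")

# .string is None whenever a tag has more than one child node.
strings = [cell.string for cell in cells]
print(strings)  # [None, None, '53990', '42.1%']

# get_text(" ", strip=True) strips each nested string, drops the empty ones,
# and joins the rest with a single space.
texts = [cell.get_text(" ", strip=True) for cell in cells]
print(texts)

# The name alone is the first non-empty string inside the first cell.
name = next(cells[0].stripped_strings)
print(name)  # ABRAHAM LINCOLN
```

Each texts list can then serve as one row when building the DataFrame; whether to keep the party label or the polling place breakdown is up to you.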
