Having trouble extracting inner text of <td> tags in beautifulsoup

Time:02-18

I am using bs4 to scrape a website with a list of years.

years = soup.find_all('td', class_='EndCellSpacer')

which returns an array of matching tags:

[<td >
                        2014
                </td>, <td >
                        2015
                </td>, <td >
                        2016
                </td>, <td >
                        2017
                </td>, <td >
                        2018
                </td>, <td >
                        2019
                </td>, <td >
                        2020
                </td>, <td >
                        2021
                </td>]

I want the result to contain only the years, without the <td> tags. I have tried

years = soup.find_all('td', class_='EndCellSpacer').text.strip()

but I am getting this error message:

"ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?"

If I call find(), it only returns the year from the first <td> tag, and I need all of them.

This might have something to do with the values being in an array, but I can't figure it out. I would greatly appreciate any help; this is my first time working in Python :/

User response:

If you look at the result of

soup.find_all('td', class_='EndCellSpacer')

it is a ResultSet, which behaves like a list of tags and has no .text attribute of its own. You need to iterate over it and get the text of each td tag:

out = [td.get_text().strip() for td in soup.find_all('td', class_='EndCellSpacer')]

Output:

['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']
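
For anyone who wants to try this without the original page, here is a minimal self-contained sketch. The HTML string is made up to resemble the cells in the question (the real page's markup is an assumption), and it uses get_text(strip=True), which should give the same result as calling .strip() on the text afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page; the real
# page's structure is not shown in the question.
html = """
<table><tr>
  <td class="EndCellSpacer">
          2014
  </td>
  <td class="EndCellSpacer">
          2015
  </td>
  <td class="Other">ignored</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (list-like), so handle each tag in turn.
cells = soup.find_all("td", class_="EndCellSpacer")

# strip=True trims the surrounding whitespace and newlines of each cell.
years = [td.get_text(strip=True) for td in cells]
print(years)  # ['2014', '2015']

# Optional: convert to integers if the years are needed as numbers.
year_numbers = [int(y) for y in years]
```

If a cell might contain something other than a year, the int() conversion would raise a ValueError, so it may be worth filtering with str.isdigit() first.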