Scrape table header from investing.com

Time:11-18

I'm trying to scrape data from investing.com. My code works except for the table header. The elements in my "columns" variable carry the names as an attribute, data-col-name="abc", but I don't know how to extract those values into column_names.


import pandas as pd

# soup is assumed to be a BeautifulSoup object built from the page's HTML
table_rows = soup.find("tbody").find_all("tr")

table = []
for i in table_rows:
    td = i.find_all("td")
    row = [cell.string for cell in td]
    table.append(row)
    
columns = soup.find("thead").find_all("th")
column_names = ...  # missing piece: how do I get the data-col-name values here?

df_temp = pd.DataFrame(data=table, columns=column_names)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_dji = pd.concat([df_dji, df_temp])

CodePudding user response:

You have to use .text instead of .string:

columns = soup.find("thead").find_all("th")
#print(columns)

column_names = [cell.text for cell in columns]
print(column_names)

Alternatively, use .get_text(), or .get_text(strip=True) to also trim the surrounding whitespace:

column_names = [cell.get_text() for cell in columns]
print(column_names)

The official documentation describes .string and get_text(); .text is a property that returns the same result as get_text(). Here .string doesn't work, probably because there is another element (a <span>) inside the <th>: .string returns None when a tag contains more than one child. get_text() collects the strings from all elements inside the <th> and joins them into one string.
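To make the difference concrete, here is a minimal sketch. The HTML is a made-up fragment modeled on the description in the question, not copied from investing.com, so the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical header markup (an assumption, not the site's real HTML).
html = (
    '<table><thead><tr>'
    '<th data-col-name="date">Date <span class="sort"></span></th>'
    '<th data-col-name="price"><span>Price</span></th>'
    '</tr></thead></table>'
)

soup = BeautifulSoup(html, "html.parser")
columns = soup.find("thead").find_all("th")

# .string is None for the first <th> because it has two children
# (a text node and a <span>); the second <th> has a single child.
strings = [cell.string for cell in columns]

# .text / get_text() join every string found inside the tag.
texts = [cell.text for cell in columns]

# strip=True trims each piece of text before joining.
stripped = [cell.get_text(strip=True) for cell in columns]

print(strings)   # [None, 'Price']
print(texts)     # ['Date ', 'Price']
print(stripped)  # ['Date', 'Price']
```

If the headers on the real page come back with stray whitespace, get_text(strip=True) is probably the variant to prefer.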


EDIT:

If you want to get the value from data-col-name=, then use one of:

  • cell['data-col-name']
  • cell.get('data-col-name')
  • cell.attrs['data-col-name']
  • cell.attrs.get('data-col-name')

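(The same works for other attributes, such as cell['id'] or cell['class']. Note that cell['class'] returns a list, because class can hold several values.)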

column_names = [cell['data-col-name'] for cell in columns]

column_names = [cell.get('data-col-name') for cell in columns]

# etc.
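Put together with the code from the question, the whole thing might look like the sketch below. The table markup, the attribute values ("date", "last_close") and the cell contents are invented placeholders for illustration; only the parsing steps follow the question and the answer above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder table (an assumption); the values are not real market data.
html = """
<table>
  <thead>
    <tr>
      <th data-col-name="date"><span>Date</span></th>
      <th data-col-name="last_close"><span>Price</span></th>
    </tr>
  </thead>
  <tbody>
    <tr><td>2021-01-01</td><td>100.0</td></tr>
    <tr><td>2021-01-02</td><td>101.5</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# One list of cell texts per body row, as in the question.
table = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in soup.find("tbody").find_all("tr")
]

# Column names come from the data-col-name attribute of each <th>.
columns = soup.find("thead").find_all("th")
column_names = [cell["data-col-name"] for cell in columns]

df_temp = pd.DataFrame(data=table, columns=column_names)
print(df_temp)
```

All cells are still strings at this point, so converting them to dates or numbers would be a separate step.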

attrs is a normal dictionary, so you can use attrs.get(key, default_value), attrs.keys(), attrs.items(), attrs.values(), or iterate over it with a for-loop like any other dictionary.
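A short sketch of the dictionary behaviour, again on made-up markup. Falling back to the visible header text when data-col-name is missing is just one possible choice, shown here as an assumption about what you might want.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <th> has no data-col-name attribute.
html = (
    '<table><thead><tr>'
    '<th data-col-name="date" class="left sortable">Date</th>'
    '<th>Volume</th>'
    '</tr></thead></table>'
)

soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("th")

# Dictionary-style access to all attributes of a tag.
attr_keys = sorted(first.attrs.keys())
for key, value in first.attrs.items():
    print(key, "=", value)

# Multi-valued attributes such as class come back as a list.
classes = first["class"]

# .get() returns None (or a default) when the attribute is absent,
# whereas cell['data-col-name'] would raise KeyError.
missing = second.get("data-col-name")

# Use the attribute when present, otherwise the visible header text.
names = [
    cell.attrs.get("data-col-name", cell.get_text(strip=True))
    for cell in (first, second)
]
print(names)  # ['date', 'Volume']
```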
