I'm trying to scrape data from investing.com. My code works except for the table header. The elements in my "columns" variable carry the names as an attribute, data-col-name="abc", but I don't know how to extract them into column_names.
table_rows = soup.find("tbody").find_all("tr")
table = []
for i in table_rows:
    td = i.find_all("td")
    row = [cell.string for cell in td]
    table.append(row)
columns = soup.find("thead").find_all("th")
column_names =
df_temp = pd.DataFrame(data=table, columns=column_names)
df_dji = df_dji.append(df_temp)
CodePudding user response:
You have to use .text instead of .string
columns = soup.find("thead").find_all("th")
#print(columns)
column_names = [cell.text for cell in columns]
print(column_names)
or use .get_text() or even .get_text(strip=True)
column_names = [cell.get_text() for cell in columns]
print(column_names)
The official documentation shows .string (.text is less prominently documented, but it is a property that returns the same result as get_text()). Here .string doesn't work, probably because there is another element, <span>, inside <th>: .string returns None when a tag has more than one child. get_text() collects the strings from all elements inside th and joins them into one string.
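A minimal sketch of the difference, assuming a made-up header cell (the HTML below is hypothetical, not the real investing.com markup) that contains text plus a nested <span>:

```python
from bs4 import BeautifulSoup

# Hypothetical header cell: plain text followed by a nested <span>
html = "<table><thead><tr><th>Price <span>USD</span></th></tr></thead></table>"
th = BeautifulSoup(html, "html.parser").find("th")

string_value = th.string                  # None: <th> has two children
text_value = th.get_text()                # all strings joined as they appear
stripped = th.get_text(strip=True)        # each string stripped, then joined
spaced = th.get_text(" ", strip=True)     # same, with a space as separator

print(repr(string_value))
print(repr(text_value))
print(repr(stripped))
print(repr(spaced))
```

Note that strip=True strips every piece before joining, so pass a separator such as " " if the pieces should stay apart.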
EDIT:
If you want to get the value from the data-col-name= attribute, then use one of
cell['data-col-name']
cell.get('data-col-name')
cell.attrs['data-col-name']
cell.attrs.get('data-col-name')
(the same works with cell['id'] or cell['class']; note that cell['class'] returns a list, because class can hold several values)
column_names = [cell['data-col-name'] for cell in columns]
column_names = [cell.get('data-col-name') for cell in columns]
# etc.
attrs is a normal dictionary, so you can use attrs.get(key, default_value), attrs.keys(), attrs.items(), attrs.values(), or iterate over it with a for-loop like any other dictionary.
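The attribute access described above could look like this. It is only a sketch: the header HTML is invented for illustration, and the fallback to the cell text for a <th> without data-col-name is an assumption, not something the real page necessarily needs.

```python
from bs4 import BeautifulSoup

# Hypothetical header row; the real page may use other attributes
html = """
<table><thead><tr>
  <th data-col-name="date" class="left first">Date <span></span></th>
  <th data-col-name="price">Price <span></span></th>
  <th>Change</th>
</tr></thead></table>
"""
soup = BeautifulSoup(html, "html.parser")
columns = soup.find("thead").find_all("th")

# cell.get() returns the default instead of raising KeyError
# when the attribute is missing (cell['data-col-name'] would raise)
column_names = [
    cell.get("data-col-name", cell.get_text(strip=True)) for cell in columns
]
print(column_names)

# attrs is a plain dict, so it can be iterated like one
for key, value in columns[0].attrs.items():
    print(key, value)
```

The resulting column_names list can then be passed as columns= to pd.DataFrame, as in the question.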