My question is somewhat similar to this one: How to save out in a new column the url which is reading pandas read_html() function?
I have a set of links that contain tables (4 tables each, of which I need only the first three). The goal is to store the link each table came from in a separate 'address' column.
links = ['www.link1.com', 'www.link2.com', ... , 'www.linkx.com']
details = []
for link in tqdm(links):
    page = requests.get(link)
    sauce = BeautifulSoup(page.content, 'lxml')
    table = sauce.find_all('table')
    # Only first 3 tables include data
    for i in range(3):
        details.append(pd.read_html(str(table))[i])
        final_df = pd.concat(details, ignore_index=True)
        final_df['address'] = link
    time.sleep(2)
However, when I use this code, only the last link is assigned to every row in the 'address' column.
I'm probably missing a detail, but I have spent the last 2 hours trying to figure it out and simply can't make any progress. I would really appreciate some help.
CodePudding user response:
You are close to your goal. Add df['address'] to your DataFrame in each iteration, before appending it to your list:
for i in table[:3]:
    df = pd.read_html(str(i))[0]
    df['address'] = link
    details.append(df)
Note: You could also slice your ResultSet of tables with table[:3], so you do not have to use range.
Move the concatenation outside of your loop and call it once, after your iterations are over:
final_df = pd.concat(details, ignore_index=True)
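To illustrate why the original attempt ended up with only the last link: assigning a scalar to a column broadcasts it to every row, so each pass overwrote the whole 'address' column. The following is only a minimal sketch with made-up frames; the names `frames`, `overwritten` and `tagged`, the 'value' column and the URLs are placeholders, not part of the question's data.

```python
import pandas as pd

# Two stand-in frames, one per hypothetical link
frames = {
    'www.link1.com': pd.DataFrame({'value': [1, 2]}),
    'www.link2.com': pd.DataFrame({'value': [3]}),
}

# Problem: concatenate first, then assign the link afterwards
details = []
for link, df in frames.items():
    details.append(df)
    overwritten = pd.concat(details, ignore_index=True)
    overwritten['address'] = link  # scalar is broadcast to ALL rows

# Fix: tag each frame before it goes into the list, concatenate once
tagged = [df.assign(address=link) for link, df in frames.items()]
final_df = pd.concat(tagged, ignore_index=True)
```

In this sketch `overwritten` carries the last link on every row, while `final_df` keeps each row paired with the link it came from.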
Example
import pandas as pd
links = ['www.link1.com', 'www.link2.com','www.linkx.com']
details = []
for link in links:
    # page = requests.get(link)
    # sauce = BeautifulSoup(page.content, 'lxml')
    # table = sauce.find_all('table')
    table = ['<table><tr><td>table 1</td></tr></table>',
             '<table><tr><td>table 2</td></tr></table>',
             '<table><tr><td>table 3</td></tr></table>']
    # Only first 3 tables include data
    for i in table[:3]:
        df = pd.read_html(str(i))[0]
        df['address'] = link
        details.append(df)
final_df = pd.concat(details, ignore_index=True)
Output
0 | address |
---|---|
table 1 | www.link1.com |
table 2 | www.link1.com |
table 3 | www.link1.com |
table 1 | www.link2.com |
table 2 | www.link2.com |
table 3 | www.link2.com |
table 1 | www.linkx.com |
table 2 | www.linkx.com |
table 3 | www.linkx.com |
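As a possible variation on the example above: pd.read_html already returns one DataFrame per table it finds, so the whole page can be parsed in one call and the resulting list sliced. This is a sketch under assumptions: the helper name `tables_with_address` and its `limit` parameter are hypothetical, the markup is a stand-in for `page.text`, and the string is wrapped in StringIO because recent pandas versions warn when literal HTML is passed directly. It needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd


def tables_with_address(html, link, limit=3):
    """Parse up to `limit` tables from `html` and tag each with `link`."""
    # read_html returns one DataFrame per <table> found in the markup
    frames = pd.read_html(StringIO(html))[:limit]
    return [df.assign(address=link) for df in frames]


# Stand-in markup with four tables; in real use this would be page.text
rows = ''.join(
    f'<table><tr><td>table {n}</td></tr></table>' for n in range(1, 5)
)
html = f'<html><body>{rows}</body></html>'

details = []
for link in ['www.link1.com', 'www.link2.com']:
    details.extend(tables_with_address(html, link))

# Concatenate once, after all links have been processed
final_df = pd.concat(details, ignore_index=True)
```

The fourth table is dropped by the slice, and each remaining row keeps the link it was read from.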