BeautifulSoup and pd.read_html - how to save the links into separate column in the final dataframe?

Time: 03-31

My question is somewhat similar to this one: How to save out in a new column the url which is reading pandas read_html() function?

I have a set of links that contain tables (four tables each, and I need only the first three of them). The goal is to store the link of each table in a separate 'address' column.

import time

import requests
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm

links = ['www.link1.com', 'www.link2.com', ... , 'www.linkx.com']
details = []

for link in tqdm(links):
    page = requests.get(link)
    sauce = BeautifulSoup(page.content, 'lxml')
    table = sauce.find_all('table')

    # Only first 3 tables include data
    for i in range(3):
        details.append(pd.read_html(str(table))[i])
        final_df = pd.concat(details, ignore_index=True)
        final_df['address'] = link
    time.sleep(2)

However, when I use this code, only the last link is assigned to every row of the 'address' column.

I'm probably missing a detail, but I've spent the last two hours trying to figure it out and simply can't make any progress - I would really appreciate some help.

CodePudding user response:

You are close to your goal - add the 'address' column to each DataFrame inside the loop, before appending it to your list:

for i in table[:3]:
    df = pd.read_html(str(i))[0]
    df['address'] = link
    details.append(df)

Note: you can also slice your ResultSet of tables (table[:3]), so you do not have to use range.

Move the concatenation outside of your loop and call it once, after the iterations are over:

final_df = pd.concat(details, ignore_index=True)
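To see why the original code tagged every row with the last link: final_df['address'] = link assigns a scalar, which overwrites the entire column of the already-concatenated frame on every pass. A minimal sketch with toy DataFrames (no scraping, hypothetical data) illustrating the fixed pattern:

```python
import pandas as pd

details = []
for link in ['www.link1.com', 'www.link2.com']:
    df = pd.DataFrame({'value': [1, 2]})   # stand-in for a scraped table
    df['address'] = link                   # tag rows while the frame is per-link
    details.append(df)

# Concatenate once, after the loop; each row keeps the link it was tagged with.
final_df = pd.concat(details, ignore_index=True)
```

Here final_df['address'] holds 'www.link1.com' for the first two rows and 'www.link2.com' for the last two, instead of the last link everywhere.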

Example

import pandas as pd

links = ['www.link1.com', 'www.link2.com','www.linkx.com']
details = []

for link in links:
    # page = requests.get(link)
    # sauce = BeautifulSoup(page.content, 'lxml')
    # table = sauce.find_all('table')
    table = ['<table><tr><td>table 1</td></tr></table>',
             '<table><tr><td>table 2</td></tr></table>',
             '<table><tr><td>table 3</td></tr></table>']
    # Only first 3 tables include data
    for i in table[:3]:
        df = pd.read_html(str(i))[0]
        df['address'] = link
        details.append(df)

final_df = pd.concat(details, ignore_index=True)

Output

         0        address
0  table 1  www.link1.com
1  table 2  www.link1.com
2  table 3  www.link1.com
3  table 1  www.link2.com
4  table 2  www.link2.com
5  table 3  www.link2.com
6  table 1  www.linkx.com
7  table 2  www.linkx.com
8  table 3  www.linkx.com
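One side note, assuming a newer pandas version (2.1+): passing a literal HTML string to pd.read_html is deprecated there, so wrapping the string in io.StringIO avoids the FutureWarning while keeping the same pattern:

```python
import io

import pandas as pd

html = '<table><tr><td>table 1</td></tr></table>'
# Wrap the HTML string in a file-like object for newer pandas versions.
df = pd.read_html(io.StringIO(html))[0]
df['address'] = 'www.link1.com'
```

On older pandas versions, pd.read_html(str(i)) as shown above works without the wrapper.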