How can I convert multiple paginated tables into one Pandas dataframe?

Time:12-29

mendelg helped me scrape a JavaScript-generated table with BeautifulSoup. However, because the table is paginated, the code only converts the 10 rows of the last page into the dataframe, instead of merging all the pages' tables into one dataframe.

The original code is below:

import requests
import pandas as pd
from bs4 import BeautifulSoup


data = {
    "action": "geteCMSList",
    "keyword": "",
    "officeId": "0",
    "contractAwardTo": "",
    "contractStartDtFrom": "",
    "contractStartDtTo": "",
    "contractEndDtFrom": "",
    "contractEndDtTo": "",
    "departmentId": "",
    "tenderId": "",
    "procurementMethod": "",
    "procurementNature": "",
    "contAwrdSearchOpt": "Contains",
    "exCertSearchOpt": "Contains",
    "exCertificateNo": "",
    "tendererId": "",
    "procType": "",
    "statusTab": "eTenders",
    "pageNo": "1",
    "size": "10",
    "workStatus": "All",
}


_columns = [
    "S. No",
    "Ministry, Division, Organization, PE",
    "Procurement Nature, Type & Method",
    "Tender/Proposal ID, Ref No., Title..",
    "Contract Awarded To",
    "Company Unique ID",
    "Experience Certificate No  ",
    "Contract Amount",
    "Contract Start & End Date",
    "Work Status",
]

for page in range(1, 11):  # <--- Increase number of pages here
    print(f"Page: {page}")
    data["pageNo"] = page


    response = requests.post(
        "https://www.eprocure.gov.bd/AdvSearcheCMSServlet", data=data
    )
    # The HTML is missing a `table` tag, so we need to fix it
    soup = BeautifulSoup("<table>" + response.text + "</table>", "html.parser")
    df = pd.read_html(
        str(soup),
    )[0]

    df.columns = _columns
    print(df.to_string())

When I increase the number of pages in the for loop, the resulting df only contains the 10 rows of the last (in the above case, 10th) page.

The output I want, instead, is a dataframe where all the tables from all pages will be contained.
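For illustration, the same pattern with toy data shows what goes wrong: the loop body rebinds `df` on every iteration, so each page's table replaces the previous one rather than being accumulated.

```python
import pandas as pd

# Each iteration rebinds `df`, discarding the previous page's table --
# only the assignment from the final iteration survives the loop.
for page in (1, 2, 3):
    df = pd.DataFrame({"page": [page] * 2})

print(len(df))                 # 2 -- just the final page's rows
print(df["page"].tolist())     # [3, 3]
```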

CodePudding user response:

You can use pandas.concat.
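As a minimal sketch with toy data (the column names here are made up), `pandas.concat` stacks a list of same-shaped dataframes into one:

```python
import pandas as pd

# Two toy "pages" with identical columns, standing in for the scraped tables
page1 = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
page2 = pd.DataFrame({"id": [3, 4], "name": ["c", "d"]})

# Concatenate row-wise; ignore_index renumbers rows 0..n-1
combined = pd.concat([page1, page2], ignore_index=True)
print(len(combined))               # 4
print(combined["id"].tolist())     # [1, 2, 3, 4]
```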

Create a list outside the loop:

all_data = []

and within the loop append to it:

    df = pd.read_html(
        str(soup),
    )[0]

    all_data.append(df)

then, again, outside the loop:

df = pd.concat(all_data)

print(df.to_string())

Full example:

import requests
import pandas as pd
from bs4 import BeautifulSoup


data = {
    "action": "geteCMSList",
    "keyword": "",
    "officeId": "0",
    "contractAwardTo": "",
    "contractStartDtFrom": "",
    "contractStartDtTo": "",
    "contractEndDtFrom": "",
    "contractEndDtTo": "",
    "departmentId": "",
    "tenderId": "",
    "procurementMethod": "",
    "procurementNature": "",
    "contAwrdSearchOpt": "Contains",
    "exCertSearchOpt": "Contains",
    "exCertificateNo": "",
    "tendererId": "",
    "procType": "",
    "statusTab": "eTenders",
    "pageNo": "1",
    "size": "10",
    "workStatus": "All",
}


_columns = [
    "S. No",
    "Ministry, Division, Organization, PE",
    "Procurement Nature, Type & Method",
    "Tender/Proposal ID, Ref No., Title..",
    "Contract Awarded To",
    "Company Unique ID",
    "Experience Certificate No  ",
    "Contract Amount",
    "Contract Start & End Date",
    "Work Status",
]

all_data = []
for page in range(1, 2):  # <--- Increase number of pages here
    print(f"Page: {page}")
    data["pageNo"] = page


    response = requests.post(
        "https://www.eprocure.gov.bd/AdvSearcheCMSServlet", data=data
    )
    # The HTML is missing a `table` tag, so we need to fix it
    soup = BeautifulSoup("<table>" + response.text + "</table>", "html.parser")
    df = pd.read_html(
        str(soup),
    )[0]

    df.columns = _columns
    all_data.append(df)

_df = pd.concat(all_data)

print(_df.to_string())
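One optional tweak: by default `pd.concat` keeps each page's own 0-9 index, so the combined frame has repeated index labels. Passing `ignore_index=True` renumbers the rows continuously, as this toy example shows:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# Default: each frame keeps its original index labels
default = pd.concat([a, b])
# ignore_index=True: rows are renumbered 0..n-1 across all frames
renumbered = pd.concat([a, b], ignore_index=True)

print(default.index.tolist())      # [0, 1, 0, 1]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```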