I got a ton of excel files and am trying to extract the second sheet of each of these excel files-CodePudding

The are two categories of the each excel's second sheet: Applicant-New and Applicant-Recurring. I used the code below to extract only the Applicant-New data but my df returned both Applicant-New and Applicant-Recurring data.

import pandas as pd
import os

path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']

for i in files_xlsx:
    df = pd.read_excel(i, sheet_name="Applicant-New")
    new_row = {"Application No.":df.iloc[0,0]}
    consolidated_df = consolidated_df.append(new_row, ignore_index = True)

How do I resolve this issue so that I only get Applicant-New data?

CodePudding user response：

Can you try to change following line:

df = pd.read_excel(i, sheet_name="Applicant-New")

to:

df = pd.read_excel(i, sheet_name="sheet2", usecols=[0])

in pd.read_excel, sheet_name is to choose which sheet you want, not which column or category you want. In your case it is "sheet2" I believe. Then you can specific which column you want to parse from that sheet. 0 represents the first column and 1 represents the second column.

CodePudding user response：

IIUC? Using pathlib and list comps:

from pathlib import Path

import pandas as pd


files = [file for file in Path.cwd().glob("*.xlsx")]

consolidated_df = pd.concat(
    [pd.DataFrame([[{"Application No.": pd.read_excel(file, sheet_name="Applicant-New").iloc[0, 0]}]]) for file in files]
).reset_index(drop=True).rename(columns={0: "applications"})

print(consolidated_df)

Output:

             applications
0  {'Application No.': 1}
1  {'Application No.': 4}
2  {'Application No.': 7}