The are two categories of the each excel's second sheet: Applicant-New and Applicant-Recurring. I used the code below to extract only the Applicant-New data but my df returned both Applicant-New and Applicant-Recurring data.
import pandas as pd
import os
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
for i in files_xlsx:
df = pd.read_excel(i, sheet_name="Applicant-New")
new_row = {"Application No.":df.iloc[0,0]}
consolidated_df = consolidated_df.append(new_row, ignore_index = True)
How do I resolve this issue so that I only get Applicant-New data?
CodePudding user response:
Can you try to change following line:
df = pd.read_excel(i, sheet_name="Applicant-New")
to:
df = pd.read_excel(i, sheet_name="sheet2", usecols=[0])
in pd.read_excel, sheet_name is to choose which sheet you want, not which column or category you want. In your case it is "sheet2" I believe. Then you can specific which column you want to parse from that sheet. 0 represents the first column and 1 represents the second column.
CodePudding user response:
IIUC? Using pathlib and list comps:
from pathlib import Path
import pandas as pd
files = [file for file in Path.cwd().glob("*.xlsx")]
consolidated_df = pd.concat(
[pd.DataFrame([[{"Application No.": pd.read_excel(file, sheet_name="Applicant-New").iloc[0, 0]}]]) for file in files]
).reset_index(drop=True).rename(columns={0: "applications"})
print(consolidated_df)
Output:
applications
0 {'Application No.': 1}
1 {'Application No.': 4}
2 {'Application No.': 7}