Home > OS >  Iterate through excel files and extract specific range of data and add it to a data frame
Iterate through excel files and extract specific range of data and add it to a data frame

Time:01-10

I am trying to do an automation in python and I need to iterate through all excel files in a folder and look for specific range of text in each excel and add it to a dataframe as a seperate column. I think I'm able to read the excel files and extract the required data but I got the result in 3 seperate pandas class dataframes.

filesList = ["sample1.xlsx", "sample2.xlsx", "sample3.xlsx"]

for file in filesList:
    wb = load_workbook(file)
    ws = wb.active
    data_rows = []
    for row in ws['C6':'C9']:
        data_cols = []
        for cell in row:
            data_cols.append(cell.value)
        data_rows.append(data_rows)
       

Output:

                0 
0             apple
1             sold
2             No stock
                0 
0             orange 
1             not sold
2             In stock
                0 
0             kiwi
1             not sold
2             In stock   

Desired output:

0          1           2
apple    sold       No stock
orange   not sold   In stock
kiwi     not sold   In stock

CodePudding user response:

You probably want to use the DataFrame.from_records() function. Your code seems to have two issues:

  • You're not distinguishing between columns and rows of the DataFrame.
  • You're resetting the DataFrame on every iteration through filesList.

Revised code:

filesList = ["sample1.xlsx", "sample2.xlsx", "sample3.xlsx"]

data_rows = [] # Stores the data that will be loaded into the DF
column_name_dict = { # Map Excel columns to DF columns
    'C6': '0',
    'C7': '1',
    'C8': '2',
}
for file in filesList:
    wb = load_workbook(file)
    ws = wb.active
    for cell_ref, df_col in column_name_dict.items():
        data_cols = {}
        data_cols[df_col] = ws[cell_ref].value
        data_rows.append(data_cols)
composed_df = DataFrame.from_records(data_rows)
print(composed_df)

CodePudding user response:

Using openpyxl:

files = ["sample1.xlsx", "sample2.xlsx", "sample3.xlsx"]

dfs = []
for file in files:
    work_book = load_workbook(filename=file, read_only=True, data_only=True)
    data = [[x.value for x in row] for row in work_book.active["C6:C9"]]

    df = pd.DataFrame(data).transpose()
    dfs.append(df)

df_mapping = {x: y for x, y in enumerate(dfs)}
final_df = pd.concat([pd.DataFrame(df_mapping.get(x)) for x in df_mapping]).reset_index(drop=True)
print(final_df)
  • Related