Home > Software design >  Loop through excels and extract values in excel using pandas
Loop through excels and extract values in excel using pandas

Time:02-24

I am learning python while doing small automation and I need help here. I have excel files in folder and I need to read all excel files and look for specific text in excel and get value from a specific column if required text is found. I was able to read excel and get values but I think I am getting those values in separate dataframe for each excel. After having dataframe, I am trying to combine but I get only one row in output. I need help in getting rows for all excels in one dataframe. Below is my code

import pandas as pd
import numpy as np
from pathlib import Path

p = Path(path to excel files)
filtered = [x for x in p.glob("**/*.xlsx") if not x.name.__contains__("R4")] #doing filter to 
exclude some excel 
ls = list(filtered)
L = len(ls)
for x in ls[0:L]:
    data = pd.read_excel(x, sheet_name=[1], header=1)
    df = data[1]
    a = np.where(df.values == 'Deductible Individual')
    b = [x[0] for x in a]
    val = b[0]
    d1 = np.where(df.values == 'Deductible Family')
    b = [x[0] for x in d1]
    val3 = b[0]
    INdata1 = df.loc[val, 'In-Network\nVALUE']
    INDDEDIN = pd.DataFrame({"INDIVIDUAL DEDUCTIBLE INN": [INdata1]})
    INdata3 = df.loc[val3, 'In-Network\nVALUE']
    FAMDEDIN = pd.DataFrame({"FAMILY DEDUCTIBLE INN": [INdata3]})

result = pd.concat([INDDEDIN, FAMDEDIN], axis=1)

print(INDDEDIN)
INDIVIDUAL DEDUCTIBLE INN
0                     1400.0
INDIVIDUAL DEDUCTIBLE INN
0                     1500.0

print(FAMDEDIN)
FAMILY DEDUCTIBLE INN
0                 4200.0
FAMILY DEDUCTIBLE INN
0                 2400.0

print(result)
INDIVIDUAL DEDUCTIBLE INN  FAMILY DEDUCTIBLE INN
0              1500.0                 2400.0
              

But I am expecting result to print like below:

 INDIVIDUAL DEDUCTIBLE INN  FAMILY DEDUCTIBLE INN
  0              1500.0                 2400.0
                 1400.0                 4200.0

Thank you ewz93. I tried below code and it is much easier and simple. But result is printed as below:

                              0         1
INDIVIDUAL DEDUCTIBLE INN  [1400.0]  [1500.0]
FAMILY DEDUCTIBLE INN      [4200.0]  [2400.0]

I need result as below, how can I do that?

INDIVIDUAL DEDUCTIBLE INN      FAMILY DEDUCTIBLE INN
  1400.0                         4200.0
  1500.0                         2400.0

CodePudding user response:

I can't really follow your code (you should try using more expressive variable names, not only for others sake but also for you if you later want to understand or change parts of the code), but I assume that the problem is mainly because of the indexes.

The way you construct this results in INDDEDIN and FAMDEDIN each consisting of two rows which both have the index 0, so using concat will cause problems. You could try something like pd.concat([INDDEDIN, FAMDEDIN], axis=1, ignore_index=True) if you are sure the order in them always is perfectly aligned anyway. A better way to go about this would be to clean up the way you create the DataFrame.

The way I see it you also assign values to INDDEDIN and FAMDEDIN within the loop and then use them for the first time outside the loop, resulting in only the last iterations value being used for the part after the loop.

Although there are many ways one way I like to create a DataFrame in an iteration is to create a nested dictionary and then using pandas.DataFrame.from_dict().

Edit: Here is a way you could do it (and make it more readable):

result = {}

from pathlib import Path

p = Path(path to excel files)
# filter files
filtered_files = [x for x in p.glob("**/*.xlsx") if not x.name.__contains__("R4")]
# iterate over filtered files
for file in filtered_files:
    # read file into a DataFrame
    full_df = pd.read_excel(file, sheet_name=[1], header=1)
    # get column 1 name
    col_1_name = full_df.columns[1]
    # get partial dataframe with rows filtered by value in column 1
    ded_ind_df = full_df[full_df[col_1_name]=='Deductible Individual'] 
    # get partial dataframe with rows filtered by value in column 1
    ded_fam_df = full_df[full_df[col_1_name]=='Deductible Family']
    # add results for this file to the results dictionary
    result[file] = {
        'INDIVIDUAL DEDUCTIBLE INN': list(ded_ind_df['In-Network\nVALUE']), 
        'FAMILY DEDUCTIBLE INN': list(ded_fam_df['In-Network\nVALUE'])
    }

# convert results dictionary to DataFrame
result = pd.DataFrame.from_dict(result)

This will result in a DataFrame where the indices are the file names, if you instead want indexes you can wrap filtered_files using for i, file in enumerate(filtered_files) and then use result[i] = instead of result[file] =.

  • Related