Finding and storing row number (int) for specific value in df

I am using the following code to detect a row number (the header) in multiple CSV files which I am processing:

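import numpy as np
import pandas as pd

combined_xlsx = pd.read_excel(xlsxfile, nrows=20)
# Row index (within the preview values) of the first cell equal to 'PCA'
out = np.where([combined_xlsx.values == 'PCA'])[1][0]
combined_xlsx = pd.read_excel(xlsxfile, header=out + 1)
combined_xlsx.dropna(subset=['PCA'], inplace=True)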

The header row is identified by the first occurrence of the value 'PCA'; its position is stored and then used to read the whole file. I cannot pass a fixed number to the header= parameter because the header row sits at different positions in the original files.

When the header row is at row position 0, the code fails with the following error:

IndexError: index 0 is out of bounds for axis 0 with size 0

How can I solve this issue and correctly determine the header row whether at row position 0 or not?

CodePudding user response：

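Your usage of np.where is the problem. Its signature is numpy.where(condition, [x, y]), where condition is a boolean array. With x and y given, it yields x where condition is True and y otherwise. With condition alone, it returns the indices of the True elements, so when nothing matches the result is empty and indexing it with [0] raises the IndexError.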
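
For illustration, here is a minimal sketch of how np.where behaves in both forms. The small array is made up and only stands in for combined_xlsx.values; it is not the asker's data. With a single argument the function returns index arrays, and those arrays are empty when nothing matches, which appears to be what triggers the IndexError in the question.

```python
import numpy as np

# Hypothetical stand-in for combined_xlsx.values; not the asker's data.
values = np.array([["x", "y"], ["PCA", "z"]], dtype=object)

# Three-argument form: pick "hit" where True, "miss" where False.
labels = np.where(values == "PCA", "hit", "miss")

# One-argument form: a tuple of index arrays, one per dimension.
rows, cols = np.where(values == "PCA")
first_row = rows[0]  # row index of the first match

# No match: the index arrays are empty, so indexing [0] raises IndexError.
no_rows, _ = np.where(values == "absent")
try:
    no_rows[0]
    raised = False
except IndexError:
    raised = True
```

When 'PCA' is already a column label, it never appears in the values, so the lookup ends up in the no-match case.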

CodePudding user response：

Fixed with the following function:

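def headerfinder(df, mystr):
    # df is a preview read with the default header=0
    if mystr in df.columns:
        # mystr is already a column label, so the header is the first row
        out = 0
    else:
        # first matching row in the values, shifted by one for the header row
        out = np.where([df.values == mystr])[1][0] + 1
    return out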

Not a neat solution, but it works in my case.
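
A possibly tidier variant, sketched under the assumption that the preview is read with header=None, so that the header row appears among the values like any other row and no special case for row 0 is needed. The name find_header_row is hypothetical, and the in-memory frames below only stand in for pd.read_excel(xlsxfile, header=None, nrows=20).

```python
import pandas as pd

def find_header_row(preview, marker):
    """Return the 0-based index of the first row containing marker.

    Assumes preview was read with header=None, so a header on the first
    line of the file shows up as row 0 instead of as column labels.
    """
    hits = preview.eq(marker).any(axis=1)  # True for rows holding marker
    if not hits.any():
        raise ValueError(f"{marker!r} not found in the preview rows")
    return int(hits.to_numpy().argmax())  # position of the first True

# Stand-ins for pd.read_excel(xlsxfile, header=None, nrows=20).
top = pd.DataFrame([["ID", "PCA"], [1, 0.5]])
lower = pd.DataFrame([["note", None], ["ID", "PCA"], [1, 0.5]])

print(find_header_row(top, "PCA"))
print(find_header_row(lower, "PCA"))
```

In that setup the returned index is meant to be passed as header= when rereading the full file, without the + 1 offset. This has not been checked against the asker's files.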