What is the best way to read just the columns that have headers with pandas?-CodePudding

I have to read sheets from an xlsx file, but some of them have useless data outside the range of interest. The name of the headers can vary, so I can't use usecols=listOfHeaders

For example: Input

At this example I would like to import just the columns 1,2 and 3, but pandas automatically import the fourth column as an unnamed column. I would like to import just the the first three.

The output is like this Output

CodePudding user response：

You can pass a Boolean function as usecols, e.g.

pd.read_excel(..., usecols=lambda x: not x.startswith('Unnamed'))

This will exclude the columns with empty headers.

CodePudding user response：

It is not enough to filter out "Unnamed" columns (the function passed as usecols is given the target column name (either the original column name or "Unnamed…" if the original is empty)).

Your input Excel file contains second row filled with "gram" and as I suppose it should be skipped.

So my solution is:

df = pd.read_excel('Input.xlsx', skiprows=[1],
    usecols=lambda colName: not colName.startswith('Unnamed'))

Note: There is no way to pass both header=[0,1] (to use two initial rows as column names) and usecols parameters, as in this case read_excel raises an exception. So the only way, as I wrote, is just to skip the second row.