panda read_excel index_col seems to skip a row-CodePudding

I am trying to get a multi level index and column pandas data frame from an excel file, but oddly it seems that it is skipping a row. Consider the following:

This is my code

df = pd.read_excel(r"https://buy-and-hold-strategy.s3.eu-central-1.amazonaws.com/example.xls",
                   skiprows=2,
                   sheet_name='Sheet1',
                   index_col=[0, 1],
                   header=[0, 1])
df.head()

returns

Out[46]: 
b        1          2     
sub      0    1     0    1
1 1                       
1 2  value  1.0   NaN  NaN
  3    NaN  NaN   NaN  NaN
2 1    NaN  NaN  text  2.0
  2    NaN  NaN   NaN  NaN
3 1    NaN  NaN   NaN  NaN

It seems that the first row of data 1.1 is stored as the index name:

 df.index.names
Out[49]: FrozenList([1, 1])

CodePudding user response：

I think your excel file has a wrong structure.. maybe you can adjust the structure of the Excel file like this:

Afterwards you can read your file like you aready did:

df = pd.read_excel((r"https://buy-and-hold-strategy.s3.eu-central-1.amazonaws.com/example.xls",
               skiprows=2,
               sheet_name='Sheet1',
               index_col=[0,1],
               header=[0,1]) 
df.head()

And you get the following result:

df.head()
b           1          2     
sub2        0    1     0    1
q sub1                       
1 1       NaN  NaN   NaN  NaN
2       value  1.0   NaN  NaN
3         NaN  NaN   NaN  NaN
2 1       NaN  NaN  text  2.0
2         NaN  NaN   NaN  NaN

CodePudding user response：

Since your problem comes from using index_col with multiple values (and that's what creates a multiindex), let's simplify the problem by not using an index. (Also, we don't need to modify the excel structure for this to work).

>>> df = pd.read_excel(r"https://buy-and-hold-strategy.s3.eu-central-1.amazonaws.com/example.xls",
                   skiprows=2,
                   sheet_name='Sheet1',
                   # index_col=[0],  # No index over here
                   header=[0, 1])

Now, you can create the new index:

>>> df = df.set_index(df.columns[:2].tolist())
>>> df
                                      1          2     
                                      0    1     0    1
(Unnamed: 0_level_0, q) (b, sub)                       
1                       1           NaN  NaN   NaN  NaN
                        2         value  1.0   NaN  NaN
                        3           NaN  NaN   NaN  NaN
2                       1           NaN  NaN  text  2.0
                        2           NaN  NaN   NaN  NaN