Home > Net >  Lexsorting Pandas Multi-Index with datetime index and string index
Lexsorting Pandas Multi-Index with datetime index and string index

Time:03-29

emphasized textI have created a number of dataframes, all with varied overlapping datetime indexes. I then append them to each other, and append a second index, the "name" column.

i try to get a slice of the dataframe by the entry in name, but it says it is not lexsorted. When i try to lexsort, it throws an error on the string in the name column, saying it is not a date...

df0 = pd.read_csv("file.csv", index_col='date', parse_dates=True)
df1 = pd.read_csv("file2.csv", index_col='date', parse_dates=True)

stack = pd.concat([df0, df1] ,axis=0)
stack.set_index('name', append=True,inplace=True)

print(stack.loc[datetime(2015,5,1)])
print(stack.loc['AAA'])

The .loc on date works and returns all names and values for the given date, but the .loc on the string, in the second index returns UnsortedIndexError: 'MultiIndex slicing requires the index to be lexsorted: slicing on levels [0], lexsort depth 0'

When I try to sort, based on what ive read here, using stack.sort_index(inplace=True) I get the error ValueError: Unknown datetime string format, unable to parse: AAA

the dataframe stack looks like this:

date       name       value
1/1/15     AAA         1
...        ...         ...
12/1/15    AAA         9
1/1/15     BBB         6
...        ...         ...
12/1/15    BBB         3

and i would like to be able to return all dates and values for AAA. I understand i can just keep it as a single index, and do groupby or filter, but I thought this was the purpose of the multiindex?

Edited again to clarify - I would like to know how to lexsort the second index. I am not sure why pandas is expecting it to also be a datetimeindex, which I think is what it is implying.

edited to add stack.head(), but its as i represented above.

                    value         
date       name
2015-01-02 AAA     164.71  
2015-01-05 AAA     163.00  
2015-01-06 AAA     160.82  
2015-01-07 AAA     159.90  
2015-01-08 AAA     160.65  

Edit3: When i sort the level name using stack.sort_index(level = [1], inplace=True,axis=1)

I get: pandas.errors.UnsortedIndexError: 'MultiIndex slicing requires the index to be lexsorted: slicing on levels [0], lexsort depth 0

CodePudding user response:

I dont like this solution but after lots of tinkering I arrived at :

stack.loc[(slice(None),'AAA'),:]

If there is another way than this or mozway's solution please reply.

CodePudding user response:

It looks like you're looking for a cross-section (xs):

df.xs('AAA', level='name', drop_level=False)

example:

df = (pd.DataFrame({'value': {(Timestamp('2015-01-02 00:00:00'), 'AAA'): 164.71,
                              (Timestamp('2015-01-05 00:00:00'), 'AAA'): 163.0,
                              (Timestamp('2015-01-06 00:00:00'), 'AAA'): 160.82,
                              (Timestamp('2015-01-02 00:00:00'), 'BBB'): 159.9,
                              (Timestamp('2015-01-05 00:00:00'), 'BBB'): 160.65}})
        .rename_axis(['date', 'name'])
        .sort_index(level=1)
     )

df.xs('AAA', level='name', drop_level=False)

                  value
date       name        
2015-01-02 AAA   164.71
2015-01-05 AAA   163.00
2015-01-06 AAA   160.82

Alternatively, using IndexSlice:

idx = pd.IndexSlice
df.loc[idx[:, 'AAA'], :]
  • Related