How to update hierarchical index after drop in Pandas?

Time:11-25

I set up my DataFrame like so:

import pandas as pd

cols = ['molecule_id', 'atom_id', 'atom_type', 'x', 'y', 'z']
# from_dict is a classmethod, so call it on the class, not on an empty frame
data = (
    pd.DataFrame.from_dict(data_dict, orient='index', columns=cols)
      .set_index(['molecule_id', 'atom_id'])
)
print(data.head(8))

Here data_dict has type dict[str, list].
This prints:

                    atom_type         x         y         z
molecule_id atom_id
0           0               C -2.893477 -2.893477 -2.893477
            1               S -3.293477 -2.893477 -2.893477
1           0               C -2.893477 -1.736086 -1.736086
            1               S -3.293477 -1.736086 -1.736086
2           0               C -1.736086 -2.893477 -1.736086
            1               S -2.136086 -2.893477 -1.736086
3           0               C -1.736086 -1.736086 -2.893477
            1               S -2.136086 -1.736086 -2.893477

Later in the code, I need to remove a molecule (say #1) from this frame, so I do:

data.drop(labels=1, level='molecule_id', axis=0, inplace=True)

which leaves:

                    atom_type         x         y         z
molecule_id atom_id
0           0               C -1.736086 -2.893477 -1.736086
            1               S -2.136086 -2.893477 -1.736086
2           0               C -2.893477 -2.893477 -2.893477
            1               S -3.293477 -2.893477 -2.893477
3           0               C -2.893477 -1.736086 -1.736086
            1               S -3.293477 -1.736086 -1.736086
4           0               C -1.736086 -2.893477  0.578695
            1               S -2.136086 -2.893477  0.578695

At this point, I want to renumber the 'molecule_id' level so that it is consecutive again, giving the desired output:

                    atom_type         x         y         z
molecule_id atom_id
0           0               C -1.736086 -2.893477 -1.736086
            1               S -2.136086 -2.893477 -1.736086
1           0               C -2.893477 -2.893477 -2.893477
            1               S -3.293477 -2.893477 -2.893477
2           0               C -2.893477 -1.736086 -1.736086
            1               S -3.293477 -1.736086 -1.736086
3           0               C -1.736086 -2.893477  0.578695
            1               S -2.136086 -2.893477  0.578695

Pandas stores the levels of a MultiIndex in an immutable FrozenList, so I cannot assign to them directly:

data.index.levels[0] = new_id_level
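
To illustrate the restriction, here is a minimal sketch on a toy index (the index contents and the replacement labels 10 and 11 are made up for illustration). Direct assignment into levels raises a TypeError, while set_levels returns a new MultiIndex that can be assigned back to the frame.

```python
import pandas as pd

# Toy index, assumed for illustration only
idx = pd.MultiIndex.from_product([[0, 1], [0, 1]],
                                 names=['molecule_id', 'atom_id'])

try:
    idx.levels[0] = [10, 11]   # FrozenList rejects item assignment
    assigned = True
except TypeError as err:
    assigned = False
    print(err)

# set_levels leaves idx untouched and returns a relabelled copy
new_idx = idx.set_levels([10, 11], level='molecule_id')
print(new_idx.get_level_values('molecule_id').tolist())  # [10, 10, 11, 11]
```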

The solution I have in mind is to rebuild the MultiIndex from scratch and apply it to the DataFrame with set_index():

atoms_per_molecule = 2
num_molecules = len(data)//atoms_per_molecule
molecule_id = np.repeat(range(num_molecules), atoms_per_molecule)
atom_id = np.tile(range(atoms_per_molecule), num_molecules)
tuples = list(zip(molecule_id, atom_id))
names = ['molecule_id', 'atom_id']
multi_id = pd.MultiIndex.from_tuples(tuples, names=names)
data.set_index(multi_id, inplace=True)
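
The same rebuild can probably be written more compactly with pd.MultiIndex.from_product. The sketch below assumes, as the code above does, that every molecule has the same number of atoms and that the rows are ordered by molecule and then by atom. The small frame is a stand-in for the real data after a drop.

```python
import pandas as pd

atoms_per_molecule = 2
names = ['molecule_id', 'atom_id']

# Stand-in for the frame after a drop: three molecules labelled 0, 2, 3
data = pd.DataFrame(
    {'atom_type': ['C', 'S'] * 3},
    index=pd.MultiIndex.from_product([[0, 2, 3], range(atoms_per_molecule)],
                                     names=names),
)

num_molecules = len(data) // atoms_per_molecule
# Cartesian product: the first level varies slowest, the last fastest
data.index = pd.MultiIndex.from_product(
    [range(num_molecules), range(atoms_per_molecule)], names=names
)
print(data.index.tolist())
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```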

It works, but it seems unreasonably complicated given how many drops I plan to perform.

Is there a simpler, more efficient way to do this?
P.S.: Is it possible to create some kind of index that can be reset to a given pattern?

CodePudding user response:

Sample dataframe:

import pandas as pd

data = pd.DataFrame({
    'a': [0, 0, 0, 1, 2, 2, 3, 3],
    'b': [0, 1, 2, 0, 0, 1, 0, 1],
    'col_1': [3, 14, 15, 92, 65, 35, 89, 79],
    'col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})
data = data.set_index(['a', 'b'])
data = data.drop(labels=1, level='a', axis=0, inplace=False)
print(data)

This gives:

     col_1 col_2
a b             
0 0      3     a
  1     14     b
  2     15     c
2 0     65     e
  1     35     f
3 0     89     g
  1     79     h

Modify index:

data.index = data.index.remove_unused_levels()
n = data.index.get_level_values(0).nunique()
data.index = data.index.set_levels(range(n), level=0)

Dropping rows from a DataFrame does not shrink the levels stored in its MultiIndex: the dropped label stays in index.levels even though no row uses it. The first line removes the levels that are no longer used. The second line counts the distinct values remaining in level 0. The third line replaces the level 0 labels with consecutive integers starting from 0.
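
To see why the first step is needed, the following sketch compares the stored levels with the labels actually in use after a drop. The frame and the column name v are made up for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1, 2], [0, 1]], names=['a', 'b'])
df = pd.DataFrame({'v': range(6)}, index=idx).drop(labels=1, level='a')

stored = list(df.index.levels[0])                       # labels kept in the index
used = sorted(set(df.index.get_level_values('a')))      # labels rows still use
pruned = list(df.index.remove_unused_levels().levels[0])

print(stored)  # still contains the dropped label 1
print(used)
print(pruned)  # matches the labels in use
```

Without remove_unused_levels, set_levels would have to supply a label for the unused entry as well, so the number of new labels would not match the number of surviving groups.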

The resulting dataframe looks as follows:

     col_1 col_2
a b             
0 0      3     a
  1     14     b
  2     15     c
1 0     65     e
  1     35     f
2 0     89     g
  1     79     h