Home > other >  Correcting pandas index "duplicate" behavior
Correcting pandas index "duplicate" behavior

Time:01-22

I am trying to create a new dataframe using the structure of an existing index (the source of this is a excel file, so the empty spaces are important) and some new data. Typically I would create the index and assign it data values by constructing a new pd.DataFrame() object.

However, I am encountering odd behavior when creating the indices as None (again, needed as a placeholder. Is there a workaround for this? Or am I approaching this incorrectly? Thanks

import pandas as pd
import numpy as np

#index of final_df - note the None values in index are needed
#because the final output will be written to a excel file with those
#rows blank
test_index = ['row1', 'row2', None, None, 'rowN']

#some dummy source data - think of this as new sales data coming in weekly
test_update_df = pd.DataFrame(index=['row1', np.NaN, 'row2', np.NaN, 'rowN'],
                             data=[1,np.NaN, 64,np.NaN, 643.78])
test_update_df.columns = ['1/6/23']

#create the final df
#BUT here is where the problems lie
final_df = pd.DataFrame(index=test_index,
            data=test_update_df)

"""

ValueError: cannot reindex from a duplicate axis

I believe we are getting this due to the None values in the index?

"""

CodePudding user response:

This should do what your question asks given input where input rows with null index values have null data:

res = test_update_df[test_update_df.index.notna()].reindex(test_index)

Output:

      1/6/23
row1    1.00
row2   64.00
None     NaN
None     NaN
rowN  643.78

CodePudding user response:

Another apporach is:

test_update_df.dropna().reindex(test_index)

      1/6/23
row1    1.00
row2   64.00
None     NaN
None     NaN
rowN  643.78

Or:

test_update_df.loc[test_update_df.index.intersection(test_index)].reindex(test_index)

CodePudding user response:

I believe we are getting this due to the None values in the index?

The problem cause duplicate np.NaN index values int test_update_df dataframe ( ValueError: cannot reindex on an axis with duplicate labels ):

import pandas as pd
import numpy as np

test_index = ['row1', 'row2', None, None, 'rowN'] # THIS is OK

#some dummy source data - think of this as new sales data coming in weekly
test_update_df = pd.DataFrame(
    index=['row1', np.NaN, 'row2', 'rowN'], # <- duplicates HERE are the problem
    data=[1,np.NaN, 64, 643.78])
test_update_df.columns = ['1/6/23']

#create the final df
final_df = pd.DataFrame(index=test_index, data=test_update_df)

print(final_df) # runs OK

prints

      1/6/23
row1    1.00
row2   64.00
None     NaN
None     NaN
rowN  643.78
  • Related