Correcting pandas index "duplicate" behavior-CodePudding

I am trying to create a new dataframe using the structure of an existing index (the source of this is a excel file, so the empty spaces are important) and some new data. Typically I would create the index and assign it data values by constructing a new pd.DataFrame() object.

However, I am encountering odd behavior when creating the indices as None (again, needed as a placeholder. Is there a workaround for this? Or am I approaching this incorrectly? Thanks

import pandas as pd
import numpy as np

#index of final_df - note the None values in index are needed
#because the final output will be written to a excel file with those
#rows blank
test_index = ['row1', 'row2', None, None, 'rowN']

#some dummy source data - think of this as new sales data coming in weekly
test_update_df = pd.DataFrame(index=['row1', np.NaN, 'row2', np.NaN, 'rowN'],
                             data=[1,np.NaN, 64,np.NaN, 643.78])
test_update_df.columns = ['1/6/23']

#create the final df
#BUT here is where the problems lie
final_df = pd.DataFrame(index=test_index,
            data=test_update_df)

"""

ValueError: cannot reindex from a duplicate axis

I believe we are getting this due to the None values in the index?

"""

CodePudding user response：

This should do what your question asks given input where input rows with null index values have null data:

res = test_update_df[test_update_df.index.notna()].reindex(test_index)

Output:

      1/6/23
row1    1.00
row2   64.00
None     NaN
None     NaN
rowN  643.78

CodePudding user response：

Another apporach is:

test_update_df.dropna().reindex(test_index)

      1/6/23
row1    1.00
row2   64.00
None     NaN
None     NaN
rowN  643.78

Or:

test_update_df.loc[test_update_df.index.intersection(test_index)].reindex(test_index)

CodePudding user response：

I believe we are getting this due to the None values in the index?

The problem cause duplicate np.NaN index values int test_update_df dataframe ( ValueError: cannot reindex on an axis with duplicate labels ):

import pandas as pd
import numpy as np

test_index = ['row1', 'row2', None, None, 'rowN'] # THIS is OK

#some dummy source data - think of this as new sales data coming in weekly
test_update_df = pd.DataFrame(
    index=['row1', np.NaN, 'row2', 'rowN'], # <- duplicates HERE are the problem
    data=[1,np.NaN, 64, 643.78])
test_update_df.columns = ['1/6/23']

#create the final df
final_df = pd.DataFrame(index=test_index, data=test_update_df)

print(final_df) # runs OK

prints

      1/6/23
row1    1.00
row2   64.00
None     NaN
None     NaN
rowN  643.78