I am trying to create a new dataframe using the structure of an existing index (the source of this is a excel file, so the empty spaces are important) and some new data. Typically I would create the index and assign it data values by constructing a new pd.DataFrame() object.
However, I am encountering odd behavior when creating the indices as None (again, needed as a placeholder. Is there a workaround for this? Or am I approaching this incorrectly? Thanks
import pandas as pd
import numpy as np
#index of final_df - note the None values in index are needed
#because the final output will be written to a excel file with those
#rows blank
test_index = ['row1', 'row2', None, None, 'rowN']
#some dummy source data - think of this as new sales data coming in weekly
test_update_df = pd.DataFrame(index=['row1', np.NaN, 'row2', np.NaN, 'rowN'],
data=[1,np.NaN, 64,np.NaN, 643.78])
test_update_df.columns = ['1/6/23']
#create the final df
#BUT here is where the problems lie
final_df = pd.DataFrame(index=test_index,
data=test_update_df)
"""
ValueError: cannot reindex from a duplicate axis
I believe we are getting this due to the None values in the index?
"""
CodePudding user response:
This should do what your question asks given input where input rows with null index values have null data:
res = test_update_df[test_update_df.index.notna()].reindex(test_index)
Output:
1/6/23
row1 1.00
row2 64.00
None NaN
None NaN
rowN 643.78
CodePudding user response:
Another apporach is:
test_update_df.dropna().reindex(test_index)
1/6/23
row1 1.00
row2 64.00
None NaN
None NaN
rowN 643.78
Or:
test_update_df.loc[test_update_df.index.intersection(test_index)].reindex(test_index)
CodePudding user response:
I believe we are getting this due to the None values in the index?
The problem cause duplicate np.NaN index values int test_update_df
dataframe ( ValueError: cannot reindex on an axis with duplicate labels
):
import pandas as pd
import numpy as np
test_index = ['row1', 'row2', None, None, 'rowN'] # THIS is OK
#some dummy source data - think of this as new sales data coming in weekly
test_update_df = pd.DataFrame(
index=['row1', np.NaN, 'row2', 'rowN'], # <- duplicates HERE are the problem
data=[1,np.NaN, 64, 643.78])
test_update_df.columns = ['1/6/23']
#create the final df
final_df = pd.DataFrame(index=test_index, data=test_update_df)
print(final_df) # runs OK
prints
1/6/23
row1 1.00
row2 64.00
None NaN
None NaN
rowN 643.78