pandas: cannot reindex from a duplicate axis error

I am trying to generate some synthetic data using the code below:

df = pd.DataFrame({
    'year' : pd.Series([2014, 2018]).repeat(500),
    'gender': np.random.choice(['M','F'],1000,p=[0.55,0.45]),
    'age': np.random.choice(range(20,65),1000),
})

df.loc[df['gender']=='M', 'income'] = ( 2000 + 550*df['age'] - 5.25*df['age']**2 ) * np.random.lognormal() * 1.035**(df['year']-2014)

It was working fine until I added the year variable, and now I am getting the following error:

ValueError: cannot reindex from a duplicate axis

I don't understand why I am getting this error as I am not doing anything with the index. What does this particular error mean, and how would I amend the code to get this to work?

CodePudding user response：

Filtering both sides with the same mask helps, because the Series on the right side then has exactly the same rows (and index) as the selection on the left side:

np.random.seed(0)

df = pd.DataFrame({
    'year' : pd.Series([2014, 2018]).repeat(500),
    'gender': np.random.choice(['M','F'],1000,p=[0.55,0.45]),
    'age': np.random.choice(range(20,65),1000),
})

m = df['gender']=='M'
df.loc[m, 'income'] = ( 2000 + 550*df.loc[m, 'age'] - 5.25*df.loc[m, 'age']**2 ) * np.random.lognormal() * 1.035**(df.loc[m, 'year']-2014)

print (df)
    year gender  age        income
0   2014      M   37  11556.758435
0   2014      F   49           NaN
0   2014      F   26           NaN
0   2014      M   48  12426.597386
0   2014      M   30  10499.041891
..   ...    ...  ...           ...
1   2018      M   31  12248.836026
1   2018      M   49  14295.447090
1   2018      F   34           NaN
1   2018      M   39  13525.781391
1   2018      F   46           NaN

[1000 rows x 4 columns]

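The reason is that `Series.repeat` also repeats the index labels, so the DataFrame has a duplicated index (only the labels 0 and 1). Without filtering, the right side is a Series of length 1000 with that duplicated index, and pandas cannot align it to the rows selected on the left side: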

print (( 2000 + 550*df['age'] - 5.25*df['age']**2 ) * np.random.lognormal() * 1.035**(df['year']-2014))
0    11556.758435
0    12457.656258
0     9718.568650
0    12426.597386
0    10499.041891
         ...
1    12248.836026
1    14295.447090
1    12796.566872
1    13525.781391
1    14160.974248
Length: 1000, dtype: float64
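To see where the duplicated labels come from, here is a minimal sketch that rebuilds the same DataFrame and inspects its index. The variable names `year`, `index_is_unique` and `distinct_labels` are only for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Series.repeat repeats the index labels together with the values,
# so this Series carries the label 0 five hundred times, then the label 1.
year = pd.Series([2014, 2018]).repeat(500)

# The DataFrame takes its index from the Series; the plain arrays only
# need to have the matching length of 1000.
df = pd.DataFrame({
    'year': year,
    'gender': np.random.choice(['M', 'F'], 1000, p=[0.55, 0.45]),
    'age': np.random.choice(range(20, 65), 1000),
})

index_is_unique = df.index.is_unique          # False for this frame
distinct_labels = df.index.unique().tolist()  # the labels that are repeated
print(index_is_unique, distinct_labels)
```

So even though the code never touches the index explicitly, the `year` column brings a non-unique one along with it.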

After filtering, the right side has the same length as the number of True values in the mask, and its index is identical to that of the selected rows, so the assignment works:

print (m.sum())
566

print (( 2000 + 550*df.loc[m, 'age'] - 5.25*df.loc[m, 'age']**2 ) * np.random.lognormal() * 1.035**(df.loc[m, 'year']-2014))
0    11556.758435
0    12426.597386
0    10499.041891
0    12450.987175
0    12072.183268
         ...
1    10384.145945
1    12248.836026
1    12248.836026
1    14295.447090
1    13525.781391
Length: 566, dtype: float64
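A quick way to check this is to compare the filtered right side with the rows selected by the mask. This is only a sketch: `rhs`, `age`, `year`, `same_length` and `same_index` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame({
    'year': pd.Series([2014, 2018]).repeat(500),
    'gender': np.random.choice(['M', 'F'], 1000, p=[0.55, 0.45]),
    'age': np.random.choice(range(20, 65), 1000),
})

m = df['gender'] == 'M'

# Filter every input with the same mask before computing the right side.
age = df.loc[m, 'age']
year = df.loc[m, 'year']
rhs = (2000 + 550 * age - 5.25 * age ** 2) * np.random.lognormal() * 1.035 ** (year - 2014)

# The right side has one value per True in the mask ...
same_length = len(rhs) == m.sum()
# ... and carries exactly the index labels of the selected rows, in the same order.
same_index = rhs.index.equals(df.index[m.to_numpy()])
print(same_length, same_index)
```

Because both sides already line up row for row, pandas should not need to reindex anything, which is where the error came from.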

Another option is to create a default index, which removes the duplicated labels so the original unfiltered assignment also works:

df = df.reset_index(drop=True)
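Put together, a sketch of the whole thing with a default index could look like this. It assumes the formula from the question with the `+` after 2000, and keeps the single scalar `np.random.lognormal()` draw as in the original code.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame({
    'year': pd.Series([2014, 2018]).repeat(500),
    'gender': np.random.choice(['M', 'F'], 1000, p=[0.55, 0.45]),
    'age': np.random.choice(range(20, 65), 1000),
}).reset_index(drop=True)  # replace the duplicated labels with 0..999

m = df['gender'] == 'M'

# The right side is left unfiltered (length 1000). With a unique index,
# pandas can align it to the rows selected by the mask.
df.loc[m, 'income'] = (
    (2000 + 550 * df['age'] - 5.25 * df['age'] ** 2)
    * np.random.lognormal()
    * 1.035 ** (df['year'] - 2014)
)

print(df.head())
```

Building the column with `np.repeat([2014, 2018], 500)` instead of `pd.Series(...).repeat(500)` should avoid the problem from the start, since a plain NumPy array carries no index.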