I am trying to generate some synthetic data using the below code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': pd.Series([2014, 2018]).repeat(500),
    'gender': np.random.choice(['M', 'F'], 1000, p=[0.55, 0.45]),
    'age': np.random.choice(range(20, 65), 1000),
})
df.loc[df['gender']=='M', 'income'] = ( 2000 + 550*df['age'] - 5.25*df['age']**2 ) * np.random.lognormal() * 1.035**(df['year']-2014)
It was working fine until I added the year variable, and now I am getting the following error:
ValueError: cannot reindex from a duplicate axis
I don't understand why I am getting this error as I am not doing anything with the index. What does this particular error mean, and how would I amend the code to get this to work?
Answer:
Filtering on both sides of the assignment helps, because the Series on the right-hand side then covers exactly the same rows as the selection on the left-hand side:
np.random.seed(0)
df = pd.DataFrame({
    'year': pd.Series([2014, 2018]).repeat(500),
    'gender': np.random.choice(['M', 'F'], 1000, p=[0.55, 0.45]),
    'age': np.random.choice(range(20, 65), 1000),
})
# boolean mask, reused on both sides of the assignment
m = df['gender']=='M'
df.loc[m, 'income'] = ( 2000 + 550*df.loc[m, 'age'] - 5.25*df.loc[m, 'age']**2 ) * np.random.lognormal() * 1.035**(df.loc[m, 'year']-2014)
print (df)
year gender age income
0 2014 M 37 11556.758435
0 2014 F 49 NaN
0 2014 F 26 NaN
0 2014 M 48 12426.597386
0 2014 M 30 10499.041891
.. ... ... ... ...
1 2018 M 31 12248.836026
1 2018 M 49 14295.447090
1 2018 F 34 NaN
1 2018 M 39 13525.781391
1 2018 F 46 NaN
[1000 rows x 4 columns]
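The reason is that, without filtering, the right-hand side is a Series of length 1000 whose index contains duplicated labels (0 and 1, each repeated 500 times), so pandas cannot align it to the selected rows:
print (( 2000 + 550*df['age'] - 5.25*df['age']**2 ) * np.random.lognormal() * 1.035**(df['year']-2014))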
0 11556.758435
0 12457.656258
0 9718.568650
0 12426.597386
0 10499.041891
1 12248.836026
1 14295.447090
1 12796.566872
1 13525.781391
1 14160.974248
Length: 1000, dtype: float64
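To see where the duplicated labels come from, here is a minimal sketch on a tiny Series (the name s and the repeat count of 3 are only for illustration). It shows that Series.repeat repeats the index labels together with the values, and that reindexing such a Series raises a ValueError. The exact wording of the message depends on the pandas version.

```python
import pandas as pd

# Series.repeat repeats the index labels along with the values.
s = pd.Series([2014, 2018]).repeat(3)
print(s.index.tolist())    # [0, 0, 0, 1, 1, 1]
print(s.index.is_unique)   # False

# Aligning such a Series to a different set of labels needs a reindex,
# which pandas refuses to do when the labels are duplicated.
try:
    s.reindex([0, 1])
    reindex_failed = False
except ValueError as exc:
    reindex_failed = True
    print(type(exc).__name__, exc)
```

Because the DataFrame is built from this Series, it inherits the same duplicated index, even though the code never touches the index explicitly.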
With filtering, the right-hand side has exactly the same index as the rows selected by the mask (its length equals the number of True values), so the assignment works:
print (m.sum())
566
print (( 2000 + 550*df.loc[m, 'age'] - 5.25*df.loc[m, 'age']**2 ) * np.random.lognormal() * 1.035**(df.loc[m, 'year']-2014))
0 11556.758435
0 12426.597386
0 10499.041891
0 12450.987175
0 12072.183268
1 10384.145945
1 12248.836026
1 12248.836026
1 14295.447090
1 13525.781391
Length: 566, dtype: float64
Another option is to replace the duplicated index with a default one (0 to 999), after which the original unfiltered assignment works:
df = df.reset_index(drop=True)
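Put together, the reset_index approach might look like the sketch below. It reuses the data from the question; the variable n and the fixed seed are only there to keep the example short and reproducible.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1000  # illustrative name for the number of rows

df = pd.DataFrame({
    'year': pd.Series([2014, 2018]).repeat(n // 2),
    'gender': np.random.choice(['M', 'F'], n, p=[0.55, 0.45]),
    'age': np.random.choice(range(20, 65), n),
})

# Replace the duplicated labels (0, 0, ..., 1, 1, ...) with 0..n-1.
df = df.reset_index(drop=True)

# With unique labels, the unfiltered right-hand side can be aligned
# to the rows selected by the mask.
m = df['gender'] == 'M'
df.loc[m, 'income'] = (
    (2000 + 550 * df['age'] - 5.25 * df['age'] ** 2)
    * np.random.lognormal()
    * 1.035 ** (df['year'] - 2014)
)

print(df.index.is_unique)
print(df.head())
```

Building the year column with np.repeat([2014, 2018], 500) instead of a repeated Series should also avoid the duplicated labels in the first place, since the DataFrame then gets a default index.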