I have a problem calculating some date intervals: I get the error ValueError: cannot reindex from a duplicate axis. I looked at What does `ValueError: cannot reindex from a duplicate axis` mean?, but nothing there worked for me. How can I solve this?
I tried:

print(True in df.index.duplicated())
[OUT] False

# Did not work for me
#df[df.index.duplicated()]
#df = df.loc[:, ~df.columns.duplicated()]
#df = df.reset_index()
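For context on why that membership check passes: it only inspects the current RangeIndex, which is unique. A sketch of a check that instead targets the (customerId, fromDate) pairs, since fromDate is what later becomes the index inside each group (column names taken from the question's frame):

```python
import pandas as pd

df = pd.DataFrame({
    'customerId': [1, 3, 3],
    'fromDate': pd.to_datetime(['2021-02-22', '2021-02-22', '2021-02-22']),
})

# The default RangeIndex is unique, so this finds nothing:
print(df.index.duplicated().any())   # False

# But there ARE duplicate (customerId, fromDate) pairs, which is what
# matters once fromDate becomes the index within each customer group:
print(df.duplicated(['customerId', 'fromDate']).any())   # True
```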
Dataframe:

    customerId    fromDate
0            1  2021-02-22
1            1  2021-03-18
2            1  2021-03-22
3            1  2021-02-10
4            1  2021-09-07
5            1        None
6            1  2022-01-18
7            2  2021-05-17
8            3  2021-05-17
9            3  2021-07-17
10           3  2021-02-22
11           3  2021-02-22
import pandas as pd

d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18',
                  '2021-05-17', '2021-05-17', '2021-07-17',
                  '2021-02-22', '2021-02-22']}
df = pd.DataFrame(data=d)
#display(df)

# convert to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')

# sort ascending by both columns so the missing dates are added correctly
df = df.sort_values(['customerId', 'fromDate'])

# new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']

# add missing dates per customerId; rows with NaT are removed first
df = (df.dropna(subset=['fromDate'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())
[OUT]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-3f715dc564ee> in <module>()
      3     .set_index('fromDate')
      4     .groupby('customerId')['lastInteractivity']
----> 5     .apply(lambda x: x.asfreq('d'))
      6     .reset_index())

ValueError: cannot reindex from a duplicate axis
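The error comes from asfreq, which reindexes each group onto a daily DatetimeIndex, and reindexing requires the existing index to be unique; customer 3 has 2021-02-22 twice. A minimal reproduction of just that step (the exact error wording varies between pandas versions):

```python
import pandas as pd

# A series indexed by a duplicated date, like customer 3's two
# 2021-02-22 rows after set_index('fromDate')
s = pd.Series([1, 2], index=pd.to_datetime(['2021-02-22', '2021-02-22']))

try:
    # asfreq reindexes onto a daily range -> needs a unique index
    s.asfreq('d')
except ValueError as e:
    print(e)
```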
CodePudding user response:
Indeed, I arrived at the same conclusion as @ALollz in his comment: by using drop_duplicates you get the expected result:
# add missing dates per customerId; rows with NaT are removed first
df = (df.dropna(subset=['fromDate'])
        .drop_duplicates(['fromDate', 'customerId'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())
Output:

     customerId   fromDate lastInteractivity
0             1 2021-02-10          468 days
1             1 2021-02-11               NaT
2             1 2021-02-12               NaT
3             1 2021-02-13               NaT
4             1 2021-02-14               NaT
..          ...        ...               ...
485           3 2021-07-13               NaT
486           3 2021-07-14               NaT
487           3 2021-07-15               NaT
488           3 2021-07-16               NaT
489           3 2021-07-17          311 days
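Note that drop_duplicates keeps only the first row per (fromDate, customerId) pair. If the duplicates should be merged rather than discarded, an alternative is to aggregate before asfreq, so the per-customer index is unique by construction. A sketch on a cut-down frame (min is an arbitrary choice of aggregation here):

```python
import pandas as pd

df = pd.DataFrame({'customerId': [3, 3, 3],
                   'fromDate': ['2021-05-17', '2021-02-22', '2021-02-22']})
df['fromDate'] = pd.to_datetime(df['fromDate'])
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']

# Aggregate duplicate (customerId, fromDate) pairs instead of dropping
# them; asfreq then sees a unique daily index per customer.
out = (df.groupby(['customerId', 'fromDate'])['lastInteractivity']
         .min()
         .reset_index(level=0)
         .groupby('customerId')['lastInteractivity']
         .apply(lambda x: x.asfreq('d'))
         .reset_index())
print(out.head())
```

For this sample, the result covers every day from 2021-02-22 to 2021-05-17 for customer 3, with NaT on the filled-in dates, just like the drop_duplicates version.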