I have a problem calculating some date intervals: I get the error ValueError: cannot reindex from a duplicate axis. I looked at What does `ValueError: cannot reindex from a duplicate axis` mean?, but nothing there worked for me. How can I solve this?
I tried:

print(True in df.index.duplicated())
[OUT] False

# Did not work for me
#df[df.index.duplicated()]
#df = df.loc[:, ~df.columns.duplicated()]
#df = df.reset_index()
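For context on why that membership check passes: it only inspects the current RangeIndex, which is unique. A sketch of a check that instead targets the (customerId, fromDate) pairs, since fromDate is what later becomes the index inside each group (column names taken from the question's frame):

```python
import pandas as pd

df = pd.DataFrame({
    'customerId': [1, 3, 3],
    'fromDate': pd.to_datetime(['2021-02-22', '2021-02-22', '2021-02-22']),
})

# The default RangeIndex is unique, so this finds nothing:
print(df.index.duplicated().any())   # False

# But there ARE duplicate (customerId, fromDate) pairs, which is what
# matters once fromDate becomes the index within each customer group:
print(df.duplicated(['customerId', 'fromDate']).any())   # True
```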
Dataframe:

    customerId    fromDate
0            1  2021-02-22
1            1  2021-03-18
2            1  2021-03-22
3            1  2021-02-10
4            1  2021-09-07
5            1        None
6            1  2022-01-18
7            2  2021-05-17
8            3  2021-05-17
9            3  2021-07-17
10           3  2021-02-22
11           3  2021-02-22
import pandas as pd

d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18',
                  '2021-05-17', '2021-05-17', '2021-07-17',
                  '2021-02-22', '2021-02-22']}
df = pd.DataFrame(data=d)
#display(df)

# convert to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')

# sort ascending by both columns so the missing dates are added correctly
df = df.sort_values(['customerId', 'fromDate'])

# new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']

# add missing dates per customerId; rows with NaT are removed first
df = (df.dropna(subset=['fromDate'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())
[OUT]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-3f715dc564ee> in <module>()
      3     .set_index('fromDate')
      4     .groupby('customerId')['lastInteractivity']
----> 5     .apply(lambda x: x.asfreq('d'))
      6     .reset_index())

ValueError: cannot reindex from a duplicate axis
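The error comes from asfreq, which reindexes each group onto a daily DatetimeIndex, and reindexing requires the existing index to be unique; customer 3 has 2021-02-22 twice. A minimal reproduction of just that step (the exact error wording varies between pandas versions):

```python
import pandas as pd

# A series indexed by a duplicated date, like customer 3's two
# 2021-02-22 rows after set_index('fromDate')
s = pd.Series([1, 2], index=pd.to_datetime(['2021-02-22', '2021-02-22']))

try:
    # asfreq reindexes onto a daily range -> needs a unique index
    s.asfreq('d')
except ValueError as e:
    print(e)
```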
CodePudding user response:
Indeed, I arrived at the same conclusion as @ALollz in his comment: by using drop_duplicates you get the expected result:
# add missing dates per customerId; rows with NaT are removed first
df = (df.dropna(subset=['fromDate'])
        .drop_duplicates(['fromDate', 'customerId'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())
Output:

     customerId   fromDate lastInteractivity
0             1 2021-02-10          468 days
1             1 2021-02-11               NaT
2             1 2021-02-12               NaT
3             1 2021-02-13               NaT
4             1 2021-02-14               NaT
..          ...        ...               ...
485           3 2021-07-13               NaT
486           3 2021-07-14               NaT
487           3 2021-07-15               NaT
488           3 2021-07-16               NaT
489           3 2021-07-17          311 days
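Note that drop_duplicates keeps only the first row per (fromDate, customerId) pair. If the duplicates should be merged rather than discarded, an alternative is to aggregate before asfreq, so the per-customer index is unique by construction. A sketch on a cut-down frame (min is an arbitrary choice of aggregation here):

```python
import pandas as pd

df = pd.DataFrame({'customerId': [3, 3, 3],
                   'fromDate': ['2021-05-17', '2021-02-22', '2021-02-22']})
df['fromDate'] = pd.to_datetime(df['fromDate'])
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']

# Aggregate duplicate (customerId, fromDate) pairs instead of dropping
# them; asfreq then sees a unique daily index per customer.
out = (df.groupby(['customerId', 'fromDate'])['lastInteractivity']
         .min()
         .reset_index(level=0)
         .groupby('customerId')['lastInteractivity']
         .apply(lambda x: x.asfreq('d'))
         .reset_index())
print(out.head())
```

For this sample, the result covers every day from 2021-02-22 to 2021-05-17 for customer 3, with NaT on the filled-in dates, just like the drop_duplicates version.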