I'm new to pandas and would like to check whether there is a more efficient way to do what I'm trying to achieve.
I have a customer master data change log covering 12 months that looks like the sample below (the full file has 150K rows and 15 columns):
date1 | changed_by | customer | field_name |
---|---|---|---|
23/02/2021 18:59 | 101852 | 40001375 | Industry |
23/02/2021 18:59 | 101852 | 40001375 | Industry code 1 |
23/02/2021 18:59 | 101852 | 40001375 | Region |
30/12/2021 02:31 | 119186 | 30045194 | Postal Code |
30/12/2021 02:31 | 119186 | 30045194 | Industry code 1 |
30/12/2021 02:31 | 119186 | 30045194 | Region |
30/01/2021 03:31 | 109209 | 40001375 | Postal Code |
30/01/2021 03:31 | 109209 | 40001375 | Industry code 1 |
30/01/2021 03:31 | 109209 | 40001375 | Bank |
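For reference, here is a minimal, hypothetical reconstruction of this sample as a DataFrame (just for illustration; my real `df3` is read from the full 150K-row file):

```python
import pandas as pd

# Hypothetical reconstruction of the sample rows above
df3 = pd.DataFrame({
    'date1': ['23/02/2021 18:59', '23/02/2021 18:59', '23/02/2021 18:59',
              '30/12/2021 02:31', '30/12/2021 02:31', '30/12/2021 02:31',
              '30/01/2021 03:31', '30/01/2021 03:31', '30/01/2021 03:31'],
    'changed_by': [101852] * 3 + [119186] * 3 + [109209] * 3,
    'customer': [40001375] * 3 + [30045194] * 3 + [40001375] * 3,
    'field_name': ['Industry', 'Industry code 1', 'Region',
                   'Postal Code', 'Industry code 1', 'Region',
                   'Postal Code', 'Industry code 1', 'Bank'],
})
```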
I would like to check whether any customer's data was changed more than once within the period, i.e. the same customer ID appears with different dates, and then extract those lines for further analysis.
Below is the code I currently have, which seems to work OK:
pvz = df3.groupby(['date1','customer'])['field_name'].count().reset_index()
pvz
Gives me
date1 | customer | field_name |
---|---|---|
23/02/2021 18:59 | 40001375 | 3 |
30/01/2021 03:31 | 40001375 | 3 |
30/12/2021 02:31 | 30045194 | 3 |
Then I:
pvz[pvz.duplicated(subset='customer',keep=False)]
This returns the following, which seems OK:
date1 | customer | field_name |
---|---|---|
23/02/2021 18:59 | 40001375 | 3 |
30/01/2021 03:31 | 40001375 | 3 |
As a beginner I would like to ask:
- whether there are any suggestions on my code, and whether there is a more efficient way to achieve the same result
- whether I should set the date column as a datetime index. I thought of doing so, but I do not know how to handle it afterwards, or whether it would make my task any easier.
Thank you.
CodePudding user response:
You can reduce it to one line by doing the following:
- Using `as_index=False` in your `groupby` call instead of using `reset_index()`
- Passing a lambda function to `.loc`
filtered = df.groupby(['date1','customer'], as_index=False)['field_name'].count().loc[lambda x: x.duplicated(subset='customer',keep=False)]
Output:
>>> filtered
date1 customer field_name
0 23/02/2021 18:59 40001375 3
1 30/01/2021 03:31 40001375 3
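As for setting `date1` as a datetime index: converting the column with `pd.to_datetime` doesn't change the logic above, but it lets you sort and filter chronologically afterwards. A sketch, assuming `df` is your full log and the timestamps are day-first as in your sample:

```python
# Assumption: day-first timestamps such as '23/02/2021 18:59'
df['date1'] = pd.to_datetime(df['date1'], format='%d/%m/%Y %H:%M')

filtered = (
    df.groupby(['date1', 'customer'], as_index=False)['field_name']
      .count()
      .loc[lambda x: x.duplicated(subset='customer', keep=False)]
      .sort_values('date1')  # now sorts chronologically, not as text
)
```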
CodePudding user response:
One optimization you can make is to replace the `groupby` + `count` with `value_counts`:
>>> df.value_counts(['date1', 'customer']).rename('count').reset_index() \
.loc[lambda x: x.duplicated('customer', keep=False)]
date1 customer count
0 23/02/2021 18:59 40001375 3
1 30/01/2021 03:31 40001375 3
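If what you ultimately want is to extract the original log rows (not just the counts) for customers changed on more than one distinct date, a `transform`-based filter is one option (a sketch, assuming `df` is the full change log):

```python
# Flag customers whose changes span more than one distinct timestamp,
# then keep all of their original rows for further analysis.
multi_date = df.groupby('customer')['date1'].transform('nunique') > 1
details = df[multi_date]
```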