Find duplicated value with different time index


I'm new to pandas and would like to check whether there is a more efficient way to do what I'm trying to achieve.

I have a customer master data change log covering 12 months that looks like the sample below (the full file has 150K rows and 15 columns):

date1             changed_by  customer  field_name
23/02/2021 18:59  101852      40001375  Industry
23/02/2021 18:59  101852      40001375  Industry code 1
23/02/2021 18:59  101852      40001375  Region
30/12/2021 02:31  119186      30045194  Postal Code
30/12/2021 02:31  119186      30045194  Industry code 1
30/12/2021 02:31  119186      30045194  Region
30/01/2021 03:31  109209      40001375  Postal Code
30/01/2021 03:31  109209      40001375  Industry code 1
30/01/2021 03:31  109209      40001375  Bank

I would like to check whether any customer's data was changed more than once within the period, i.e. the same customer ID appears with different dates, and then extract those lines for further analysis.

Below is the code I currently have; it seems to work OK.

# Count how many fields were changed per (date1, customer) pair
pvz = df3.groupby(['date1','customer'])['field_name'].count().reset_index()
pvz

Gives me

date1             customer  field_name
23/02/2021 18:59  40001375  3
30/01/2021 03:31  40001375  3
30/12/2021 02:31  30045194  3

Then I:

# Keep every row whose customer appears under more than one date
pvz[pvz.duplicated(subset='customer',keep=False)]

This returns the below, which seems OK:

date1             customer  field_name
23/02/2021 18:59  40001375  3
30/01/2021 03:31  40001375  3

As a beginner I would like to check:

  • whether there are any suggestions on my code, and whether there is a more efficient way to do what I'm trying to achieve
  • whether setting the date column as a datetime index would help. I thought of doing this, but I do not know how to handle it afterwards, or whether it would make my task any easier (a small sketch of the idea follows below).
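
For reference, a minimal sketch of that datetime idea, assuming df3 is the frame above and date1 is day-first text as in the sample:

import pandas as pd

# Parse the day-first strings into real datetimes
# (assumption: the sample's 23/02/2021 18:59 layout holds for the whole file)
df3['date1'] = pd.to_datetime(df3['date1'], format='%d/%m/%Y %H:%M')

# A DatetimeIndex mainly buys time-based slicing and resampling,
# e.g. restricting the log to the first quarter of 2021:
q1 = df3.set_index('date1').sort_index().loc['2021-01':'2021-03']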

Thank you.

CodePudding user response:

You can reduce it to one line by doing the following:

  1. Using as_index=False in your groupby call instead of using reset_index()
  2. Passing a lambda function to .loc

filtered = (
    df.groupby(['date1','customer'], as_index=False)['field_name']
      .count()
      .loc[lambda x: x.duplicated(subset='customer', keep=False)]
)

Output:

>>> filtered
              date1  customer  field_name
0  23/02/2021 18:59  40001375           3
1  30/01/2021 03:31  40001375           3
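
Note that this returns the grouped summary rather than the original change lines the question asks to extract. A minimal sketch for pulling those rows back out of the raw log (assuming df3 is the full frame), keeping customers whose changes span more than one distinct timestamp:

# Rows for customers changed at more than one distinct date1 value
changed_more_than_once = df3[
    df3.groupby('customer')['date1'].transform('nunique') > 1
]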

CodePudding user response:

One optimization you can make is to replace the groupby-count with value_counts:

>>> df.value_counts(['date1', 'customer']).rename('count').reset_index() \
      .loc[lambda x: x.duplicated('customer', keep=False)]

              date1  customer  count
0  23/02/2021 18:59  40001375      3
1  30/01/2021 03:31  40001375      3
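
Two caveats: DataFrame.value_counts requires pandas 1.1 or later, and it sorts by count in descending order, so the row order can differ from the groupby version. If all you need is the repeat customer IDs rather than the summary rows, a sketch that skips the duplicated step entirely (same assumption that distinct date1 values mark separate changes):

# Customers whose changes occur on more than one distinct date
repeat_customers = df.groupby('customer')['date1'].nunique().loc[lambda s: s > 1]

For the sample above this keeps only 40001375.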