I'm new to pandas and would like to check whether there is a more efficient way to do what I'm trying to achieve.
I have a customer master data change log covering 12 months that looks like the sample below (the full file has 150K rows and 15 columns):
date1 | changed_by | customer | field_name |
---|---|---|---|
23/02/2021 18:59 | 101852 | 40001375 | Industry |
23/02/2021 18:59 | 101852 | 40001375 | Industry code 1 |
23/02/2021 18:59 | 101852 | 40001375 | Region |
30/12/2021 02:31 | 119186 | 30045194 | Postal Code |
30/12/2021 02:31 | 119186 | 30045194 | Industry code 1 |
30/12/2021 02:31 | 119186 | 30045194 | Region |
30/01/2021 03:31 | 109209 | 40001375 | Postal Code |
30/01/2021 03:31 | 109209 | 40001375 | Industry code 1 |
30/01/2021 03:31 | 109209 | 40001375 | Bank |
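For reference, here is a minimal, hypothetical reconstruction of this sample as a DataFrame (just for illustration; my real `df3` is read from the full 150K-row file):

```python
import pandas as pd

# Hypothetical reconstruction of the sample rows above
df3 = pd.DataFrame({
    'date1': ['23/02/2021 18:59', '23/02/2021 18:59', '23/02/2021 18:59',
              '30/12/2021 02:31', '30/12/2021 02:31', '30/12/2021 02:31',
              '30/01/2021 03:31', '30/01/2021 03:31', '30/01/2021 03:31'],
    'changed_by': [101852] * 3 + [119186] * 3 + [109209] * 3,
    'customer': [40001375] * 3 + [30045194] * 3 + [40001375] * 3,
    'field_name': ['Industry', 'Industry code 1', 'Region',
                   'Postal Code', 'Industry code 1', 'Region',
                   'Postal Code', 'Industry code 1', 'Bank'],
})
```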
I would like to check whether any customer's data was changed more than once within the period, i.e. the same customer ID appears with different dates, and then extract those lines for further analysis.
Below is the code I currently have, which seems to work OK:
pvz = df3.groupby(['date1','customer'])['field_name'].count().reset_index()
pvz
Gives me
date1 | customer | field_name |
---|---|---|
23/02/2021 18:59 | 40001375 | 3 |
30/01/2021 03:31 | 40001375 | 3 |
30/12/2021 02:31 | 30045194 | 3 |
Then I:
pvz[pvz.duplicated(subset='customer',keep=False)]
This returns the following, which seems OK:
date1 | customer | field_name |
---|---|---|
23/02/2021 18:59 | 40001375 | 3 |
30/01/2021 03:31 | 40001375 | 3 |
As a beginner I would like to ask:
- whether there are any suggestions on my code, and whether there is a more efficient way to achieve the same result
- whether I should set the date column as a datetime index. I thought of doing so, but I do not know how to handle it afterwards, or whether it would make my task any easier.
Thank you.
CodePudding user response:
You can reduce it to one line by doing the following:
- Using `as_index=False` in your `groupby` call instead of using `reset_index()`
- Passing a lambda function to `.loc`
filtered = df.groupby(['date1','customer'], as_index=False)['field_name'].count().loc[lambda x: x.duplicated(subset='customer',keep=False)]
Output:
>>> filtered
date1 customer field_name
0 23/02/2021 18:59 40001375 3
1 30/01/2021 03:31 40001375 3
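As for setting `date1` as a datetime index: converting the column with `pd.to_datetime` doesn't change the logic above, but it lets you sort and filter chronologically afterwards. A sketch, assuming `df` is your full log and the timestamps are day-first as in your sample:

```python
# Assumption: day-first timestamps such as '23/02/2021 18:59'
df['date1'] = pd.to_datetime(df['date1'], format='%d/%m/%Y %H:%M')

filtered = (
    df.groupby(['date1', 'customer'], as_index=False)['field_name']
      .count()
      .loc[lambda x: x.duplicated(subset='customer', keep=False)]
      .sort_values('date1')  # now sorts chronologically, not as text
)
```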
CodePudding user response:
One optimization you can make is to replace the `groupby` + `count` with `value_counts`:
>>> df.value_counts(['date1', 'customer']).rename('count').reset_index() \
.loc[lambda x: x.duplicated('customer', keep=False)]
date1 customer count
0 23/02/2021 18:59 40001375 3
1 30/01/2021 03:31 40001375 3
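If what you ultimately want is to extract the original log rows (not just the counts) for customers changed on more than one distinct date, a `transform`-based filter is one option (a sketch, assuming `df` is the full change log):

```python
# Flag customers whose changes span more than one distinct timestamp,
# then keep all of their original rows for further analysis.
multi_date = df.groupby('customer')['date1'].transform('nunique') > 1
details = df[multi_date]
```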