Pandas substract above row-CodePudding

Basically this is the challenge I have

Data set with time range and unique ID, what I need to do is to find if ID is duplicated in date range.

123 transaction 1/1/2021
345 transaction 1/1/2021
123 transaction 1/2/2021
123 transaction 1/20/2021

Where I want to return 1 for ID 123 because the duplicate transaction is in range of 7 days.

I can do this with Excel and I added some more date ranges depending on day for exple Wednesday range up to 6 days, Thursday 5 days, Friday 4 days range. But I have no idea how to accomplish this with pandas...

The reason why I want to do this with pandas is because each data set has up to 1M rows and it takes forever with Excel to accomplish and on top of that I need to split by category and it's just a pain to do all that manual work.

Is there any recommendations or ideas in how to accomplish that task?

CodePudding user response：

The context and data you've provided about your situation are scanty, but you can probably do something like this:

>>> df
     id         type       date
0  123  transaction 2021-01-01
1  345  transaction 2021-01-01
2  123  transaction 2021-01-02
3  123  transaction 2021-01-20

>>> dupes = df.groupby(pd.Grouper(key='date', freq='W'))['id'].apply(pd.Series.duplicated)
>>> dupes
0    False
1    False
2     True
3    False
Name: id, dtype: bool

There, item 2 (the third item) is True because 123 already occured in the past week.

CodePudding user response：

As far as I can understand the question, I think this is what you need.

from datetime import datetime
import pandas as pd

df = pd.DataFrame({
    "id": [123, 345, 123, 123],
    "name": ["transaction", "transaction", "transaction", "transaction"],
    "date": ["01/01/2021", "01/01/2021", "01/02/2021", "01/10/2021"]
})

def dates_in_range(dates):
    num_days_frame = 6
    processed_dates = sorted([datetime.strptime(date, "%m/%d/%Y") for date in dates])
    difference_in_range = any(abs(processed_dates[i] - processed_dates[i-1]).days < num_days_frame  for i in range(1, len(processed_dates)))
    return difference_in_range and 1 or 0

group = df.groupby("id")
df_new = group.apply(lambda x: dates_in_range(x["date"]))

print(df_new)
"""
print(df_new)
id
123    1
345    0
"""

Here you first group by the id such that you get all dates for that particular id in the same row.

After which a row-wise function operation is applied to the aggregated dates such that, first they are sorted and afterward checked if the difference between consecutive items is greater than the defined range. The sorting makes sure that consecutive differences will actually result in a true or false outcome if dates are close by.

Finally if any such row exists for which the difference of consecutive sorted dates are less than num_days_frame (6), we return a 1 else we return a 0.

All that being said this might not be as performant as each row is being sorted. One way to avoid that is sort the entire df first and apply the group operation to ensure sorted dates.

CodePudding user response：

The df:

df = pd.read_csv(StringIO(
    """id,trans_date
    123,1/1/2021
    345,1/1/2021
    123,1/2/2021
    123,1/20/2021
    345,1/3/2021
    """
)) # added extra record for demo
df
    id  trans_date
0   123 1/1/2021
1   345 1/1/2021
2   123 1/2/2021
3   123 1/20/2021
4   345 1/3/2021

df['trans_date'] = pd.to_datetime(df['trans_date'])

As you have to look into each of the ids separately, you can group by id and then get the maximum and minimum dates and if the difference is greater than 7, then those would be 1. Otherwise, 0.

result = df.groupby('id')['trans_date'].apply(
    lambda x: True if (x.max()-x.min()).days > 7 else False)
result
id
123     True
345    False
Name: trans_date, dtype: bool

If you just need the required ids, then

result.index[result].values
array([123])