I need to calculate the number of consecutive days from today (2022-01-04) backwards a client logged in my application. I'm having a really hard time coming up with a logic for this because my table is HUGE (can get to pentabyte). Input
Name Date
John 2022-01-01
John 2022-01-01
Mary 2022-01-01
Steve 2022-01-03
Mary 2022-01-03
John 2022-01-02
John 2022-01-03
Mary 2022-01-04
John 2022-01-04
Output
Name consecutive_days
John 4
Mary 2
Can you guys help me out please?
CodePudding user response:
########## Recreating OP's DF ##############
Name = ["John", "John", "Mary", "Steve", "Mary", "John", "John", "Mary", "John"]
Date = ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-03", "2022-01-03",
"2022-01-02", "2022-01-03", "2022-01-04", "2022-01-04"]
df = pd.DataFrame({"Name":Name, "Date":Date})
df.Date = pd.to_datetime(df.Date)
############################################
# Sort the dataframe first by name then by date
df = df.sort_values(by=["Name", "Date"]).reset_index(drop=True)
# Loop through all of the unique names and create a sub-dataframe for them
allCount = [] # empty count list
for name in df.Name.unique():
nameDF = df[df.Name == name].reset_index(drop=True)
# Check if the next day is 1 day after the previous day (consecutive)
# If it is, add to the count value
count = 1
for y in range(len(nameDF)-1):
if nameDF.Date[y 1] == nameDF.Date[y] pd.Timedelta(days=1):
count = 1
allCount.append(count) # append all count values for all names
# Make a new DF with the consecutive day count
consecutiveDF = pd.DataFrame({"Name":df.Name.unique(), "consecutive_days":allCount})
# Drop any row that does not have at least 1 consecutive day
consecutiveDF = consecutiveDF[consecutiveDF.consecutive_days > 1].reset_index(drop=True)
consecutiveDF
Output:
Name consecutive_days
0 John 4
1 Mary 2
Maybe some list comprehension instead of a nest for loop can speed it up:
# Sort the dataframe first by name then by date
df = df.sort_values(by=["Name", "Date"]).reset_index(drop=True)
# Loop through all of the unique names and create a sub-dataframe for them
allCount = [] # empty count list
for name in df.Name.unique():
nameDF = df[df.Name == name].reset_index(drop=True)
# Check if the next day is 1 day after the previous day (consecutive)
# If it is, add to the count value
days = ["day" for y in range(len(nameDF)-1) if nameDF.Date[y 1] == nameDF.Date[y] pd.Timedelta(days=1)]
allCount.append(len(days) 1)
# Make a new DF with the consecutive day count
consecutiveDF = pd.DataFrame({"Name":df.Name.unique(), "consecutive_days":allCount})
# Drop any row that does not have at least 1 consecutive day
consecutiveDF = consecutiveDF[consecutiveDF.consecutive_days > 1].reset_index(drop=True)
consecutiveDF
CodePudding user response:
Preparation
df = pd.DataFrame({
'Name':['John', 'John', 'Mary', 'Steve', 'Mary', 'John', 'John', 'Mary', 'John'],
'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-03','2022-01-03','2022-01-02','2022-01-03','2022-01-04','2022-01-04']
})
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Name','Date']).reset_index(drop=True)
df
###
Name Date
0 John 2022-01-01
1 John 2022-01-01
2 John 2022-01-02
3 John 2022-01-03
4 John 2022-01-04
5 Mary 2022-01-01
6 Mary 2022-01-03
7 Mary 2022-01-04
8 Steve 2022-01-03
check df
's continuity
consecutive_check = df.groupby(['Name'])['Date'].diff().dt.days.ne(1).cumsum()
df['consecutive_days'] = df.groupby(consecutive_check).cumcount() 1
df
###
Name Date consecutive_days
0 John 2022-01-01 1
1 John 2022-01-01 1
2 John 2022-01-02 2
3 John 2022-01-03 3
4 John 2022-01-04 4
5 Mary 2022-01-01 1
6 Mary 2022-01-03 1
7 Mary 2022-01-04 2
8 Steve 2022-01-03 1
Filtration
result = df[['Name','consecutive_days']].groupby('Name').tail(1)
result = result[result['consecutive_days'] > 1].reset_index(drop=True)
result
###
Name consecutive_days
0 John 4
1 Mary 2