Something was active for some period of time (e.g. from 01-01-2022 to 01-02-2022). I have a list of those items, every one was active for a different period of time. I want to count how many items were active every day. I know only first and last day for each of them.
df = {'name': ['a','b','c','d','e','f'], 'first_day': ['01-01-2022', '15-02-2022', '20-01-2022', '5-02-2022', '19-02-2022', '21-02-2022'], 'last_day':['09-01-2022', '20-02-2022', '27-02-2022', '27-02-2022','19-02-2022', '28-02-2022']}
Result: on 01-01-2022 active was only 1 item (a), on 02-01-2022 = 1 (a), [...], on 26-02-2022 = 3 (c, d, f), [...], on 28-02-2022 = 1 (f). I care only about the count of items and I have no idea how to do it.
CodePudding user response:
Assuming this is for pandas dataframe: create a list of days for each element, explode list into multiple rows and finally use groupby:
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e', 'f'],
'first_day': ['01-01-2022', '15-02-2022', '20-01-2022', '5-02-2022', '19-02-2022', '21-02-2022'],
'last_day': ['09-01-2022', '20-02-2022', '27-02-2022', '27-02-2022', '19-02-2022',
'28-02-2022']})
df['first'] = pd.to_datetime(df['first_day'], format="%d-%m-%Y")
df['last'] = pd.to_datetime(df['last_day'], format="%d-%m-%Y")
df['days'] = df.apply(lambda x: pd.date_range(x['first'], x['last'], freq='D').tolist(), axis=1)
result = df.explode('days').groupby('days').size()
CodePudding user response:
You can use datetime
, for example:
from datetime import datetime
df = {'name': ['a', 'b', 'c', 'd', 'e', 'f'],
'first_day': ['01-01-2022', '15-02-2022', '20-01-2022', '5-02-2022', '19-02-2022', '21-02-2022'],
'last_day': ['09-01-2022', '20-02-2022', '27-02-2022', '27-02-2022', '19-02-2022', '28-02-2022']}
first_day = [datetime.strptime(date, '%d-%m-%Y') for date in df['first_day']]
last_day = [datetime.strptime(date, '%d-%m-%Y') for date in df['last_day']]
time_delta = [last-first for first, last in zip(first_day, last_day)]
active_days = [delta.days for delta in time_delta]
print(active_days) # [8, 5, 38, 22, 0, 7]
EDIT: if you are producing the dictionary df
, format your date string as '2022-07-28'
so you can use numpy which is more efficient for big lists:
import numpy as np
df = {'name': ['a', 'b', 'c', 'd', 'e', 'f'],
'first_day': ['2022-01-01', '2022-02-15', '2022-01-20', '2022-02-05', '2022-02-19', '2022-02-21'],
'last_day': ['2022-01-09', '2022-02-20', '2022-02-27', '2022-02-27', '2022-02-19', '2022-02-28']}
first_day = np.array(df['first_day'], dtype='datetime64[D]')
last_day = np.array(df['last_day'], dtype='datetime64[D]')
active_days = last_day-first_day
print(active_days) # [ 8 5 38 22 0 7]