Count values by date in Python-CodePudding

Something was active for some period of time (e.g. from 01-01-2022 to 01-02-2022). I have a list of those items, every one was active for a different period of time. I want to count how many items were active every day. I know only first and last day for each of them.

df = {'name': ['a','b','c','d','e','f'], 'first_day': ['01-01-2022', '15-02-2022', '20-01-2022', '5-02-2022', '19-02-2022', '21-02-2022'], 'last_day':['09-01-2022', '20-02-2022', '27-02-2022', '27-02-2022','19-02-2022', '28-02-2022']}

Result: on 01-01-2022 active was only 1 item (a), on 02-01-2022 = 1 (a), [...], on 26-02-2022 = 3 (c, d, f), [...], on 28-02-2022 = 1 (f). I care only about the count of items and I have no idea how to do it.

CodePudding user response：

Assuming this is for pandas dataframe: create a list of days for each element, explode list into multiple rows and finally use groupby:

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'first_day': ['01-01-2022', '15-02-2022', '20-01-2022', '5-02-2022', '19-02-2022', '21-02-2022'],
                   'last_day': ['09-01-2022', '20-02-2022', '27-02-2022', '27-02-2022', '19-02-2022',
                                '28-02-2022']})
df['first'] = pd.to_datetime(df['first_day'], format="%d-%m-%Y")
df['last'] = pd.to_datetime(df['last_day'], format="%d-%m-%Y")
df['days'] = df.apply(lambda x: pd.date_range(x['first'], x['last'], freq='D').tolist(), axis=1)
result = df.explode('days').groupby('days').size()

CodePudding user response：

You can use datetime, for example:

from datetime import datetime

df = {'name': ['a', 'b', 'c', 'd', 'e', 'f'],
      'first_day': ['01-01-2022', '15-02-2022', '20-01-2022', '5-02-2022', '19-02-2022', '21-02-2022'],
      'last_day': ['09-01-2022', '20-02-2022', '27-02-2022', '27-02-2022', '19-02-2022', '28-02-2022']}

first_day = [datetime.strptime(date, '%d-%m-%Y') for date in df['first_day']]
last_day = [datetime.strptime(date, '%d-%m-%Y') for date in df['last_day']]
time_delta = [last-first for first, last in zip(first_day, last_day)]
active_days = [delta.days for delta in time_delta]

print(active_days) # [8, 5, 38, 22, 0, 7]

EDIT: if you are producing the dictionary df, format your date string as '2022-07-28' so you can use numpy which is more efficient for big lists:

import numpy as np

df = {'name': ['a', 'b', 'c', 'd', 'e', 'f'],
      'first_day': ['2022-01-01', '2022-02-15', '2022-01-20', '2022-02-05', '2022-02-19', '2022-02-21'],
      'last_day': ['2022-01-09', '2022-02-20', '2022-02-27', '2022-02-27', '2022-02-19', '2022-02-28']}

first_day = np.array(df['first_day'], dtype='datetime64[D]')
last_day = np.array(df['last_day'], dtype='datetime64[D]')
active_days = last_day-first_day 

print(active_days) # [ 8  5 38 22  0  7]