What I start with is a large dataframe (more than a million entires) of this structure:
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
...
1 2021-02-01 00:00:00 0 ...
2 2020-01-15 00:05:00 0 ...
2 2020-03-10 00:07:00 0 ...
...
2 2021-05-22 00:00:00 1 ...
...
There is no specific order other than a sort by id
and then datetime
. The dataset is not complete (there is not data for every day, but there can be multiple entires of the same day).
Now for each time where indicator==1
I want to collect every row with the same id
and a datetime
that is at most 10 days before. All other rows which are not in range of the indicator can be dropped. In the best case I want it to be saved as a dataset of time series which each will be later used in a Neural network. (There can be more than one indicator==1
case per id, other values
should be saved).
An example for one id: I want to convert this
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
1 2020-01-17 00:13:00 0 ...
1 2020-01-20 00:05:00 0 ...
1 2020-03-10 00:07:00 0 ...
1 2020-05-19 00:00:00 0 ...
1 2020-05-20 00:00:00 1 ...
into this
id datetime group other_values ...
1 2020-01-14 00:12:00 A ...
1 2020-01-17 00:23:00 A ...
1 2020-01-17 00:13:00 A ...
1 2020-05-19 00:00:00 B ...
1 2020-05-20 00:00:00 B ...
or a similar way to group into group A, B, ... .
A naive python for-loop is not possible due to taking ages for a dataset like this.
There is propably a clever way to use df.groupby('id')
, df.groupby('id').agg(...)
, df.sort_values(...)
or df.apply()
, but I just do not see it.
CodePudding user response:
Here is a way to do it with pd.merge_asof()
. Let's create our data:
data = {'id': [1,1,1,1,1,1,1],
'datetime': ['2020-01-14 00:12:00',
'2020-01-17 00:23:00',
'2020-01-17 00:13:00',
'2020-01-20 00:05:00',
'2020-03-10 00:07:00',
'2020-05-19 00:00:00',
'2020-05-20 00:00:00'],
'ind': [0,1,0,0,0,0,1]
}
df = pd.DataFrame(data)
df['datetime'] = df['datetime'].astype('datetime64')
Data:
id datetime ind
0 1 2020-01-14 00:12:00 0
1 1 2020-01-17 00:23:00 1
2 1 2020-01-17 00:13:00 0
3 1 2020-01-20 00:05:00 0
4 1 2020-03-10 00:07:00 0
5 1 2020-05-19 00:00:00 0
6 1 2020-05-20 00:00:00 1
Next, let's add a date to the dataset and pull all dates where the indicator is 1.
df['date'] = df['datetime'].dt.date.astype('datetime64')
df2 = df.loc[df['ind'] == 1, ['id', 'date', 'ind']].rename({'ind': 'ind2'}, axis=1)
Which gives us this:
df:
id datetime ind date
0 1 2020-01-14 00:12:00 0 2020-01-14
1 1 2020-01-17 00:23:00 1 2020-01-17
2 1 2020-01-17 00:13:00 0 2020-01-17
3 1 2020-01-20 00:05:00 0 2020-01-20
4 1 2020-03-10 00:07:00 0 2020-03-10
5 1 2020-05-19 00:00:00 0 2020-05-19
6 1 2020-05-20 00:00:00 1 2020-05-20
df2:
id date ind2
1 1 2020-01-17 1
6 1 2020-05-20 1
Now let's join them using pd.merge_asof()
with direction=forward
and a tolerance of 10 days. This will join all data up to 10 days looking forward.
df = pd.merge_asof(df.drop('ind', axis=1), df2, by='id', on='date', tolerance=pd.Timedelta('10d'), direction='forward')
Which gives us this:
id datetime ind date ind2
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0
Next, let's work on creating groups. There are three rules we want to use:
- The next value of ind2 is
NaN
- The next value of ID is not the current value of ID (we're at the last value in the group)
- The next day is 10 days greater than the current
With these rules, we can create a Boolean which we can then cumulatively sum to create our groups.
df['group_id'] = df['ind2'].eq( (df['ind2'].shift() == np.NaN)
| (df['id'].shift() != df['id'])
| (df['date'] - df['date'].shift() > pd.Timedelta('10d') )
).cumsum()
id datetime ind date ind2 group_id
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0 1
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0 1
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0 1
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN 1
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN 1
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0 2
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0 2
Now we need to drop all the NaNs
from ind2
, remove date
and we're done.
df = df.dropna(subset='ind2').drop(['date', 'ind2'], axis=1)
Final output:
id datetime ind group_id
0 1 2020-01-14 00:12:00 0 1
1 1 2020-01-17 00:23:00 1 1
2 1 2020-01-17 00:13:00 0 1
5 1 2020-05-19 00:00:00 0 2
6 1 2020-05-20 00:00:00 1 2
CodePudding user response:
I'm not aware of a way to do this with df.agg
, but you can put your for loop inside the groupby using .apply()
. That way, your comparisons/lookups can be done on smaller tables, then groupby will handle the re-concatenation:
import pandas as pd
import datetime
import uuid
df = pd.DataFrame({
"id": [1, 1, 1, 2, 2, 2],
"datetime": [
'2020-01-14 00:12:00',
'2020-01-17 00:23:00',
'2021-02-01 00:00:00',
'2020-01-15 00:05:00',
'2020-03-10 00:07:00',
'2021-05-22 00:00:00',
],
"indicator": [0, 1, 0, 0, 0, 1]
})
df.datetime = pd.to_datetime(df.datetime)
timedelta = datetime.timedelta(days=10)
def consolidate(grp):
grp['Group'] = None
for time in grp[grp.indicator == 1]['datetime']:
grp['Group'][grp['datetime'].between(time - timedelta, time)] = uuid.uuid4()
return grp.dropna(subset=['Group'])
df.groupby('id').apply(consolidate)
If there are multiple rows with indicator == 1
in each id
grouping, then the for loop will apply in index order (so a later group might overwrite an earlier group). If you can be certain that there is only one indicator == 1
in each grouping, we can simplify the consolidate
function:
def consolidate(grp):
time = grp[grp.indicator == 1]['datetime'].iloc[0]
grp = grp[grp['datetime'].between(time - timedelta, time)]
grp['Group'] = uuid.uuid4()
return grp
CodePudding user response:
One way to approach this problem is to create a new column called group that will hold the group identifier for each row. You can then iterate through the rows of the dataframe and assign a group identifier to each row based on its datetime and the datetime and indicator values of the other rows with the same id.
Here's some code that demonstrates how this could be done:
import pandas as pd
# create a new column called 'group' with a default value of None
df['group'] = None
# iterate through the rows of the dataframe
for i, row in df.iterrows():
# if the current row has an indicator value of 1
if row['indicator'] == 1:
# get the rows with the same id as the current row
rows = df[df['id'] == row['id']]
# find the datetime of the current row
datetime = row['datetime']
# find the rows with datetimes that are at most 10 days before the current row's datetime
in_range = rows[rows['datetime'] >= datetime - pd.Timedelta(days=10)]
# assign the group identifier 'A' to the rows in the range
df.loc[in_range.index, 'group'] = 'A'
# drop the rows that don't have a group identifier
df = df[df['group'].notnull()]
This code will create a new column called group that contains a group identifier for each row. The rows with the same id and a datetime within 10 days of an indicator value of 1 will be assigned the group identifier 'A', and all other rows will be dropped.
You can modify this code to handle multiple indicator values for each id, by using a different group identifier for each indicator value and changing the group column to be a categorical column rather than a string column. You can also use the apply() function to apply a custom function to each group of rows with the same id, which can make the code more concise and easier to read.