I need to add a column to a Python pandas DataFrame that contains a sequence of numbers. The numbers are mostly incremented by a fixed step, but sometimes the pattern breaks: the value is reset to a fixed number, or it increments by a much larger amount. Bottom line is I clearly can't use groupby and rank for that.
I want to create a custom function and use apply to execute it, but how do I ensure the output of each row is used in the next row without looping? Example:
10
15
20
200
205
210
215
If I have to use itertuples to loop instead, would the fastest way be to build the values in a list or Series and then convert that to a column after the looping is over (a sketch of that approach follows the sample output below)? The df row count is about 160k.
Here is a sample script of how I can loop through the rows. What I'd like to do is use the apply method (or vectorization) instead. Is it possible?
import pandas as pd
import re

# Sample (fake) function
def create_sequence(id, sub_id, previous_sequence):
    # EDIT: Added reset to 5 if sub_id == 1 after first two answers were already given
    if sub_id == 1:
        sequence = 5
    elif re.match('^4.*', id) and sub_id == 3:
        sequence = previous_sequence + 200
    elif re.match('^T.*', id) and sub_id == 6:
        sequence = previous_sequence + 60
    else:
        sequence = previous_sequence + 5
    return sequence
df = pd.DataFrame([
    {'main_id': '4G5KW5', 'irrelevant_field': 'WT', 'sub_id': 1},
    {'main_id': '4G5KW5', 'irrelevant_field': 'WR', 'sub_id': 2},
    {'main_id': '4G5KW5', 'irrelevant_field': 'WR', 'sub_id': 3},
    {'main_id': '4G5KW5', 'irrelevant_field': 'WG', 'sub_id': 4},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 1},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 2},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WG', 'sub_id': 3},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 4},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 5},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 6},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 7},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 8},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WR', 'sub_id': 9}
])
# Ideally I'd like to do something like this, although I know create_sequence
# would have to be modified to accept only the row as an argument:
# df['sequence'] = df.apply(create_sequence, axis=1)
# Loop approach:
seq_df = pd.DataFrame(columns=['main_id', 'sub_id', 'sequence'])
prev_sequence = 0
for row in df.itertuples():
    main_id = row[1]
    sub_id = row[3]
    sequence = create_sequence(main_id, sub_id, prev_sequence)
    print(main_id, sub_id, sequence)
    seq_df.loc[len(seq_df.index)] = [main_id, sub_id, sequence]
    prev_sequence = sequence

final_df = pd.merge(df.astype(str), seq_df.astype(str), how='left', on=['main_id', 'sub_id'])
print(final_df)
Final results:
main_id irrelevant_field sub_id sequence
0 4G5KW5 WT 1 5
1 4G5KW5 WR 2 10
2 4G5KW5 WR 3 210
3 4G5KW5 WG 4 215
4 T3OFB2 WT 1 5
5 T3OFB2 WT 2 10
6 T3OFB2 WG 3 15
7 T3OFB2 WT 4 20
8 T3OFB2 WS 5 25
9 T3OFB2 WS 6 85
10 T3OFB2 WS 7 90
11 T3OFB2 WT 8 95
12 T3OFB2 WR 9 100
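For reference, the list-based variant I mentioned above could look something like this (a sketch reusing df and create_sequence from the script; I haven't benchmarked it, but appending to a plain Python list and assigning it once should beat growing a DataFrame row by row):
# List variant: accumulate in a plain list, assign as a column once at the end
sequences = []
prev_sequence = 0
for row in df.itertuples():
    prev_sequence = create_sequence(row.main_id, row.sub_id, prev_sequence)
    sequences.append(prev_sequence)
df['sequence'] = sequences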
CodePudding user response:
I don't know if it is the fastest, but maybe fast enough? Multiply the list and slice it so it matches the length of the df:
import pandas as pd
import numpy as np
from datetime import datetime as dt
df = pd.DataFrame(np.random.randint(0,100,(160000,2)), columns=['a','b'])
start = dt.now()
seq = [10, 15, 20, 200, 205, 210, 215]
df['sequence'] = (seq * int(np.ceil(df.shape[0]/len(seq))))[:df.shape[0]]
stop = dt.now()
print((stop-start).microseconds)
16448
CodePudding user response:
To get the previous value of sub_id, use shift – it shifts the whole column down by one row:
>>> df['prev_sub_id'] = df.sub_id.shift(fill_value=0)
>>> df.head()
main_id irrelevant_field sub_id prev_sub_id
0 4G5KW5 WT 1 0
1 4G5KW5 WR 2 1
2 4G5KW5 WR 3 2
3 4G5KW5 WG 4 3
4 T3OFB2 WT 1 4
But it seems you don't really need that here.
Instead, you can use a cumulative sum. Set 5 as the default value to add per row, then update the rows whose increment is different:
>>> df['sequence'] = 5
>>> df.loc[df.main_id.str.startswith('4') & df.sub_id.eq(3), 'sequence'] = 200
>>> df.loc[df.main_id.str.startswith('T') & df.sub_id.eq(6), 'sequence'] = 60
>>> df.head()
main_id irrelevant_field sub_id prev_sub_id sequence
0 4G5KW5 WT 1 0 5
1 4G5KW5 WR 2 1 5
2 4G5KW5 WR 3 2 200
3 4G5KW5 WG 4 3 5
4 T3OFB2 WT 1 4 5
Finally, add them up with cumsum:
>>> df['sequence'] = df.sequence.cumsum()
>>> df
main_id irrelevant_field sub_id sequence
0 4G5KW5 WT 1 5
1 4G5KW5 WR 2 10
2 4G5KW5 WR 3 210
3 4G5KW5 WG 4 215
4 T3OFB2 WT 1 220
5 T3OFB2 WT 2 225
6 T3OFB2 WG 3 230
7 T3OFB2 WT 4 235
8 T3OFB2 WS 5 240
9 T3OFB2 WS 6 300
10 T3OFB2 WS 7 305
11 T3OFB2 WT 8 310
12 T3OFB2 WR 9 315
Edit: Just noted the result is off by 5. To fix that you can do df.loc[0, 'sequence'] = 10 to set the first value before cumsum, or simply do df.sequence += 5 at the end.
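Putting it all together as a self-contained script (a sketch that folds the + 5 fix into the last line; the DataFrame here is a trimmed reconstruction of the question's, without irrelevant_field):
import pandas as pd

df = pd.DataFrame({
    'main_id': ['4G5KW5'] * 4 + ['T3OFB2'] * 9,
    'sub_id': list(range(1, 5)) + list(range(1, 10)),
})

# Per-row increment: 5 by default, overridden for the two special cases
df['sequence'] = 5
df.loc[df.main_id.str.startswith('4') & df.sub_id.eq(3), 'sequence'] = 200
df.loc[df.main_id.str.startswith('T') & df.sub_id.eq(6), 'sequence'] = 60

# Running total of the increments; the trailing + 5 applies the off-by-5 correction
df['sequence'] = df.sequence.cumsum() + 5
print(df)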