I need to add a column to a Python pandas DataFrame that contains a sequence of numbers. The numbers are mostly incremented by a fixed step, but sometimes the pattern breaks: the value is reset to a fixed number, or it increments by a much larger amount. Bottom line is I clearly can't use groupby and rank for that.
I want to create a custom function and use apply to execute it, but how do I ensure the output of each row is used in the next row without looping? Example:
10
15
20
200
205
210
215
If I have to use itertuples to loop instead, would the fastest way be to build the values in a list or Series and then convert that to a column after the looping is over (a sketch of that approach follows the sample output below)? The df row count is about 160k.
Here is a sample script of how I can loop through the rows. What I'd like to do is use the apply method (or vectorization) instead. Is it possible?
import pandas as pd
import re

# Sample (fake) function
def create_sequence(id, sub_id, previous_sequence):
    # EDIT: Added reset to 5 if sub_id == 1 after first two answers were already given
    if sub_id == 1:
        sequence = 5
    elif re.match('^4.*', id) and sub_id == 3:
        sequence = previous_sequence + 200
    elif re.match('^T.*', id) and sub_id == 6:
        sequence = previous_sequence + 60
    else:
        sequence = previous_sequence + 5
    return sequence
df = pd.DataFrame([
    {'main_id': '4G5KW5', 'irrelevant_field': 'WT', 'sub_id': 1},
    {'main_id': '4G5KW5', 'irrelevant_field': 'WR', 'sub_id': 2},
    {'main_id': '4G5KW5', 'irrelevant_field': 'WR', 'sub_id': 3},
    {'main_id': '4G5KW5', 'irrelevant_field': 'WG', 'sub_id': 4},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 1},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 2},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WG', 'sub_id': 3},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 4},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 5},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 6},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 7},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 8},
    {'main_id': 'T3OFB2', 'irrelevant_field': 'WR', 'sub_id': 9}
])
# Ideally I'd like to do something like this, although I know create_sequence
# would have to be modified to accept only the row as an argument:
# df['sequence'] = df.apply(create_sequence, axis=1)
# Loop approach:
seq_df = pd.DataFrame(columns=['main_id', 'sub_id', 'sequence'])
prev_sequence = 0
for row in df.itertuples():
    main_id = row[1]
    sub_id = row[3]
    sequence = create_sequence(main_id, sub_id, prev_sequence)
    print(main_id, sub_id, sequence)
    seq_df.loc[len(seq_df.index)] = [main_id, sub_id, sequence]
    prev_sequence = sequence

final_df = pd.merge(df.astype(str), seq_df.astype(str), how='left', on=['main_id', 'sub_id'])
print(final_df)
Final results:
main_id irrelevant_field sub_id sequence
0 4G5KW5 WT 1 5
1 4G5KW5 WR 2 10
2 4G5KW5 WR 3 210
3 4G5KW5 WG 4 215
4 T3OFB2 WT 1 5
5 T3OFB2 WT 2 10
6 T3OFB2 WG 3 15
7 T3OFB2 WT 4 20
8 T3OFB2 WS 5 25
9 T3OFB2 WS 6 85
10 T3OFB2 WS 7 90
11 T3OFB2 WT 8 95
12 T3OFB2 WR 9 100
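For reference, the list-based variant I mentioned above could look something like this (a sketch reusing df and create_sequence from the script; I haven't benchmarked it, but appending to a plain Python list and assigning it once should beat growing a DataFrame row by row):
# List variant: accumulate in a plain list, assign as a column once at the end
sequences = []
prev_sequence = 0
for row in df.itertuples():
    prev_sequence = create_sequence(row.main_id, row.sub_id, prev_sequence)
    sequences.append(prev_sequence)
df['sequence'] = sequences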
CodePudding user response:
I don't know if it is the fastest, but maybe fast enough? Multiply the list and slice it so it matches the length of the df:
import pandas as pd
import numpy as np
from datetime import datetime as dt
df = pd.DataFrame(np.random.randint(0,100,(160000,2)), columns=['a','b'])
start = dt.now()
seq = [10, 15, 20, 200, 205, 210, 215]
df['sequence'] = (seq * int(np.ceil(df.shape[0]/len(seq))))[:df.shape[0]]
stop = dt.now()
print((stop-start).microseconds)
16448
CodePudding user response:
To get the previous value of sub_id, use shift – it shifts the whole column down by one row:
>>> df['prev_sub_id'] = df.sub_id.shift(fill_value=0)
>>> df.head()
main_id irrelevant_field sub_id prev_sub_id
0 4G5KW5 WT 1 0
1 4G5KW5 WR 2 1
2 4G5KW5 WR 3 2
3 4G5KW5 WG 4 3
4 T3OFB2 WT 1 4
But it seems you don't really need that here.
Instead, you can use a cumulative sum. Set 5 as the default value to add per row, then update the rows whose increment is different:
>>> df['sequence'] = 5
>>> df.loc[df.main_id.str.startswith('4') & df.sub_id.eq(3), 'sequence'] = 200
>>> df.loc[df.main_id.str.startswith('T') & df.sub_id.eq(6), 'sequence'] = 60
>>> df.head()
main_id irrelevant_field sub_id prev_sub_id sequence
0 4G5KW5 WT 1 0 5
1 4G5KW5 WR 2 1 5
2 4G5KW5 WR 3 2 200
3 4G5KW5 WG 4 3 5
4 T3OFB2 WT 1 4 5
Finally, add them up with cumsum:
>>> df['sequence'] = df.sequence.cumsum()
>>> df
main_id irrelevant_field sub_id sequence
0 4G5KW5 WT 1 5
1 4G5KW5 WR 2 10
2 4G5KW5 WR 3 210
3 4G5KW5 WG 4 215
4 T3OFB2 WT 1 220
5 T3OFB2 WT 2 225
6 T3OFB2 WG 3 230
7 T3OFB2 WT 4 235
8 T3OFB2 WS 5 240
9 T3OFB2 WS 6 300
10 T3OFB2 WS 7 305
11 T3OFB2 WT 8 310
12 T3OFB2 WR 9 315
Edit: Just noted the result is off by 5. To fix that you can do df.loc[0, 'sequence'] = 10 to set the first value before cumsum, or simply do df.sequence += 5 at the end.
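Putting it all together as a self-contained script (a sketch that folds the + 5 fix into the last line; the DataFrame here is a trimmed reconstruction of the question's, without irrelevant_field):
import pandas as pd

df = pd.DataFrame({
    'main_id': ['4G5KW5'] * 4 + ['T3OFB2'] * 9,
    'sub_id': list(range(1, 5)) + list(range(1, 10)),
})

# Per-row increment: 5 by default, overridden for the two special cases
df['sequence'] = 5
df.loc[df.main_id.str.startswith('4') & df.sub_id.eq(3), 'sequence'] = 200
df.loc[df.main_id.str.startswith('T') & df.sub_id.eq(6), 'sequence'] = 60

# Running total of the increments; the trailing + 5 applies the off-by-5 correction
df['sequence'] = df.sequence.cumsum() + 5
print(df)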