DataFrame df has three columns:
no
has a single int value 1-28 in increasing order, until the next group starts (the dataframe is presorted). Some values are missing (e.g. 1, 2, 3, 5, 7, 1, 2, 3, 28, 1, ...).
intensity
has a single float value 0.0-1.0.
target
has a single numpy array with indexes 0-27, which should collect all intensities of the group. The intensity with no 1 should go to index 0 of this numpy array, the intensity with no 2 to index 1, and so on. The array is filled with 0.0 at first.
index | no | intensity | target |
---|---|---|---|
0 | 2 | 0.027350 | np.zeros(28) |
1 | 4 | 0.074639 | np.zeros(28) |
2 | 5 | 0.056452 | np.zeros(28) |
3 | 2 | 0.011223 | np.zeros(28) |
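For reference, the example frame above can be built like this (a minimal runnable sketch; the values come straight from the table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "no": [2, 4, 5, 2],
    "intensity": [0.027350, 0.074639, 0.056452, 0.011223],
})
# Every row starts with its own all-zero, length-28 target array
df["target"] = [np.zeros(28) for _ in range(len(df))]
```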
The result should be: for any row x, the numpy array in df['target'][x] should be rewritten so that the 0.0 at position df['no'][x] - 1 is replaced with df['intensity'][x]. E.g. if df['no'][0] = 2 and df['intensity'][0] = 0.027350, then df['target'][0] = [0.0, 0.027350, 0.0, ...]. If a no value does not occur in the group, its intensity should stay 0.0. The target column from the example table above should then collapse to one array per group:
array([[0.0, 0.027350, 0.0, 0.074639, 0.056452, 0.0 (at indexes 5-27)],
       [0.0, 0.011223, 0.0 (at indexes 2-27)]])
I got this result using a for loop:
for i in df.index:
    df['target'][i][df['no'][i] - 1] = df['intensity'][i]
    if i >= df.index.max():
        break
    if df['no'][i] < df['no'][i + 1]:
        df['target'][i + 1] = df['target'][i] + df['target'][i + 1]
        df.drop(i, inplace=True)
It's too slow though, because I have millions of rows. Is there a way to use .apply() function or anything else to speed up the process?
CodePudding user response:
You do not need to modify the target column row by row. You can do one operation per group instead (note that no is 1-based and spans 1-28, so the array needs 28 slots and an offset of one):
df['group'] = (df['no'].shift(1) > df['no']).cumsum()

def map_intensity(arr):
    new_arr = np.zeros(28)                                    # indexes 0-27
    new_arr[arr['no'].values - 1] = arr['intensity'].values   # 'no' is 1-based
    return new_arr

df.groupby('group').apply(map_intensity)
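Applied to the four-row example from the question, this produces one length-28 array per group (a runnable sketch; note that no is 1-based while the array is 0-indexed, so the buffer needs 28 slots and the scatter an offset of one):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"no": [2, 4, 5, 2],
                   "intensity": [0.027350, 0.074639, 0.056452, 0.011223]})
# A new group starts wherever 'no' stops increasing
df["group"] = (df["no"].shift(1) > df["no"]).cumsum()

def map_intensity(arr):
    new_arr = np.zeros(28)                                    # indexes 0-27
    new_arr[arr["no"].values - 1] = arr["intensity"].values   # 'no' is 1-based
    return new_arr

out = df.groupby("group").apply(map_intensity)
# out is a Series with one numpy array per group
```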
CodePudding user response:
# Label your groups:
df['groups'] = df.no.lt(df.no.shift()).cumsum()
# Create a multi-index of your groups and desired range:
mux = pd.MultiIndex.from_product([df['groups'].unique(), np.arange(1,29)], names=('groups','no'))
out = (df.set_index(['groups', 'no'])        # Match this new index,
         .reindex(mux, fill_value=0)         # reindex with it (filling with 0's!),
         .groupby(level='groups')['intensity']  # group by 'groups', taking intensity,
         .apply(np.array))                   # and make each group a numpy array.
print(out)
Output:
groups
0 [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
1 [0.0, 0.011223, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: intensity, dtype: object
Now, if we want this in every row:
df = df.set_index('groups')
df['target'] = out
df = df.reset_index(drop=True)
print(df)
Output:
no intensity target
0 2 0.027350 [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
1 4 0.074639 [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
2 5 0.056452 [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
3 2 0.011223 [0.0, 0.011223, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
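For millions of rows, even the groupby/apply routes above pay Python-function-call overhead per group. As an alternative sketch (not from either answer, just the same idea expressed with NumPy fancy indexing), the whole scatter can be done without any Python-level loop:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"no": [2, 4, 5, 2],
                   "intensity": [0.027350, 0.074639, 0.056452, 0.011223]})

# Same group labels as in the answers above
groups = df["no"].lt(df["no"].shift()).cumsum().to_numpy()

# One length-28 row per group; scatter each intensity to column no-1
targets = np.zeros((groups.max() + 1, 28))
targets[groups, df["no"].to_numpy() - 1] = df["intensity"].to_numpy()
```

If duplicate no values could occur within one group, `np.add.at(targets, (groups, df["no"].to_numpy() - 1), ...)` would accumulate them instead of keeping only the last.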