How to rewrite a for loop to get a solution faster in pandas?


DataFrame df:

  • The no column holds a single int value from 1 to 28, increasing until the next group starts (the DataFrame is presorted). Some values may be missing (e.g. 1, 2, 3, 5, 7, 1, 2, 3, 28, 1, ...).
  • The intensity column holds a single float value from 0.0 to 1.0.
  • The target column holds a numpy array with indexes 0-27 that should collect all intensities of the group: the intensity with no 1 belongs at index 0 of this array, the intensity with no 2 at index 1, and so on. The array is initially filled with 0.0.
index  no  intensity        target
0       2   0.027350  np.zeros(28)
1       4   0.074639  np.zeros(28)
2       5   0.056452  np.zeros(28)
3       2   0.011223  np.zeros(28)
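
For reference, a minimal sketch that builds this example frame (the values are taken from the table above; the 28-slot zero arrays follow the description of target):

import numpy as np
import pandas as pd

# Example frame from the table above; every row starts with its own
# 28-slot zero array as its target
df = pd.DataFrame({
    'no': [2, 4, 5, 2],
    'intensity': [0.027350, 0.074639, 0.056452, 0.011223],
})
df['target'] = [np.zeros(28) for _ in range(len(df))]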

The result should be:

Say we look at row x: the numpy array in df['target'] should be updated so that the 0.0 at position (df['no'] - 1)[x] is replaced by the value df['intensity'][x].

E.g. if df['no'][0] = 2 and df['intensity'][0] = 0.027350, then df['target'][0] = [0.0, 0.027350, 0.0, ...]. If a no value never occurs in the group, its intensity stays 0.0.

target column from example table above should look like this:

array([[0.0, 0.027350, 0.0, 0.074639, 0.056452, ... (0.0 at indexes 5-27)],
       [0.0, 0.011223, ... (0.0 at indexes 2-27)]])

I got this result using a for loop:

for i in df.index:
    # Write this row's intensity into its slot of the target array
    df['target'][i][df['no'][i] - 1] = df['intensity'][i]
    if i >= df.index.max():
        break
    # Still inside the same group: merge this row's target into the
    # next row's target and drop the current row
    if df['no'][i] < df['no'][i + 1]:
        df['target'][i + 1] = df['target'][i] + df['target'][i + 1]
        df.drop(i, inplace=True)
        df.drop(i, inplace=True)

It's too slow, though, because I have millions of rows. Is there a way to use the .apply() function, or anything else, to speed up the process?

CodePudding user response:

You do not need the pre-filled target column or row-by-row modification. You can build each group's array in one operation:

import numpy as np

# A new group starts wherever 'no' decreases
df['group'] = (df['no'].shift(1) > df['no']).cumsum()

def map_intensity(grp):
    # One 28-slot array per group; slot no - 1 holds that row's intensity
    new_arr = np.zeros(28)
    new_arr[grp['no'].values - 1] = grp['intensity'].values
    return new_arr

targets = df.groupby('group').apply(map_intensity)
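
To attach each group's array back onto every row (matching the desired layout from the question), one option is to map the group labels through this Series:

# Every row of a group receives that group's assembled array
df['target'] = df['group'].map(targets)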

CodePudding user response:

import numpy as np
import pandas as pd

# Label your groups:
df['groups'] = df.no.lt(df.no.shift()).cumsum()

# Create a multi-index of your groups and desired range:
mux = pd.MultiIndex.from_product([df['groups'].unique(), np.arange(1,29)], names=('groups','no'))
 
out = (df.set_index(['groups', 'no'])           # Match this new index,
         .reindex(mux, fill_value=0)            # reindex with it (filling with 0's!),
         .groupby(level='groups')['intensity']  # group by 'groups', taking intensity,
         .apply(np.array))                      # and make each group a numpy array.
print(out)

Output:

groups
0    [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
1    [0.0, 0.011223, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: intensity, dtype: object

Now, if we want this in every row:

df = df.set_index('groups')
df['target'] = out
df = df.reset_index(drop=True)
print(df)

Output:

   no  intensity                                             target
0   2   0.027350  [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
1   4   0.074639  [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
2   5   0.056452  [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
3   2   0.011223  [0.0, 0.011223, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
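
If the groupby is still too slow at millions of rows, a fully vectorized variant of the same idea is possible with plain numpy fancy indexing. This is only a sketch under the question's assumptions (no values 1-28, presorted frame), not part of the answers above:

import numpy as np

# A new group starts wherever 'no' stops increasing
groups = np.concatenate(([0], np.cumsum(np.diff(df['no'].values) <= 0)))

# One 28-slot row per group, filled in a single fancy-indexing assignment
mat = np.zeros((groups[-1] + 1, 28))
mat[groups, df['no'].values - 1] = df['intensity'].values

# Optional: give every original row a copy of its group's array
df['target'] = list(mat[groups])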