DataFrame df has three columns:
no
has a single int value 1-28 in increasing order, until the next group starts (the dataframe is presorted). Some values are missing (e.g. 1, 2, 3, 5, 7, 1, 2, 3, 28, 1, ...).
intensity
has a single float value 0.0-1.0.
target
has a single numpy array with indexes 0-27, which should collect all intensities of the group. The intensity with no 1 should go to index 0 of this numpy array, the intensity with no 2 to index 1, and so on. The array is filled with 0.0 at first.
index | no | intensity | target |
---|---|---|---|
0 | 2 | 0.027350 | np.zeros(28) |
1 | 4 | 0.074639 | np.zeros(28) |
2 | 5 | 0.056452 | np.zeros(28) |
3 | 2 | 0.011223 | np.zeros(28) |
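For reference, the example frame above can be built like this (a minimal runnable sketch; the values come straight from the table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "no": [2, 4, 5, 2],
    "intensity": [0.027350, 0.074639, 0.056452, 0.011223],
})
# Every row starts with its own all-zero, length-28 target array
df["target"] = [np.zeros(28) for _ in range(len(df))]
```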
The result should be: for any row x, the numpy array in df['target'][x] should be rewritten so that the 0.0 at position df['no'][x] - 1 is replaced with df['intensity'][x]. E.g. if df['no'][0] = 2 and df['intensity'][0] = 0.027350, then df['target'][0] = [0.0, 0.027350, 0.0, ...]. If a no value does not occur in the group, its intensity should stay 0.0. The target column from the example table above should then collapse to one array per group:
array([[0.0, 0.027350, 0.0, 0.074639, 0.056452, 0.0 (at indexes 5-27)],
       [0.0, 0.011223, 0.0 (at indexes 2-27)]])
I got this result using a for loop:
for i in df.index:
    df['target'][i][df['no'][i] - 1] = df['intensity'][i]
    if i >= df.index.max():
        break
    if df['no'][i] < df['no'][i + 1]:
        df['target'][i + 1] = df['target'][i] + df['target'][i + 1]
        df.drop(i, inplace=True)
It's too slow though, because I have millions of rows. Is there a way to use .apply() function or anything else to speed up the process?
CodePudding user response:
You do not need to modify the target column row by row. You can do one operation per group instead (note that no is 1-based and spans 1-28, so the array needs 28 slots and an offset of one):
df['group'] = (df['no'].shift(1) > df['no']).cumsum()

def map_intensity(arr):
    new_arr = np.zeros(28)                                    # indexes 0-27
    new_arr[arr['no'].values - 1] = arr['intensity'].values   # 'no' is 1-based
    return new_arr

df.groupby('group').apply(map_intensity)
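Applied to the four-row example from the question, this produces one length-28 array per group (a runnable sketch; note that no is 1-based while the array is 0-indexed, so the buffer needs 28 slots and the scatter an offset of one):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"no": [2, 4, 5, 2],
                   "intensity": [0.027350, 0.074639, 0.056452, 0.011223]})
# A new group starts wherever 'no' stops increasing
df["group"] = (df["no"].shift(1) > df["no"]).cumsum()

def map_intensity(arr):
    new_arr = np.zeros(28)                                    # indexes 0-27
    new_arr[arr["no"].values - 1] = arr["intensity"].values   # 'no' is 1-based
    return new_arr

out = df.groupby("group").apply(map_intensity)
# out is a Series with one numpy array per group
```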
CodePudding user response:
# Label your groups:
df['groups'] = df.no.lt(df.no.shift()).cumsum()
# Create a multi-index of your groups and desired range:
mux = pd.MultiIndex.from_product([df['groups'].unique(), np.arange(1,29)], names=('groups','no'))
out = (df.set_index(['groups', 'no'])        # Match this new index,
         .reindex(mux, fill_value=0)         # reindex with it (filling with 0's!),
         .groupby(level='groups')['intensity']  # group by 'groups', taking intensity,
         .apply(np.array))                   # and make each group a numpy array.
print(out)
Output:
groups
0 [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
1 [0.0, 0.011223, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: intensity, dtype: object
Now, if we want this in every row:
df = df.set_index('groups')
df['target'] = out
df = df.reset_index(drop=True)
print(df)
Output:
no intensity target
0 2 0.027350 [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
1 4 0.074639 [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
2 5 0.056452 [0.0, 0.02735, 0.0, 0.074639, 0.056452, 0.0, 0...
3 2 0.011223 [0.0, 0.011223, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
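For millions of rows, even the groupby/apply routes above pay Python-function-call overhead per group. As an alternative sketch (not from either answer, just the same idea expressed with NumPy fancy indexing), the whole scatter can be done without any Python-level loop:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"no": [2, 4, 5, 2],
                   "intensity": [0.027350, 0.074639, 0.056452, 0.011223]})

# Same group labels as in the answers above
groups = df["no"].lt(df["no"].shift()).cumsum().to_numpy()

# One length-28 row per group; scatter each intensity to column no-1
targets = np.zeros((groups.max() + 1, 28))
targets[groups, df["no"].to_numpy() - 1] = df["intensity"].to_numpy()
```

If duplicate no values could occur within one group, `np.add.at(targets, (groups, df["no"].to_numpy() - 1), ...)` would accumulate them instead of keeping only the last.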