Say I have a Pandas dataframe with 4 rows and 5 columns. For simplicity I will convert it into a Numpy array, which looks like this:
import numpy as np
A = np.array([[23, 43, 23, 110, 5],
[83, 32, 12, 123, 4],
[58, 41, 59, 189, 1],
[93, 77, 22, 170, 3]])
For each row, I want to insert some similar rows right after the row, with only the column 4 decreasing by 1 each time until 0. The expected output should look like this:
np.array([[23, 43, 23, 110, 5],
[23, 43, 23, 110, 4],
[23, 43, 23, 110, 3],
[23, 43, 23, 110, 2],
[23, 43, 23, 110, 1],
[23, 43, 23, 110, 0],
[83, 32, 12, 123, 4],
[83, 32, 12, 123, 3],
[83, 32, 12, 123, 2],
[83, 32, 12, 123, 1],
[83, 32, 12, 123, 0],
[58, 41, 59, 189, 1],
[58, 41, 59, 189, 0],
[93, 77, 22, 170, 3],
[93, 77, 22, 170, 2],
[93, 77, 22, 170, 1],
[93, 77, 22, 170, 0]])
Below is the code that I have come up with:
new_rows = []
for i, row in enumerate(A):
new = A[i, 4] - 1
while new >= 0:
new_row = row.copy()
new_row[4] = new
new_rows.append(new_row)
new -= 1
new_A = np.vstack([A, np.array(new_rows)])
print(new_A)
Output
[[ 23 43 23 110 5]
[ 83 32 12 123 4]
[ 58 41 59 189 1]
[ 93 77 22 170 3]
[ 23 43 23 110 4]
[ 23 43 23 110 3]
[ 23 43 23 110 2]
[ 23 43 23 110 1]
[ 23 43 23 110 0]
[ 83 32 12 123 3]
[ 83 32 12 123 2]
[ 83 32 12 123 1]
[ 83 32 12 123 0]
[ 58 41 59 189 0]
[ 93 77 22 170 2]
[ 93 77 22 170 1]
[ 93 77 22 170 0]]
Obviously, the code is not efficient since it doesn't use any Numpy vectorization. In reality, I have more than 4,000 original rows so a speed-up is definitely needed. Moreover, I cannot insert new rows right after each row. Is there any efficient way to do this in Numpy or Pandas?
CodePudding user response:
If you are interested in a solution not using any pure-Python loops but Numba instead, then here is one:
import numba as nb
@nb.njit('int32[:,::1](int32[:,::1])')
def compute(A):
n, m = A.shape
rows = A[:,-1].sum() n
res = np.empty((rows, m), dtype=np.int32)
row = 0
for i in range(n):
count = A[i, -1]
for j in range(count 1):
res[row j, 0:m-1] = A[i, 0:m-1]
res[row j, m-1] = count-j
row = count 1
return res
result = compute(A)
This solution is 12 times faster than the solution of @sammywemmy on my machine although A
is very small. It should be even faster on bigger inputs.
CodePudding user response:
arr = A[:, -1] 1
temp = np.repeat(A, arr, axis = 0)
# depending on your array size
# you can build the range here with a much faster implementation
from this link : # https://stackoverflow.com/a/47126435/7175713
arr = np.concatenate([np.arange(ent) for ent in arr])
temp[:, -1] = temp[:, -1] - arr
temp
array([[ 23, 43, 23, 110, 5],
[ 23, 43, 23, 110, 4],
[ 23, 43, 23, 110, 3],
[ 23, 43, 23, 110, 2],
[ 23, 43, 23, 110, 1],
[ 23, 43, 23, 110, 0],
[ 83, 32, 12, 123, 4],
[ 83, 32, 12, 123, 3],
[ 83, 32, 12, 123, 2],
[ 83, 32, 12, 123, 1],
[ 83, 32, 12, 123, 0],
[ 58, 41, 59, 189, 1],
[ 58, 41, 59, 189, 0],
[ 93, 77, 22, 170, 3],
[ 93, 77, 22, 170, 2],
[ 93, 77, 22, 170, 1],
[ 93, 77, 22, 170, 0]])
CodePudding user response:
I've finally figured out a solution:
rep = np.repeat(A, A[:, -1] 1, axis=0)
rep[:, -1] = np.concatenate([np.arange(0, n 1)[::-1] for n in A[:, -1]])
Output:
>>> rep
array([[ 23, 43, 23, 110, 5],
[ 23, 43, 23, 110, 4],
[ 23, 43, 23, 110, 3],
[ 23, 43, 23, 110, 2],
[ 23, 43, 23, 110, 1],
[ 23, 43, 23, 110, 0],
[ 83, 32, 12, 123, 4],
[ 83, 32, 12, 123, 3],
[ 83, 32, 12, 123, 2],
[ 83, 32, 12, 123, 1],
[ 83, 32, 12, 123, 0],
[ 58, 41, 59, 189, 1],
[ 58, 41, 59, 189, 0],
[ 93, 77, 22, 170, 3],
[ 93, 77, 22, 170, 2],
[ 93, 77, 22, 170, 1],
[ 93, 77, 22, 170, 0]])