I have two different ndarrays that look like this:
rec_np = np.array([[1, 4], [2, 4], [6, 1], [7, 3]])
test_np = np.array([[2, 5], [3, 1], [11, 3]])
print(rec_np)
print(test_np)
Output:
[[1 4]
[2 4]
[6 1]
[7 3]]
[[ 2 5]
[ 3 1]
[11 3]]
I need to combine them based on the first column. The result should become:
[[1 4 nan]
[2 4 5]
[3 nan 1]
[6 1 nan]
[7 3 nan]
[11 nan 3]]
Does anyone have a suggestion that doesn't require converting to a pandas DataFrame first?
CodePudding user response:
pandas may be the easiest; numpy does not have such a merge tool (that I know of). But here's an approach that works (at least for your test case).
In [101]: rec_np = np.array([[1, 4], [2, 4], [6, 1], [7, 3]])
...: test_np = np.array([[2, 5], [3, 1], [11, 3]])
In [102]: rec_np, test_np
Out[102]:
(array([[1, 4],
[2, 4],
[6, 1],
[7, 3]]),
array([[ 2, 5],
[ 3, 1],
[11, 3]]))
First find the unique indices:
In [103]: firstcol = np.hstack((rec_np[:,0],test_np[:,0]))
In [104]: u = np.unique(firstcol)
In [105]: u
Out[105]: array([ 1, 2, 3, 6, 7, 11])
Make a recipient array with the right size:
In [106]: res = np.zeros((len(u),3),float)
Insert the indices and fillers:
In [107]: res[:,0] = u
In [108]: res[:,[1,2]] = np.nan
In [109]: res
Out[109]:
array([[ 1., nan, nan],
[ 2., nan, nan],
[ 3., nan, nan],
[ 6., nan, nan],
[ 7., nan, nan],
[11., nan, nan]])
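As an aside, the zeros-then-nan two-step can be collapsed into one call with np.full, which builds the all-nan recipient array directly (a minor alternative, not part of the original transcript):

```python
import numpy as np

u = np.array([1, 2, 3, 6, 7, 11])

# np.full creates the (len(u), 3) array already filled with nan;
# then only the key column needs to be overwritten.
res = np.full((len(u), 3), np.nan)
res[:, 0] = u
```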
Find where the unique indices occur in rec_np. This assumes the rec_np indices are in order:
In [112]: mask1 = np.in1d(u,rec_np[:,0])
In [113]: mask1
Out[113]: array([ True, True, False, True, True, False])
Assign the rec values:
In [114]: res[mask1,1]=rec_np[:,1]
In [115]: res
Out[115]:
array([[ 1., 4., nan],
[ 2., 4., nan],
[ 3., nan, nan],
[ 6., 1., nan],
[ 7., 3., nan],
[11., nan, nan]])
Same for test_np:
In [116]: mask2 = np.in1d(u,test_np[:,0])
In [117]: mask2
Out[117]: array([False, True, True, False, False, True])
In [118]: res[mask2,2]=test_np[:,1]
In [119]: res
Out[119]:
array([[ 1., 4., nan],
[ 2., 4., 5.],
[ 3., nan, 1.],
[ 6., 1., nan],
[ 7., 3., nan],
[11., nan, 3.]])
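The steps above can be collected into one self-contained function (a sketch; merge_on_first_col is my own name for it, and it carries the same assumption as the walkthrough: keys within each input are unique and sorted):

```python
import numpy as np

def merge_on_first_col(a, b):
    """Outer-merge two (n, 2) arrays on their first column.

    Assumes the keys within each array are unique and in
    ascending order, as in the example above.
    """
    # Unique sorted union of both key columns
    u = np.unique(np.hstack((a[:, 0], b[:, 0])))
    # All-nan recipient with keys in column 0
    res = np.full((len(u), 3), np.nan)
    res[:, 0] = u
    # Scatter each value column into the rows whose key matches
    res[np.in1d(u, a[:, 0]), 1] = a[:, 1]
    res[np.in1d(u, b[:, 0]), 2] = b[:, 1]
    return res

rec_np = np.array([[1, 4], [2, 4], [6, 1], [7, 3]])
test_np = np.array([[2, 5], [3, 1], [11, 3]])
print(merge_on_first_col(rec_np, test_np))
```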
Looking at the in1d docs, I see I should have used isin, and also the assume_unique parameter. in1d makes use of unique and sorting.
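With that correction applied, the two masks would be built like this (np.isin takes the same role as in1d but keeps the element-wise shape, and assume_unique=True lets it skip the internal uniqueness pass when both inputs are known to be unique):

```python
import numpy as np

rec_np = np.array([[1, 4], [2, 4], [6, 1], [7, 3]])
test_np = np.array([[2, 5], [3, 1], [11, 3]])

u = np.unique(np.hstack((rec_np[:, 0], test_np[:, 0])))

# True where each unique key appears in the respective array
mask1 = np.isin(u, rec_np[:, 0], assume_unique=True)
mask2 = np.isin(u, test_np[:, 0], assume_unique=True)
```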
numpy.lib.recfunctions
In [141]: import numpy.lib.recfunctions as rf
Make structured arrays from your two arrays:
In [142]: arr1 = np.array([tuple(i) for i in rec_np],[('key','i'),('val1','f')])
In [143]: arr2 = np.array([tuple(i) for i in test_np],[('key','i'),('val2','f')])
In [144]: arr1
Out[144]:
array([(1, 4.), (2, 4.), (6, 1.), (7, 3.)],
dtype=[('key', '<i4'), ('val1', '<f4')])
In [145]: arr2
Out[145]:
array([( 2, 5.), ( 3, 1.), (11, 3.)],
dtype=[('key', '<i4'), ('val2', '<f4')])
Then use the join_by function (after playing a lot with the parameters):
In [146]: new = rf.join_by('key',arr1,arr2,jointype='outer',usemask=False, defaults={'val1':np.nan, 'val2':np.nan})
In [147]: new
Out[147]:
array([( 1, 4., nan), ( 2, 4., 5.), ( 3, nan, 1.), ( 6, 1., nan),
( 7, 3., nan), (11, nan, 3.)],
dtype=[('key', '<i4'), ('val1', '<f4'), ('val2', '<f4')])
and convert that to unstructured with:
In [148]: rf.structured_to_unstructured(new)
Out[148]:
array([[ 1., 4., nan],
[ 2., 4., 5.],
[ 3., nan, 1.],
[ 6., 1., nan],
[ 7., 3., nan],
[11., nan, 3.]])
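For copy-paste convenience, the whole join_by route in one self-contained sketch:

```python
import numpy as np
import numpy.lib.recfunctions as rf

rec_np = np.array([[1, 4], [2, 4], [6, 1], [7, 3]])
test_np = np.array([[2, 5], [3, 1], [11, 3]])

# Build structured arrays sharing the 'key' field; the value
# fields get distinct names so the join keeps both.
arr1 = np.array([tuple(r) for r in rec_np], dtype=[('key', 'i'), ('val1', 'f')])
arr2 = np.array([tuple(r) for r in test_np], dtype=[('key', 'i'), ('val2', 'f')])

# Outer join on 'key', filling unmatched rows with nan
new = rf.join_by('key', arr1, arr2, jointype='outer',
                 usemask=False, defaults={'val1': np.nan, 'val2': np.nan})

# Back to a plain 2-D float array
out = rf.structured_to_unstructured(new)
print(out)
```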
Probably not any faster; it makes use of 'builtin' functionality, but not a familiar one (and I may know the rf functions as well as anyone on SO).