I have two different ndarrays that look like this:
rec_np = np.array([[1, 4], [2, 4], [6, 1], [7, 3]])
test_np = np.array([[2, 5], [3, 1], [11, 3]])
print(rec_np)
print(test_np)
Output:
[[1 4]
[2 4]
[6 1]
[7 3]]
[[ 2 5]
[ 3 1]
[11 3]]
I need to combine them based on the first column. The result should become:
[[1 4 nan]
[2 4 5]
[3 nan 1]
[6 1 nan]
[7 3 nan]
[11 nan 3]]
Does anyone have a suggestion that doesn't require converting to a pandas DataFrame first?
CodePudding user response:
pandas may be the easiest; numpy does not have such a merge tool (that I know of). But here's an approach that works (at least for your test case).
In [101]: rec_np = np.array([[1, 4], [2, 4], [6, 1], [7, 3]])
...: test_np = np.array([[2, 5], [3, 1], [11, 3]])
In [102]: rec_np, test_np
Out[102]:
(array([[1, 4],
[2, 4],
[6, 1],
[7, 3]]),
array([[ 2, 5],
[ 3, 1],
[11, 3]]))
First find the unique indices:
In [103]: firstcol = np.hstack((rec_np[:,0],test_np[:,0]))
In [104]: u = np.unique(firstcol)
In [105]: u
Out[105]: array([ 1, 2, 3, 6, 7, 11])
Make a recipient array with the right size:
In [106]: res = np.zeros((len(u),3),float)
Insert the indices and fillers:
In [107]: res[:,0] = u
In [108]: res[:,[1,2]] = np.nan
In [109]: res
Out[109]:
array([[ 1., nan, nan],
[ 2., nan, nan],
[ 3., nan, nan],
[ 6., nan, nan],
[ 7., nan, nan],
[11., nan, nan]])
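As an aside, the zeros-then-nan two-step can be collapsed into one call with np.full, which builds the all-nan recipient array directly (a minor alternative, not part of the original transcript):

```python
import numpy as np

u = np.array([1, 2, 3, 6, 7, 11])

# np.full creates the (len(u), 3) array already filled with nan;
# then only the key column needs to be overwritten.
res = np.full((len(u), 3), np.nan)
res[:, 0] = u
```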
Find where the unique indices occur in rec_np. This assumes the rec_np indices are in order:
In [112]: mask1 = np.in1d(u,rec_np[:,0])
In [113]: mask1
Out[113]: array([ True, True, False, True, True, False])
Assign the rec values:
In [114]: res[mask1,1]=rec_np[:,1]
In [115]: res
Out[115]:
array([[ 1., 4., nan],
[ 2., 4., nan],
[ 3., nan, nan],
[ 6., 1., nan],
[ 7., 3., nan],
[11., nan, nan]])
Same for test_np:
In [116]: mask2 = np.in1d(u,test_np[:,0])
In [117]: mask2
Out[117]: array([False, True, True, False, False, True])
In [118]: res[mask2,2]=test_np[:,1]
In [119]: res
Out[119]:
array([[ 1., 4., nan],
[ 2., 4., 5.],
[ 3., nan, 1.],
[ 6., 1., nan],
[ 7., 3., nan],
[11., nan, 3.]])
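The steps above can be collected into one self-contained function (a sketch; merge_on_first_col is my own name for it, and it carries the same assumption as the walkthrough: keys within each input are unique and sorted):

```python
import numpy as np

def merge_on_first_col(a, b):
    """Outer-merge two (n, 2) arrays on their first column.

    Assumes the keys within each array are unique and in
    ascending order, as in the example above.
    """
    # Unique sorted union of both key columns
    u = np.unique(np.hstack((a[:, 0], b[:, 0])))
    # All-nan recipient with keys in column 0
    res = np.full((len(u), 3), np.nan)
    res[:, 0] = u
    # Scatter each value column into the rows whose key matches
    res[np.in1d(u, a[:, 0]), 1] = a[:, 1]
    res[np.in1d(u, b[:, 0]), 2] = b[:, 1]
    return res

rec_np = np.array([[1, 4], [2, 4], [6, 1], [7, 3]])
test_np = np.array([[2, 5], [3, 1], [11, 3]])
print(merge_on_first_col(rec_np, test_np))
```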
Looking at the in1d docs, I see I should have used isin, and also the assume_unique parameter. in1d makes use of unique and sorting.
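With that correction applied, the two masks would be built like this (np.isin takes the same role as in1d but keeps the element-wise shape, and assume_unique=True lets it skip the internal uniqueness pass when both inputs are known to be unique):

```python
import numpy as np

rec_np = np.array([[1, 4], [2, 4], [6, 1], [7, 3]])
test_np = np.array([[2, 5], [3, 1], [11, 3]])

u = np.unique(np.hstack((rec_np[:, 0], test_np[:, 0])))

# True where each unique key appears in the respective array
mask1 = np.isin(u, rec_np[:, 0], assume_unique=True)
mask2 = np.isin(u, test_np[:, 0], assume_unique=True)
```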
numpy.lib.recfunctions
In [141]: import numpy.lib.recfunctions as rf
Make structured arrays from your two arrays:
In [142]: arr1 = np.array([tuple(i) for i in rec_np],[('key','i'),('val1','f')])
In [143]: arr2 = np.array([tuple(i) for i in test_np],[('key','i'),('val2','f')])
In [144]: arr1
Out[144]:
array([(1, 4.), (2, 4.), (6, 1.), (7, 3.)],
dtype=[('key', '<i4'), ('val1', '<f4')])
In [145]: arr2
Out[145]:
array([( 2, 5.), ( 3, 1.), (11, 3.)],
dtype=[('key', '<i4'), ('val2', '<f4')])
Then use the join_by function (after playing a lot with the parameters):
In [146]: new = rf.join_by('key',arr1,arr2,jointype='outer',usemask=False, defaults={'val1':np.nan, 'val2':np.nan})
In [147]: new
Out[147]:
array([( 1, 4., nan), ( 2, 4., 5.), ( 3, nan, 1.), ( 6, 1., nan),
( 7, 3., nan), (11, nan, 3.)],
dtype=[('key', '<i4'), ('val1', '<f4'), ('val2', '<f4')])
and convert that to unstructured with:
In [148]: rf.structured_to_unstructured(new)
Out[148]:
array([[ 1., 4., nan],
[ 2., 4., 5.],
[ 3., nan, 1.],
[ 6., 1., nan],
[ 7., 3., nan],
[11., nan, 3.]])
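For copy-paste convenience, the whole join_by route in one self-contained sketch:

```python
import numpy as np
import numpy.lib.recfunctions as rf

rec_np = np.array([[1, 4], [2, 4], [6, 1], [7, 3]])
test_np = np.array([[2, 5], [3, 1], [11, 3]])

# Build structured arrays sharing the 'key' field; the value
# fields get distinct names so the join keeps both.
arr1 = np.array([tuple(r) for r in rec_np], dtype=[('key', 'i'), ('val1', 'f')])
arr2 = np.array([tuple(r) for r in test_np], dtype=[('key', 'i'), ('val2', 'f')])

# Outer join on 'key', filling unmatched rows with nan
new = rf.join_by('key', arr1, arr2, jointype='outer',
                 usemask=False, defaults={'val1': np.nan, 'val2': np.nan})

# Back to a plain 2-D float array
out = rf.structured_to_unstructured(new)
print(out)
```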
Probably not any faster; it makes use of 'builtin' functionality, but not a familiar one (and I may know the rf functions as well as anyone on SO).