Update numpy array based on nearest value in pandas DataFrame column-CodePudding

How can I update an array based on the nearest value in a pandas DataFrame column? For example, I'd like to update the following array based on the "Time" column in the pandas DataFrame so that the array now contains the "X" values:

Input array:

a = np.array([
    [122.25, 225.00, 201.00],
    [125.00, 151.50, 160.62],
    [99.99, 142.25, 250.01],
])

Input DataFrame:

df = pd.DataFrame({
    'Time': [100, 125, 150, 175, 200, 225],
    'X': [26100, 26200, 26300, 26000, 25900, 25800],
})

Expected output array:

([
    [26200, 25800, 25900],
    [26200, 26300, 26300],
    [26100, 26300, 25800],
])

CodePudding user response：

Use merge_asof:

# Convert Time to float since your input array is float.
# merge_asof requires both sides to have the same data types
df['Time'] = df['Time'].astype('float')

# merge_asof also requires both data frames to be sorted by the join key (Time)
# So we need to flatten the input array and make note of the original order
# before going into the merge
a_ = np.ravel(a)
o_ = np.arange(len(a_))

tmp = pd.DataFrame({
    'Time': a_,
    'Order': o_
})

# Merge the two data frames and extract X in the original order
result = (
    pd.merge_asof(tmp.sort_values('Time'), df.sort_values('Time'), on='Time', direction='nearest')
        .sort_values('Order')
        ['X'].to_numpy()
        .reshape(a.shape)
)

CodePudding user response：

This is one way to solve this:

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'Time': [100, 125, 150, 175, 200, 225],
        'X': [26100, 26200, 26300, 26000, 25900, 25800],
    })

a = np.array([
        [122.25, 225.00, 201.00],
        [125.00, 151.50, 160.62],
        [99.99, 142.25, 250.01],
    ])

result = []
for row in a:
    lst = []
    for i in range(0,3):  # For each item in the row of a
        tim = row[i]  # We'll search for the nearest to this time in the df
        # Here's the meat of the answer:
        # df['Time'].sub(tim): Calculate the diff. between the given time
        #                      and the Time column
        # abs():  the the absolute value of the diffs.
        # idxmin():  return the index of the min abs difference
        idx = df['Time'].sub(tim).abs().idxmin()  #
        x = df['X'][idx]  # Get the "x" value at the same row.
        lst.append(x)
    result.append(lst)
print(result)

Which outputs:

   Time      X
0   100  26100
1   125  26200
2   150  26300
3   175  26000
4   200  25900
5   225  25800
[26200, 25800, 25900]
[26200, 26300, 26300]
[26100, 26300, 25800]