I have a dataframe that has 7 columns. I want to create an 8th column with the output of the equation. The equation is: from columns 2-7, for those numbers that are above Col 1, output the minimum value to Col 8.
In excel, I would use this formula in Col8: =MINIFS($Col2:$Col7,$Col2:$Col7,">"&$Col1)
Date_Time Col1 Col2 Col3 Col4 Col5 Col6 Col7
20 3/14/2022 0:35 0.68053 0.68048 0.68094 0.6811 0.68111 0.68126 0.68179
21 3/14/2022 0:36 0.6805 0.6805 0.68091 0.68109 0.68111 0.68125 0.68178
22 3/14/2022 0:37 0.68052 0.68051 0.68089 0.68107 0.6811 0.68124 0.68177
23 3/14/2022 0:38 0.68057 0.6805 0.68086 0.68106 0.6811 0.68123 0.68176
24 3/14/2022 0:39 0.68055 0.6805 0.68083 0.68104 0.68109 0.68122 0.68175
The output of the above data sample would be:
Col8
0.68094
0.68091
0.68089
0.68086
0.68083
I have read many posts, and tried with np.where:
ColList = [Col2,Col3,Col4,Col5,Col6,Col7]
df['Col8'] = np.where(df[ColList ].min(axis=1) > df['Col1'],df[ColList ].min(axis=1),np.nan)
But it is finding the minimum number in the list and then checking if that minimum number is greater than Col1. What I am looking for is it to do it in reverse: for it to only look at the numbers that are greater than Col1 and return the minimum out of those numbers (or return np.nan if there are no numbers greater than Col1).
I prefer a vectorized solution, as the dataset is quite large. Otherwise, I would just create a function/loop to do this.
CodePudding user response:
IIUC, you want the positively closest values. One way using masking:
s = df[["Col1"]].values
tmp = df.loc[:, "Col2":] - s
tmp[tmp<=0] = np.inf
df["Col8"] = tmp.min(axis=1) s.ravel()
Output:
print(df["Col8"])
20 0.68094
21 0.68091
22 0.68089
23 0.68086
24 0.68083
Name: Col8, dtype: float64
Logic:
- We first calculate the distance between
Col1
and others. - Then any negative distances (i.e. those smaller than
Col1
) get replaced bynumpy.inf
to exclude in minimum comparison.
Performance:
df = df.sample(n=100000, replace=True).reset_index(drop=True)
%%timeit
s = df[["Col1"]].values
tmp = df.loc[:, "Col2":] - s
tmp[tmp<=0] = np.inf
df["Col8"] = tmp.min(axis=1) s.ravel()
# 13.1 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CodePudding user response:
One option is to compare within numpy, then use pd.where
and min(axis=1)
to get your desired output:
left = df.filter(regex=r"[2-7]")
right = df.Col1.to_numpy()[:, None]
booleans = left.to_numpy() > right
left.where(booleans).min(axis = 1)
0.68094
0.68091
0.68089
0.68086
0.68083
I could not copy your data well (issues with the date column), hence the series output.