Home > OS >  How to do MinIFS in Python Dataframe
How to do MinIFS in Python Dataframe

Time:03-23

I have a dataframe that has 7 columns. I want to create an 8th column with the output of the equation. The equation is: from columns 2-7, for those numbers that are above Col 1, output the minimum value to Col 8.

In excel, I would use this formula in Col8: =MINIFS($Col2:$Col7,$Col2:$Col7,">"&$Col1)

    Date_Time         Col1   Col2     Col3    Col4   Col5    Col6    Col7
20  3/14/2022 0:35  0.68053 0.68048 0.68094 0.6811  0.68111 0.68126 0.68179
21  3/14/2022 0:36  0.6805  0.6805  0.68091 0.68109 0.68111 0.68125 0.68178
22  3/14/2022 0:37  0.68052 0.68051 0.68089 0.68107 0.6811  0.68124 0.68177
23  3/14/2022 0:38  0.68057 0.6805  0.68086 0.68106 0.6811  0.68123 0.68176
24  3/14/2022 0:39  0.68055 0.6805  0.68083 0.68104 0.68109 0.68122 0.68175

The output of the above data sample would be:

Col8
0.68094
0.68091
0.68089
0.68086
0.68083

I have read many posts, and tried with np.where:

ColList = [Col2,Col3,Col4,Col5,Col6,Col7]
df['Col8'] = np.where(df[ColList ].min(axis=1) > df['Col1'],df[ColList ].min(axis=1),np.nan)

But it is finding the minimum number in the list and then checking if that minimum number is greater than Col1. What I am looking for is it to do it in reverse: for it to only look at the numbers that are greater than Col1 and return the minimum out of those numbers (or return np.nan if there are no numbers greater than Col1).

I prefer a vectorized solution, as the dataset is quite large. Otherwise, I would just create a function/loop to do this.

CodePudding user response:

IIUC, you want the positively closest values. One way using masking:

s = df[["Col1"]].values
tmp = df.loc[:, "Col2":] - s
tmp[tmp<=0] = np.inf
df["Col8"] = tmp.min(axis=1)   s.ravel()

Output:

print(df["Col8"])

20    0.68094
21    0.68091
22    0.68089
23    0.68086
24    0.68083
Name: Col8, dtype: float64

Logic:

  1. We first calculate the distance between Col1 and others.
  2. Then any negative distances (i.e. those smaller than Col1) get replaced by numpy.inf to exclude in minimum comparison.

Performance:

df = df.sample(n=100000, replace=True).reset_index(drop=True)

%%timeit

s = df[["Col1"]].values
tmp = df.loc[:, "Col2":] - s
tmp[tmp<=0] = np.inf
df["Col8"] = tmp.min(axis=1)   s.ravel()

# 13.1 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CodePudding user response:

One option is to compare within numpy, then use pd.where and min(axis=1) to get your desired output:

left = df.filter(regex=r"[2-7]")
right = df.Col1.to_numpy()[:, None]
booleans = left.to_numpy() > right
left.where(booleans).min(axis = 1)

    0.68094
    0.68091
    0.68089
    0.68086
    0.68083

I could not copy your data well (issues with the date column), hence the series output.

  • Related