Home > database >  Matching values to columns using specific range criteria
Matching values to columns using specific range criteria

Time:05-19

I am working with the following table trying to match "given" values to "year" column in a dynamic way, before finding the number of years until it crosses the 70% mark.

data = {
    'Given' : [0.45, 0.39, 0.99, 0.58, None],
    'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
    'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
    'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
    'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
    'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
   
df = pd.DataFrame(data)

print(df)

Output:

   Given  Year 1  Year 2  Year 3  Year 4  Year 5
0   0.45    0.25    0.39    0.43    0.65    0.74
1   0.39    0.15    0.27    0.58    0.83    0.87
2   0.99    0.30    0.55    0.78    0.95    0.99
3   0.58    0.23    0.30    0.64    0.73    0.92
4    NaN    0.25    0.40    0.69    0.85    0.95


I am trying to match the "given" value to the lower year if the value is less than 75% of the distance between the two years on either side of "given".

Unorthodox visual aid:

lower year ------- 75% --> upper year

Example: if "given" is 0.17, "year 1" is 0.1, "year 2" is 0.2, then it will still be mapped to "year 1" since 0.17 < 0.175 (75% of the way between the two), outputting "1".

If "given" >= 70%, then output "full". And if "given" is NaN, then output the first year above 70%.

Sample output:

   Given  Year 1  Year 2  Year 3  Year 4  Year 5 Output
0   0.45    0.25    0.39    0.43    0.65    0.74    2.0
1   0.39    0.15    0.27    0.58    0.83    0.87    2.0
2   0.99    0.30    0.55    0.78    0.95    0.99   full
3   0.58    0.23    0.30    0.64    0.73    0.92    1.0
4    NaN    0.25    0.40    0.69    0.85    0.95      4

Here is the answer that I am trying to edit in order to match the first criteria (it still gives the same output, but does not follow the 75% fence):

import pandas as pd
import numpy as np

pct_70 = (df.T.reset_index(drop=True).T > .7).idxmax(axis=1)

nearest_col = ((df.iloc[:,1:].T.reset_index(drop=True).T 
 - pd.concat([df.iloc[:,0]] * len(df.columns[1:]), axis=1)
  .T.reset_index(drop=True).T)).abs().idxmin(axis=1) 

output = pct_70 - nearest_col - 1
# Conditionally apply the output series
df['Output'] = np.select([output.gt(0),output.lt(0),output.isnull()],
                          [output, 'full', pct_70],np.nan)

I am trying to make changes to the line

nearest_col = ((df.iloc[:,1:].T.reset_index(drop=True).T 
 - pd.concat([df.iloc[:,0]] * len(df.columns[1:]), axis=1)
  .T.reset_index(drop=True).T)).abs().idxmin(axis=1)

Mathematically speaking I know we will incorporate

0.75*(df.iloc[:,1:].T.reset_index(drop=True).T 
 - pd.concat([df.iloc[:,0]])

but am not sure how this will be edited into the code - as a python beginner, I am at a loss at the moment..

All help appreciated.

In summary, three cases:

Case 1: Given is above 70% -> output "full" Case 2: Given is below 70% -> match to lower year if less than 3/4 the range between the lower and upper year -> output number of years until 70% or higher Case 3: Given is NaN -> output number of years until 70% or higher

CodePudding user response:

Here is a way to do it with numpy broadcasting:

import numpy as np

# 75% rule.
thresholds = df   df.diff(-1, axis=1).abs() * 0.75
below_75 = (df['Given'].to_numpy()[:, None] - thresholds.to_numpy()) < 0
min_year = thresholds.where(below_75).drop(columns=['Given']).idxmin(axis=1).str.replace('Year ', '').astype(float)
min_year = df.where(df > 0.7).drop(columns=['Given']).idxmin(axis=1).str.replace('Year ', '').astype(float) - min_year

# 70% rule.
min_year.loc[df['Given'] > 0.7] = 'full'

# NaN rule.
min_year.loc[df['Given'].isna()] = df.where(df > 0.7).drop(columns=['Given']).idxmin(axis=1).str.replace('Year ', '').astype(float)

df['Output'] = min_year

print(df)
   Given  Year 1  Year 2  Year 3  Year 4  Year 5 Output
0   0.45    0.25    0.39    0.43    0.65    0.74    2.0
1   0.39    0.15    0.27    0.58    0.83    0.87    2.0
2   0.99    0.30    0.55    0.78    0.95    0.99   full
3   0.58    0.23    0.30    0.64    0.73    0.92    1.0
4    NaN    0.25    0.40    0.69    0.85    0.95    4.0

Another dataset:

Since your example data has the same output also if you don't follow the 75% rule, here is another example row (output should be 2):

   Given  Year 1  Year 2  Year 3  Year 4  Year 5
0    0.31    0.23     0.3    0.64    0.73    0.92

Output with your approach:

   Given  Year 1  Year 2  Year 3  Year 4  Year 5 Output
0    0.31    0.23     0.3    0.64    0.73    0.92      1

Output with solution in this answer:

   Given  Year 1  Year 2  Year 3  Year 4  Year 5 Output
0    0.31    0.23     0.3    0.64    0.73    0.92    2.0
  • Related