I am working with the following table trying to match "given" values to "year" column in a dynamic way, before finding the number of years until it crosses the 70% mark.
data = {
'Given' : [0.45, 0.39, 0.99, 0.58, None],
'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
df = pd.DataFrame(data)
print(df)
Output:
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.45 0.25 0.39 0.43 0.65 0.74
1 0.39 0.15 0.27 0.58 0.83 0.87
2 0.99 0.30 0.55 0.78 0.95 0.99
3 0.58 0.23 0.30 0.64 0.73 0.92
4 NaN 0.25 0.40 0.69 0.85 0.95
I am trying to match the "given" value to the lower year if the value is less than 75% of the distance between the two years on either side of "given".
Unorthodox visual aid:
lower year ------- 75% --> upper year
Example: if "given" is 0.17, "year 1" is 0.1, "year 2" is 0.2, then it will still be mapped to "year 1" since 0.17 < 0.175 (75% of the way between the two), outputting "1".
If "given" >= 70%, then output "full". And if "given" is NaN, then output the first year above 70%.
Sample output:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2.0
1 0.39 0.15 0.27 0.58 0.83 0.87 2.0
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1.0
4 NaN 0.25 0.40 0.69 0.85 0.95 4
Here is the answer that I am trying to edit in order to match the first criteria (it still gives the same output, but does not follow the 75% fence):
import pandas as pd
import numpy as np
pct_70 = (df.T.reset_index(drop=True).T > .7).idxmax(axis=1)
nearest_col = ((df.iloc[:,1:].T.reset_index(drop=True).T
- pd.concat([df.iloc[:,0]] * len(df.columns[1:]), axis=1)
.T.reset_index(drop=True).T)).abs().idxmin(axis=1)
output = pct_70 - nearest_col - 1
# Conditionally apply the output series
df['Output'] = np.select([output.gt(0),output.lt(0),output.isnull()],
[output, 'full', pct_70],np.nan)
I am trying to make changes to the line
nearest_col = ((df.iloc[:,1:].T.reset_index(drop=True).T
- pd.concat([df.iloc[:,0]] * len(df.columns[1:]), axis=1)
.T.reset_index(drop=True).T)).abs().idxmin(axis=1)
Mathematically speaking I know we will incorporate
0.75*(df.iloc[:,1:].T.reset_index(drop=True).T
- pd.concat([df.iloc[:,0]])
but am not sure how this will be edited into the code - as a python beginner, I am at a loss at the moment..
All help appreciated.
In summary, three cases:
Case 1: Given is above 70% -> output "full" Case 2: Given is below 70% -> match to lower year if less than 3/4 the range between the lower and upper year -> output number of years until 70% or higher Case 3: Given is NaN -> output number of years until 70% or higher
CodePudding user response:
Here is a way to do it with numpy broadcasting:
import numpy as np
# 75% rule.
thresholds = df df.diff(-1, axis=1).abs() * 0.75
below_75 = (df['Given'].to_numpy()[:, None] - thresholds.to_numpy()) < 0
min_year = thresholds.where(below_75).drop(columns=['Given']).idxmin(axis=1).str.replace('Year ', '').astype(float)
min_year = df.where(df > 0.7).drop(columns=['Given']).idxmin(axis=1).str.replace('Year ', '').astype(float) - min_year
# 70% rule.
min_year.loc[df['Given'] > 0.7] = 'full'
# NaN rule.
min_year.loc[df['Given'].isna()] = df.where(df > 0.7).drop(columns=['Given']).idxmin(axis=1).str.replace('Year ', '').astype(float)
df['Output'] = min_year
print(df)
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2.0
1 0.39 0.15 0.27 0.58 0.83 0.87 2.0
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1.0
4 NaN 0.25 0.40 0.69 0.85 0.95 4.0
Another dataset:
Since your example data has the same output also if you don't follow the 75% rule, here is another example row (output should be 2):
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.31 0.23 0.3 0.64 0.73 0.92
Output with your approach:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.31 0.23 0.3 0.64 0.73 0.92 1
Output with solution in this answer:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.31 0.23 0.3 0.64 0.73 0.92 2.0