I am trying to take a "given" value and match it to a "year" in the same row using the following dataframe:
data = {
'Given' : [0.45, 0.39, 0.99, 0.58, None],
'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
df = pd.DataFrame(data)
print(df)
Output:
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.45 0.25 0.39 0.43 0.65 0.74
1 0.39 0.15 0.27 0.58 0.83 0.87
2 0.99 0.30 0.55 0.78 0.95 0.99
3 0.58 0.23 0.30 0.64 0.73 0.92
4 NaN 0.25 0.40 0.69 0.85 0.95
However, the matching process has a few caveats. I am trying to match to the closest year to the given value before calculating the time to the first "year" above 70%. So row 0 would match to "year 3", and we can see in the same row that it will take two years until "year 5", which is the first occurence in the row above 70%.
For any "given" value already above 70%, we can just output "full", and for any "given" values that don't contain data, we can just output the first year above 70%. The output will look like the following:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2
1 0.39 0.15 0.27 0.58 0.83 0.87 2
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1
4 NaN 0.25 0.40 0.69 0.85 0.95 4
It has taken me a horrendously long time to clean up this data so at the moment I can't think of a way to begin other than some use of .abs()
to begin the matching process. All help appreciated.
CodePudding user response:
Here you go!
import numpy as np
def output(df):
output = []
for i in df.iterrows():
row = i[1].to_list()
given = row[0]
compare = np.array(row[1:])
first_70 = np.argmax(compare > 0.7)
if np.isnan(given):
output.append(first_70 1)
continue
if given > 0.7:
output.append('full')
continue
diff = np.abs(np.array(compare) - np.array(given))
closest_year = diff.argmin()
output.append(first_70 - closest_year)
return output
df['output'] = output(df)
CodePudding user response:
Vectorized Pandas Approach:
reset_index()
of the column names and.T
, so that you can have the same column names and subtract dataframes from each other in a vectorized way.pd.concat()
with*
creates a dataframe that duplicates the first column, so that you can get the absolute difference of the dataframes in a more vectorized way instead of looping through columns.Use
idxmax
andidxmin
to identify the column numbers according to your criteria.Use
np.select
according to your criteria.import pandas as pd import numpy as np # identify 70% columns pct_70 = (df.T.reset_index(drop=True).T > .7).idxmax(axis=1) # identify column number of lowest absolute difference to Given nearest_col = ((df.iloc[:,1:].T.reset_index(drop=True).T - pd.concat([df.iloc[:,0]] * len(df.columns[1:]), axis=1) .T.reset_index(drop=True).T)).abs().idxmin(axis=1) # Generate an output series output = pct_70 - nearest_col - 1 # Conditionally apply the output series df['Output'] = np.select([output.gt(0),output.lt(0),output.isnull()], [output, 'full', pct_70],np.nan) df Out[1]: Given Year 1 Year 2 Year 3 Year 4 Year 5 Output 0 0.45 0.25 0.39 0.43 0.65 0.74 2.0 1 0.39 0.15 0.27 0.58 0.83 0.87 2.0 2 0.99 0.30 0.55 0.78 0.95 0.99 full 3 0.58 0.23 0.30 0.64 0.73 0.92 1.0 4 NaN 0.25 0.40 0.69 0.85 0.95 4