Matching to a specific year column in pandas-CodePudding

I am trying to take a "given" value and match it to a "year" in the same row using the following dataframe:

data = {
    'Given' : [0.45, 0.39, 0.99, 0.58, None],
    'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
    'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
    'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
    'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
    'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
   
df = pd.DataFrame(data)

print(df)

Output:

   Given  Year 1  Year 2  Year 3  Year 4  Year 5
0   0.45    0.25    0.39    0.43    0.65    0.74
1   0.39    0.15    0.27    0.58    0.83    0.87
2   0.99    0.30    0.55    0.78    0.95    0.99
3   0.58    0.23    0.30    0.64    0.73    0.92
4    NaN    0.25    0.40    0.69    0.85    0.95

However, the matching process has a few caveats. I am trying to match to the closest year to the given value before calculating the time to the first "year" above 70%. So row 0 would match to "year 3", and we can see in the same row that it will take two years until "year 5", which is the first occurence in the row above 70%.

For any "given" value already above 70%, we can just output "full", and for any "given" values that don't contain data, we can just output the first year above 70%. The output will look like the following:

   Given  Year 1  Year 2  Year 3  Year 4  Year 5 Output
0   0.45    0.25    0.39    0.43    0.65    0.74      2
1   0.39    0.15    0.27    0.58    0.83    0.87      2
2   0.99    0.30    0.55    0.78    0.95    0.99   full
3   0.58    0.23    0.30    0.64    0.73    0.92      1
4    NaN    0.25    0.40    0.69    0.85    0.95      4

It has taken me a horrendously long time to clean up this data so at the moment I can't think of a way to begin other than some use of .abs() to begin the matching process. All help appreciated.

CodePudding user response：

Here you go!

import numpy as np

def output(df):
    output = []
    for i in df.iterrows():
        row = i[1].to_list()
        given = row[0]
        compare = np.array(row[1:])
        first_70 = np.argmax(compare > 0.7)
        
        if np.isnan(given):
            output.append(first_70   1)
            continue
            
        if given > 0.7:
            output.append('full')
            continue
            
        diff = np.abs(np.array(compare) - np.array(given))
        closest_year = diff.argmin()

        output.append(first_70 - closest_year)
    
    return output

df['output'] = output(df)

CodePudding user response：

Vectorized Pandas Approach:

reset_index() of the column names and .T, so that you can have the same column names and subtract dataframes from each other in a vectorized way. pd.concat() with * creates a dataframe that duplicates the first column, so that you can get the absolute difference of the dataframes in a more vectorized way instead of looping through columns.
Use idxmax and idxmin to identify the column numbers according to your criteria.

Use np.select according to your criteria.

import pandas as pd
import numpy as np
# identify 70% columns
pct_70 = (df.T.reset_index(drop=True).T > .7).idxmax(axis=1)
# identify column number of lowest absolute difference to Given
nearest_col = ((df.iloc[:,1:].T.reset_index(drop=True).T 
 - pd.concat([df.iloc[:,0]] * len(df.columns[1:]), axis=1)
  .T.reset_index(drop=True).T)).abs().idxmin(axis=1) 
# Generate an output series
output = pct_70 - nearest_col - 1
# Conditionally apply the output series
df['Output'] = np.select([output.gt(0),output.lt(0),output.isnull()],
                          [output, 'full', pct_70],np.nan)
df
Out[1]: 
   Given  Year 1  Year 2  Year 3  Year 4  Year 5 Output
0   0.45    0.25    0.39    0.43    0.65    0.74    2.0
1   0.39    0.15    0.27    0.58    0.83    0.87    2.0
2   0.99    0.30    0.55    0.78    0.95    0.99   full
3   0.58    0.23    0.30    0.64    0.73    0.92    1.0
4    NaN    0.25    0.40    0.69    0.85    0.95      4