I want to create new features from a pandas dataset by an arbitrary process-CodePudding

The following data sets are currently being used.

import pandas as pd
import io

csv_data = '''
ID,age,get_sick,year
4567,76,0,2014
4567,78,0,2016
4567,79,1,2017
12168,65,0,2014
12168,68,0,2017
12168,69,0,2018
12168,70,1,2019
20268,65,0,2014
20268,66,0,2015
20268,67,0,2016
20268,68,0,2017
20268,69,1,2018
22818,65,0,2008
22818,73,1,2016
'''
df = pd.read_csv(io.StringIO(csv_data), index_col=['ID', 'age'])

           get_sick  year
ID    age                
4567  76          0  2014
      78          0  2016
      79          1  2017
12168 65          0  2014
      68          0  2017
      69          0  2018
      70          1  2019
20268 65          0  2014
      66          1  2015
      67          1  2016
      68          1  2017
      69          1  2018
22818 65          0  2008
      73          1  2016

For each individual, get_sick is 1 if the person's age at the time of the physical exam, the year of the year measured, and if the person has ever had an illness.

We are now trying to build a model that predicts the likelihood that a person with get_sick=0 will develop a disease in the future.

We want to check if the person with get_sick=0 has changed from 0 to 1 within 5 years, and if so, we want to store 1 in the new column 'history', and if 0 to 0, we want to store 0.

We only target data with get_sick=0, since data with get_sick=1 is not used for training.

Tried

N = 3
idx = df.groupby('ID').apply(lambda x: x.query("(year - @x.year.min()) <= @N")['get_sick'].max())
df_1 = df.reset_index().assign(history=df.reset_index()['ID'].map(idx)).set_index(['ID', 'age'])
df_1

This process did not give us the ideal treatment because we were comparing only the first year.

The ideal output result would be the following

           get_sick  year  history
ID    age                
4567  76          0  2014       1
      78          0  2016       1
      79          1  2017     Nan
12168 65          0  2014       1
      68          0  2017       1
      69          0  2018       1
      70          1  2019     Nan
20268 65          0  2014       1
      66          1  2015     Nan
      67          1  2016     Nan
      68          1  2017     Nan
      69          1  2018     Nan
22818 65          0  2008       0
      73          1  2016     Nan

If anyone is familiar with Pandas operation, I would appreciate it if you could let me know.

Thank you in advance.

※The following results are obtained for certain data frames.

import pandas as pd
import io

csv_data = '''
ID,age,get_sick,year
33868,76,0,2014
33868,78,1,2016
33868,79,1,2017
33868,80,1,2018
'''
df_1 = pd.read_csv(io.StringIO(csv_data), index_col=['ID', 'age'])

                get_sick  year  
    ID     age                 
    33868  76          0  2014       
           78          1  2016       
           79          1  2017 
           80          1  2018


df_mer_1 = df_1[df_1.get_sick == 1].reset_index()[['ID', 'year']]

df_1 = df_1.reset_index().merge(df_mer_1, on = 'ID', suffixes=('', '_max'))
df_1.loc[(df_1.get_sick == 0) & (df_1.year_max - df_1.year <= 5), 'history'] = 1
df_1.loc[(df_1.get_sick == 0) & (df_1.year_max - df_1.year > 5), 'history'] = 0

df_1 = df_1.set_index(['ID', 'age']).drop(columns='year_max')

The results are as follows

            get_sick  year  history
ID     age                
33868  76          0  2014       1
       76          0  2014       1
       76          0  2014       1 
       78          1  2016     Nan
       78          1  2016     Nan
       78          1  2016     Nan
       79          1  2017     Nan
       79          1  2017     Nan
       79          1  2017     Nan
       80          1  2018     Nan
       80          1  2018     Nan
       80          1  2018     Nan

Do you know why multiple identical rows are generated in this way? I would be glad if you could help me. Thank you in advance.

CodePudding user response：

First I created a column with the year for which get_sick = 1.

df_mer = df[df.get_sick == 1].reset_index()[['ID', 'year']].drop_duplicates(subset = 'ID')

df = df.reset_index().merge(df_mer, on = 'ID', suffixes=('', '_max'))

Then you can use year_max to compute the difference in years and assign a 1/0.

df.loc[(df.get_sick == 0) & (df.year_max - df.year <= 5), 'history'] = 1
df.loc[(df.get_sick == 0) & (df.year_max - df.year > 5), 'history'] = 0

df = df.set_index(['ID', 'age']).drop(columns='year_max')

Output:

           get_sick  year  history
ID    age                         
4567  76          0  2014      1.0
      78          0  2016      1.0
      79          1  2017      NaN
12168 65          0  2014      1.0
      68          0  2017      1.0
      69          0  2018      1.0
      70          1  2019      NaN
20268 65          0  2014      1.0
      66          0  2015      1.0
      67          0  2016      1.0
      68          0  2017      1.0
      69          1  2018      NaN
22818 65          0  2008      0.0
      73          1  2016      NaN