Home > Software engineering >  How to get other column value with the first non null value of a column?
How to get other column value with the first non null value of a column?

Time:10-01

I have this df:

         CODE   YEAR MONTH DAY      PP
0        000659  1964  12  01       NaN
1        000659  1964  12  02       NaN
2        000659  1964  12  03       NaN
3        000659  1964  12  04       0.0
4        000659  1964  12  05       0.0
        ...   ...  ..  ..       ...
6052770  000543  2020  03  27       NaN
6052771  000543  2020  03  28       NaN
6052772  000543  2020  03  29       NaN
6052773  000543  2020  03  30       NaN
6052774  000543  2020  03  31       NaN

[6052775 rows x 5 columns]

I want to know the year of the first non NaN value in PP column by each code. For example: For code 000659, 1964 is the year for the first non NaN value in PP column.

Expected result:

     CODE   START YEAR
0    000659     1964
1    000543     1970
..    etc....    etc....

So i did this code:

dic={}
for code, data in df.groupby('CODE'):
    number=data['PP'].first_valid_index().astype(int)
    starting_year=data['YEAR'].iloc[number]
    dic[code] = starting_year
starting_years=pd.DataFrame(dic.items(), columns=['CODE', 'STARTING YEAR'])

But i got this error: IndexError: single positional indexer is out-of-bounds

Maybe i did something wrong but when i print the code and starting_year i don't see any issue.

Would you mind to help me?

Thanks in advance.

CodePudding user response:

You could also try,

df.dropna(subset=['PP']).drop_duplicates(subset='CODE', keep='first')

   CODE  YEAR  MONTH  DAY   PP
3   659  1964     12    4  0.0
...

CodePudding user response:

We can set_index

out = df.set_index('YEAR').groupby('CODE').PP.agg(pd.Series.first_valid_index)
CODE
543       NaN
659    1964.0
Name: PP, dtype: float64
  • Related