I have this df:
CODE YEAR MONTH DAY PP
0 000659 1964 12 01 NaN
1 000659 1964 12 02 NaN
2 000659 1964 12 03 NaN
3 000659 1964 12 04 0.0
4 000659 1964 12 05 0.0
... ... .. .. ...
6052770 000543 2020 03 27 NaN
6052771 000543 2020 03 28 NaN
6052772 000543 2020 03 29 NaN
6052773 000543 2020 03 30 NaN
6052774 000543 2020 03 31 NaN
[6052775 rows x 5 columns]
I want to know the year of the first non NaN value in PP column by each code. For example: For code 000659, 1964 is the year for the first non NaN value in PP column.
Expected result:
CODE START YEAR
0 000659 1964
1 000543 1970
.. etc.... etc....
So i did this code:
dic={}
for code, data in df.groupby('CODE'):
number=data['PP'].first_valid_index().astype(int)
starting_year=data['YEAR'].iloc[number]
dic[code] = starting_year
starting_years=pd.DataFrame(dic.items(), columns=['CODE', 'STARTING YEAR'])
But i got this error: IndexError: single positional indexer is out-of-bounds
Maybe i did something wrong but when i print the code and starting_year i don't see any issue.
Would you mind to help me?
Thanks in advance.
CodePudding user response:
You could also try,
df.dropna(subset=['PP']).drop_duplicates(subset='CODE', keep='first')
CODE YEAR MONTH DAY PP
3 659 1964 12 4 0.0
...
CodePudding user response:
We can set_index
out = df.set_index('YEAR').groupby('CODE').PP.agg(pd.Series.first_valid_index)
CODE
543 NaN
659 1964.0
Name: PP, dtype: float64