Home > other >  How to eliminate leap years in pandas data frame
How to eliminate leap years in pandas data frame

Time:11-05

I have daily temperature data from 1901-1940. I want to exclude leap years i.e. remove any temperature data that falls on 2/29. My data is currently one long array. I am reshaping it so that every year is a row and every column is a day. I'm trying to remove the leap years with the last line of code here:

import requests
from datetime import date
params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly",  "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
#package into the dataframe
df = pd.DataFrame(columns=['date', 'tmax'], data=resp.json()['data'])
#convert the date column to datetimes
df['date']=pd.to_datetime(df['date'])
#select years
mask = (df['date'] >= '1900-01-01') & (df['date'] <= '1940-12-31')
Baseline=df.loc[mask]
#get rid of leap years:
Baseline=Baseline.loc[(Baseline['date'].dt.day!=29) & (Baseline['date'].dt.month!=2)]

but when I reshape the array I notice that there are 366 columns instead of 365 so I don't think I'm actually getting rid of february 29th data. How would I completely eliminate any temperature data that is recorded on 2/29 throughout my data set. I only want 365 data points for each year.

daily=pd.DataFrame(data={'date':Baseline.date,'tmax':Baseline.tmax})
daily['day']=daily.date.dt.dayofyear
daily['year']=daily.date.dt.year
daily.pivot(index='year', columns='day', values='tmax')

CodePudding user response:

The easy way is to eliminate those items before building the array.

import requests
from datetime import date

params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly",  "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
vals = resp.json()
rows = [row for row in vals['data'] if '02-29' not in row[0]]
print(rows)

CodePudding user response:

You get 366 columns because of using dayofyear. That will calculate the day per the actual calendar (i.e. without removing 29 Feb).

To see this:

>>> daily.iloc[1154:1157]

           date       tmax  day  year
1154 1904-02-28  38.000000   59  1904
1156 1904-03-01  39.000000   61  1904
1157 1904-03-02  37.000000   62  1904

Notice the day goes from 59 to 61 (the 60th day was 29 February 1904).

CodePudding user response:

The source of your problem is that you used daily.date.dt.dayofyear.

Each day in a year, including Feb 29 has its own number.

To make thing worse, e.g. Mar 1 has dayofyear:

  • 61 in leap years,
  • 60 in non-leap years.

One of possible solutions is to set the day column to a string representation of month and day. To provide proper sort in the pivoted table, the month part should come first.

So, after you convert date column to datetime, to create both additional columns run:

daily['year'] = daily.date.dt.year
daily['day'] = daily.date.dt.strftime('%m-%d')

Then you can filter out Feb 29 and generate the pivot table in one go:

result = daily[daily.day != '02-29'].pivot(index='year', columns='day',
    values='tmax')

For some limited source data sample, other than yours, I got:

day   02-27  02-28  03-01  03-02
year                            
2020     11     10     14     15
2021     11     21     22     24
  • Related