I have daily temperature data from 1901-1940. I want to exclude leap years i.e. remove any temperature data that falls on 2/29. My data is currently one long array. I am reshaping it so that every year is a row and every column is a day. I'm trying to remove the leap years with the last line of code here:
import requests
from datetime import date
params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly", "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
#package into the dataframe
df = pd.DataFrame(columns=['date', 'tmax'], data=resp.json()['data'])
#convert the date column to datetimes
df['date']=pd.to_datetime(df['date'])
#select years
mask = (df['date'] >= '1900-01-01') & (df['date'] <= '1940-12-31')
Baseline=df.loc[mask]
#get rid of leap years:
Baseline=Baseline.loc[(Baseline['date'].dt.day!=29) & (Baseline['date'].dt.month!=2)]
but when I reshape the array I notice that there are 366 columns instead of 365 so I don't think I'm actually getting rid of february 29th data. How would I completely eliminate any temperature data that is recorded on 2/29 throughout my data set. I only want 365 data points for each year.
daily=pd.DataFrame(data={'date':Baseline.date,'tmax':Baseline.tmax})
daily['day']=daily.date.dt.dayofyear
daily['year']=daily.date.dt.year
daily.pivot(index='year', columns='day', values='tmax')
CodePudding user response:
The easy way is to eliminate those items before building the array.
import requests
from datetime import date
params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly", "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
vals = resp.json()
rows = [row for row in vals['data'] if '02-29' not in row[0]]
print(rows)
CodePudding user response:
You get 366 columns because of using dayofyear
. That will calculate the day per the actual calendar (i.e. without removing 29 Feb).
To see this:
>>> daily.iloc[1154:1157]
date tmax day year
1154 1904-02-28 38.000000 59 1904
1156 1904-03-01 39.000000 61 1904
1157 1904-03-02 37.000000 62 1904
Notice the day goes from 59 to 61 (the 60th day was 29 February 1904).
CodePudding user response:
The source of your problem is that you used daily.date.dt.dayofyear.
Each day in a year, including Feb 29 has its own number.
To make thing worse, e.g. Mar 1 has dayofyear:
- 61 in leap years,
- 60 in non-leap years.
One of possible solutions is to set the day column to a string representation of month and day. To provide proper sort in the pivoted table, the month part should come first.
So, after you convert date column to datetime, to create both additional columns run:
daily['year'] = daily.date.dt.year
daily['day'] = daily.date.dt.strftime('%m-%d')
Then you can filter out Feb 29 and generate the pivot table in one go:
result = daily[daily.day != '02-29'].pivot(index='year', columns='day',
values='tmax')
For some limited source data sample, other than yours, I got:
day 02-27 02-28 03-01 03-02
year
2020 11 10 14 15
2021 11 21 22 24