I have an Excel file with the following dataset:
date y
1/1/12 0:00 86,580
1/1/12 0:30 86,580
1/1/12 1:00 nan
1/1/12 1:30 86,910
1/1/12 2:00 87,240
1/1/12 2:30 87,130
1/1/12 3:00 nan
1/1/12 3:30 nan
1/1/12 4:00 87,570
1/1/12 4:30 91,400
1/1/12 5:00 91,880
1/1/12 5:30 92,600
1/1/12 6:00 nan
1/1/12 6:30 nan
1/1/12 7:00 nan
1/1/12 7:30 94,160
1/1/12 8:00 94,280
1/1/12 8:30 94,640
The data contains some NaN values. I need to extract the start and end date of each NaN group. Here is what I tried:
import pandas as pd
import numpy as np
from datetime import date, datetime, time, timedelta
import re
df=pd.read_excel(r'test_nan.xlsx',
sheet_name='Sheet1', header=0)
nan_index = df.y.index[df.y.apply(np.isnan)]
start=df.y.apply(str).str.findall(r'\d\nnan')
end=begin=df.y.apply(str).str.findall(r'nan\n\d')
Here is what I want to extract:
start end
1/1/12 0:30 1/1/12 1:30
1/1/12 2:30 1/1/12 4:00
1/1/12 5:30 1/1/12 7:30
Please find the attached Excel file: test_nan.xlsx
CodePudding user response:
Building on @jezrael's answer here, we can finish the task:
m = df['y'].isna()
df = pd.concat([df[m.shift(fill_value=False)],
df[m.shift(-1, fill_value=False)]]).sort_index()
m.shift() will get the rows after a NaN (and some NaNs where there is a group of NaNs), and m.shift(-1) will get the rows before a NaN (again with some extra NaNs where there is a group of NaNs). Then we get rid of all NaN rows with:
df = df.dropna(subset=['y'])
Now we have rows with alternating values, start and end.
# the timestamp column is named 'date' in the question's data
res = pd.DataFrame({
    'start': df['date'][::2].values,
    'end': df['date'][1::2].values
})
Output res:
start end
0 2012-01-01 00:30:00 2012-01-01 01:30:00
1 2012-01-01 02:30:00 2012-01-01 04:00:00
2 2012-01-01 05:30:00 2012-01-01 07:30:00
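For anyone who wants to try this without the Excel file, here is a minimal self-contained sketch. It is a small variant of the steps above, not the exact same code: instead of concatenating and then dropping NaN rows, it filters each shifted mask with ~m so that only valid rows remain. The sample frame is rebuilt by hand from the question, the column names date and y are taken from the question's header, and the helper name nan_group_bounds is made up for illustration. It assumes the series neither starts nor ends with NaN; otherwise the two sides would have different lengths and building the result would fail.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question (the Excel file is not read here).
values = [86580, 86580, np.nan, 86910, 87240, 87130, np.nan, np.nan, 87570,
          91400, 91880, 92600, np.nan, np.nan, np.nan, 94160, 94280, 94640]
df = pd.DataFrame({
    "date": pd.date_range("2012-01-01 00:00", periods=len(values), freq="30min"),
    "y": values,
})


def nan_group_bounds(frame, value_col="y", date_col="date"):
    """Return one row per NaN run: last valid date before it, first valid date after it.

    Hypothetical helper; assumes the column does not start or end with NaN.
    """
    m = frame[value_col].isna()
    # Valid rows whose next row is NaN (the row just before each NaN run).
    before = frame[m.shift(-1, fill_value=False) & ~m]
    # Valid rows whose previous row is NaN (the row just after each NaN run).
    after = frame[m.shift(fill_value=False) & ~m]
    return pd.DataFrame({
        "start": before[date_col].to_numpy(),
        "end": after[date_col].to_numpy(),
    })


res = nan_group_bounds(df)
print(res)
```

On the sample data this should give the same three start/end pairs as shown above.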