Excel data extraction using Python

Time:08-03

I have an Excel file with the following dataset:

date         y
1/1/12 0:00 86,580
1/1/12 0:30 86,580
1/1/12 1:00 nan
1/1/12 1:30 86,910
1/1/12 2:00 87,240
1/1/12 2:30 87,130
1/1/12 3:00 nan
1/1/12 3:30 nan
1/1/12 4:00 87,570
1/1/12 4:30 91,400
1/1/12 5:00 91,880
1/1/12 5:30 92,600
1/1/12 6:00 nan
1/1/12 6:30 nan
1/1/12 7:00 nan
1/1/12 7:30 94,160
1/1/12 8:00 94,280
1/1/12 8:30 94,640

The data contains some NaN values. I need to extract the start and end date of each NaN group, that is, the last valid timestamp before the group and the first valid timestamp after it. Here is what I tried:

import pandas as pd
import numpy as np
from datetime import date, datetime, time, timedelta
import re


df=pd.read_excel(r'test_nan.xlsx', 
                 sheet_name='Sheet1', header=0)

nan_index = df.y.index[df.y.apply(np.isnan)]
start=df.y.apply(str).str.findall(r'\d\nnan')
end=begin=df.y.apply(str).str.findall(r'nan\n\d')

Here is what I want to extract:

start        end
1/1/12 0:30  1/1/12 1:30
1/1/12 2:30  1/1/12 4:00
1/1/12 5:30  1/1/12 7:30

Please find the attached Excel file: test_nan.xlsx

CodePudding user response:

Building on @jezrael's answer, we can finish the task as follows.

m = df['y'].isna()
df = pd.concat([df[m.shift(fill_value=False)],
                df[m.shift(-1, fill_value=False)]]).sort_index()

m.shift() selects every row that directly follows a NaN, and m.shift(-1) selects every row that directly precedes one. Inside a group of consecutive NaNs, both masks also pick up some of the NaN rows themselves. We then get rid of those NaN rows with:

df = df.dropna(subset=['y'])
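
As a quick check of these two steps, here is a minimal sketch that rebuilds the question's sample in memory instead of reading test_nan.xlsx. The column names date and y are taken from the question's table, the 30-minute spacing is assumed from the timestamps shown, and the names sample, after_nan, before_nan and kept are only illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
# In-memory copy of the question's data (assumed columns: 'date' and 'y').
sample = pd.DataFrame({
    'date': pd.date_range('2012-01-01 00:00', periods=18, freq='30min'),
    'y': [86580, 86580, nan, 86910, 87240, 87130, nan, nan, 87570,
          91400, 91880, 92600, nan, nan, nan, 94160, 94280, 94640],
})

m = sample['y'].isna()
after_nan = sample[m.shift(fill_value=False)]       # rows whose previous row is NaN
before_nan = sample[m.shift(-1, fill_value=False)]  # rows whose next row is NaN

# Combine both selections, restore the original order, drop the NaN rows.
kept = pd.concat([after_nan, before_nan]).sort_index().dropna(subset=['y'])
print(kept.index.tolist())  # [1, 3, 5, 8, 11, 15]
```

The surviving labels pair up as (1, 3), (5, 8) and (11, 15), one pair per NaN group.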

The remaining rows now alternate between the start and the end of each NaN group (this holds as long as the data neither begins nor ends with NaN).

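# Even positions hold the starts, odd positions the ends.
# The timestamp column is named 'date' in the question's data.
res = pd.DataFrame({
    'start' : df['date'][::2].values,
    'end' : df['date'][1::2].values
})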

Output res:

                start                 end
0 2012-01-01 00:30:00 2012-01-01 01:30:00
1 2012-01-01 02:30:00 2012-01-01 04:00:00
2 2012-01-01 05:30:00 2012-01-01 07:30:00