I have a table of data that was supposed to be recorded every 30 seconds, but some of it was recorded with a larger time step.
So I want to write a program with pandas that checks the time steps and, where they are larger than 30, inserts the required number of NaN rows and then fills the NaN cells by interpolation. However, I don't know how to write code that does this several times at different positions.
My data looks something like this (with many more rows and columns):
T1 T2 time_step
0 15 30 30
1 19 40 90
2 18 30 30
3 16 50 90
4 16 70 30
...
and I want it to look like this before interpolation:
T1 T2 time_step
0 15 30 30
1 NaN NaN 30
2 NaN NaN 30
3 19 40 90
4 18 30 30
5 NaN NaN 30
6 NaN NaN 30
7 16 50 90
8 16 70 30
...
I found some code on this site: https://www.geeksforgeeks.org/insert-row-at-given-position-in-pandas-dataframe/ which inserts a row at a given position, but my problem is that I cannot write a program that inserts rows several times at different positions in a dataframe.
Is it even possible with pandas? Is there any other way to do it with Python?
I want to write something like this, but I don't know how to write correct code for it:
### The function close to what I found in the link above:
def Insert_row(row_number, df, m, row_value):
    start_upper = 0
    end_upper = row_number
    start_lower = row_number
    end_lower = df.shape[0]
    upper_half = [*range(start_upper, end_upper, 1)]
    lower_half = [*range(start_lower, end_lower, 1)]
    # Shift the lower half down by m to make room for the new rows
    lower_half = [x + m for x in lower_half]
    index_ = upper_half + lower_half
    df.index = index_
    for i in range(row_number, row_number + m):
        df.loc[i] = row_value
    df = df.sort_index()
    return df
### The main problem:
for j in range(dm.shape[0]):
    if dm['time_step'][j] != 30:
        row_number = j
        m = dm['time_step'][j] // 30 - 1
        row_value = [np.nan, np.nan, 30]
        dm_new = Insert_row(row_number, dm, m, row_value)
dm_new = dm_new.interpolate()
(I know the range is wrong and it is wrong to modify what I'm iterating over but I don't know how to write it correctly.)
Answer:
There are many ways to solve this. If I understand correctly, you want to insert two rows every time time_step is 90. If you have more general cases as well, we would need to modify the solution a bit.
This is a very imperative requirement. For this case, I would work directly with the row positions of the data and therefore use the underlying numpy array instead of pandas.
In [54]: a = df.copy().values
In [55]: a
Out[55]:
array([[15, 30, 30],
[19, 40, 90],
[18, 30, 30],
[16, 50, 90],
[16, 70, 30]])
In [56]: index = np.squeeze(np.argwhere(a[:,2] > 30))
In [57]: index
Out[57]: array([1, 3])
In [58]: np.insert(a.astype(np.float64), obj=np.repeat(index, 2), values=np.array([np.nan, np.nan, 30]), axis=0)
Out[58]:
array([[15., 30., 30.],
[nan, nan, 30.],
[nan, nan, 30.],
[19., 40., 90.],
[18., 30., 30.],
[nan, nan, 30.],
[nan, nan, 30.],
[16., 50., 90.],
[16., 70., 30.]])
Then you can create a new DataFrame with these values.
In [74]: pd.DataFrame(np.insert(a.astype(np.float64), obj=np.repeat(index, 2), values=np.array([np.nan, np.nan, 30]), axis=0), columns=df.columns)
Out[74]:
T1 T2 time_step
0 15.0 30.0 30.0
1 NaN NaN 30.0
2 NaN NaN 30.0
3 19.0 40.0 90.0
4 18.0 30.0 30.0
5 NaN NaN 30.0
6 NaN NaN 30.0
7 16.0 50.0 90.0
8 16.0 70.0 30.0
index holds the row numbers where the third column ('time_step') is larger than 30. You could still do this with pandas. However, pandas has no dedicated function for inserting rows in the middle of a DataFrame, so we go to numpy. We need to convert to float because np.nan is not a valid integer. Repeating each entry of index twice makes np.insert add the row twice at that position. Finally, the row to insert is np.array([np.nan, np.nan, 30]), and axis=0 means we insert rows.