I have a question about how to quickly fillna with a sequence in Python(pandas).I have a dataset like following(the true dataset is longer),
Time | Number |
---|---|
t0 | NA |
t1 | NA |
t2 | NA |
t3 | 0 |
t4 | NA |
t5 | NA |
t6 | NA |
t7 | NA |
t8 | 0 |
t9 | NA |
My requirement is to add numbers to N lines before and after non-blank lines, and the sequence range is range(-N,N 1).The interval between any two non-empty rows in the dataset is greater than C(constant), our N will be less than C, so there is no need to consider the coverage problem for the time being. Assuming N=2, the result I need is as follows :
Time | Number |
---|---|
t0 | NA |
t1 | -2 |
t2 | -1 |
t3 | 0 |
t4 | 1 |
t5 | 2 |
t6 | -2 |
t7 | -1 |
t8 | 0 |
t9 | 1 |
At present, the only way I can think of is to use a loop, but the efficiency is low. Does pandas have such a method to do it quickly?
CodePudding user response:
There are still some unknowns in your question, like what happens if the intervals overlap. Here I will consider that a further interval overwrites the previous one (you can do the other way around with a change of code).
Using rolling
, groupby.cumcount
, and a mask
:
s = df['Number'].notna().shift(-N)
m = s.rolling(2*N 1, min_periods=1).max().astype(bool)
df['Number2'] = df.groupby(s.cumsum()).cumcount().sub(N).where(m)
NB. I used a slightly different example to show the overlap.
output:
Time Number Number2
0 t0 NaN NaN
1 t1 NaN -2.0
2 t2 NaN -1.0
3 t3 0.0 0.0
4 t4 NaN 1.0
5 t5 NaN -2.0 # here we have an overlap, use latter value
6 t6 NaN -1.0
7 t7 0.0 0.0
8 t8 NaN -2.0
9 t9 NaN -1.0
CodePudding user response:
Could we have a bit more information, please?
I notice in your end result t0
is still NA, why is that? Do you still want to keep some values as NaN?
In the example, t3
was 0
originally, and in the result it was still 0
, which made sense within the sequence. But what if it was already populated, and not with 0
but another value like 5
? Would you want to keep the 5
or assign the 0
from the sequence in that position? And if you kept the 5
, would then t4
be 0
?
Here's a few solutions just in case:
To fill in the NaN values with a sequence (skipping populated values)
#!/usr/bin/env python
import pandas as pd
df = pd.DataFrame([
{'Time': 't0', 'Number': None},
{'Time': 't1', 'Number': None},
{'Time': 't2', 'Number': None},
{'Time': 't3', 'Number': 0},
{'Time': 't4', 'Number': None},
{'Time': 't5', 'Number': None},
])
len_null = df['Number'].isna().sum()
starting_number = -2
df.loc[df['Number'].isna(), 'Number'] = list(range(starting_number, len_null-abs(starting_number)))
print(df)
Will give you this:
Time Number
0 t0 -2.0
1 t1 -1.0
2 t2 0.0
3 t3 0.0
4 t4 1.0
5 t5 2.0
To fill in the entire column with a sequence (overwriting populated values)
#!/usr/bin/env python
import pandas as pd
df = pd.DataFrame([
{'Time': 't0', 'Number': None},
{'Time': 't1', 'Number': None},
{'Time': 't2', 'Number': None},
{'Time': 't3', 'Number': 0},
{'Time': 't4', 'Number': None},
{'Time': 't5', 'Number': None},
])
starting_number = -2
df['Number'] = list(range(starting_number, len(df)-abs(starting_number)))
print(df)
Will give you:
Time Number
0 t0 -2
1 t1 -1
2 t2 0
3 t3 1
4 t4 2
5 t5 3
To fill in the NAN values with a sequence (skipping the sequence where values are populated)
#!/usr/bin/env python
import pandas as pd
df = pd.DataFrame([
{'Time': 't0', 'Number': None},
{'Time': 't1', 'Number': None},
{'Time': 't2', 'Number': None},
{'Time': 't3', 'Number': 0},
{'Time': 't4', 'Number': None},
{'Time': 't5', 'Number': None},
])
starting_number = -2
new_series = pd.Series(list(range(starting_number, len(df)-abs(starting_number))))
df['Number'] = df['Number'].fillna(new_series)
print(df)
Will give you:
Time Number
0 t0 -2.0
1 t1 -1.0
2 t2 0.0
3 t3 0.0
4 t4 2.0
5 t5 3.0
(Notice that 1
would have gone into the t3
slot, but since there was a 0
there already, it just skipped it)