Home > OS >  How to quickly fillna with a sequence
How to quickly fillna with a sequence

Time:08-25

I have a question about how to quickly fillna with a sequence in Python(pandas).I have a dataset like following(the true dataset is longer),

Time Number
t0 NA
t1 NA
t2 NA
t3 0
t4 NA
t5 NA
t6 NA
t7 NA
t8 0
t9 NA

My requirement is to add numbers to N lines before and after non-blank lines, and the sequence range is range(-N,N 1).The interval between any two non-empty rows in the dataset is greater than C(constant), our N will be less than C, so there is no need to consider the coverage problem for the time being. Assuming N=2, the result I need is as follows :

Time Number
t0 NA
t1 -2
t2 -1
t3 0
t4 1
t5 2
t6 -2
t7 -1
t8 0
t9 1

At present, the only way I can think of is to use a loop, but the efficiency is low. Does pandas have such a method to do it quickly?

CodePudding user response:

There are still some unknowns in your question, like what happens if the intervals overlap. Here I will consider that a further interval overwrites the previous one (you can do the other way around with a change of code).

Using rolling, groupby.cumcount, and a mask:

s = df['Number'].notna().shift(-N)
m = s.rolling(2*N 1, min_periods=1).max().astype(bool)

df['Number2'] = df.groupby(s.cumsum()).cumcount().sub(N).where(m)

NB. I used a slightly different example to show the overlap.

output:

  Time  Number  Number2
0   t0     NaN      NaN
1   t1     NaN     -2.0
2   t2     NaN     -1.0
3   t3     0.0      0.0
4   t4     NaN      1.0
5   t5     NaN     -2.0  # here we have an overlap, use latter value
6   t6     NaN     -1.0
7   t7     0.0      0.0
8   t8     NaN     -2.0
9   t9     NaN     -1.0

CodePudding user response:

Could we have a bit more information, please?

I notice in your end result t0 is still NA, why is that? Do you still want to keep some values as NaN?

In the example, t3 was 0 originally, and in the result it was still 0, which made sense within the sequence. But what if it was already populated, and not with 0 but another value like 5? Would you want to keep the 5 or assign the 0 from the sequence in that position? And if you kept the 5, would then t4 be 0?

Here's a few solutions just in case:

To fill in the NaN values with a sequence (skipping populated values)

#!/usr/bin/env python 
import pandas as pd

df = pd.DataFrame([
    {'Time': 't0', 'Number': None},
    {'Time': 't1', 'Number': None},
    {'Time': 't2', 'Number': None},
    {'Time': 't3', 'Number': 0},
    {'Time': 't4', 'Number': None},
    {'Time': 't5', 'Number': None},
])

len_null = df['Number'].isna().sum()
starting_number = -2

df.loc[df['Number'].isna(), 'Number'] = list(range(starting_number, len_null-abs(starting_number)))

print(df)

Will give you this:

  Time  Number
0   t0    -2.0
1   t1    -1.0
2   t2     0.0
3   t3     0.0
4   t4     1.0
5   t5     2.0

To fill in the entire column with a sequence (overwriting populated values)

#!/usr/bin/env python

import pandas as pd

df = pd.DataFrame([
    {'Time': 't0', 'Number': None},
    {'Time': 't1', 'Number': None},
    {'Time': 't2', 'Number': None},
    {'Time': 't3', 'Number': 0},
    {'Time': 't4', 'Number': None},
    {'Time': 't5', 'Number': None},
])

starting_number = -2
df['Number'] = list(range(starting_number, len(df)-abs(starting_number)))
print(df)

Will give you:

  Time  Number
0   t0      -2
1   t1      -1
2   t2       0
3   t3       1
4   t4       2
5   t5       3

To fill in the NAN values with a sequence (skipping the sequence where values are populated)

#!/usr/bin/env python

import pandas as pd

df = pd.DataFrame([
    {'Time': 't0', 'Number': None},
    {'Time': 't1', 'Number': None},
    {'Time': 't2', 'Number': None},
    {'Time': 't3', 'Number': 0},
    {'Time': 't4', 'Number': None},
    {'Time': 't5', 'Number': None},
])

starting_number = -2
new_series = pd.Series(list(range(starting_number, len(df)-abs(starting_number))))
df['Number'] = df['Number'].fillna(new_series)
print(df)

Will give you:

  Time  Number
0   t0    -2.0
1   t1    -1.0
2   t2     0.0
3   t3     0.0
4   t4     2.0
5   t5     3.0

(Notice that 1 would have gone into the t3 slot, but since there was a 0 there already, it just skipped it)

  • Related