How do I fill missing sequential data in a dataframe?

Time:11-03

I have a dataset that is formatted like this:

index   string
1   2008
1   2009
1   2010
2   
2   
2   
3   2008
3   2009
3   2010
4   2008
4   2009
4   2010
5   
5   
5   

I would like to fill in the missing data with the same sequence, like this:

index   string      
1   2008        
1   2009        
1   2010        
2       <-2008
2       <-2009
2       <-2010
3   2008        
3   2009        
3   2010        
4   2008        
4   2009        
4   2010        
5       <-2008
5       <-2009
5       <-2010

So the final result looks like this:

index   string
1   2008
1   2009
1   2010
2   2008
2   2009
2   2010
3   2008
3   2009
3   2010
4   2008
4   2009
4   2010
5   2008
5   2009
5   2010

I am currently doing this in Excel, which is impractical because of the number of rows that need to be filled.

I tried using fillna(method = 'ffill', limit = 2, inplace = True), but this only fills each empty cell with the value from the previous cell. Any help is appreciated.
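For reference, here is a minimal sketch that rebuilds the sample data and reproduces the forward-fill behaviour described in the question. It assumes the empty cells load as NaN and that the column names are exactly `index` and `string`; the name `filled` is only for illustration.

```python
import numpy as np
import pandas as pd

years = [2008, 2009, 2010]

# Rebuild the sample data; NaN stands in for the empty cells (an assumption).
df = pd.DataFrame({
    "index": np.repeat([1, 2, 3, 4, 5], 3),
    "string": years + [np.nan] * 3 + years * 2 + [np.nan] * 3,
})

# Forward fill copies the last valid value downward, so each gap receives
# 2010 (at most twice because of limit=2) instead of 2008, 2009, 2010.
filled = df["string"].ffill(limit=2)
print(filled.tolist())
```

Forward fill has no notion of a repeating sequence, which is why the answers build the values explicitly.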

CodePudding user response:

Try this:

# Boolean mask of the rows where 'string' is NaN
m = df['string'].isna()

# m.sum() is the number of NaN rows. Repeat [2008, 2009, 2010] once per block of
# three missing rows -> [2008, 2009, 2010, 2008, 2009, 2010, ...]
# (this assumes the NaN rows always come in complete blocks of three)
df.loc[m, 'string'] = [2008, 2009, 2010]*(m.sum()//3)

Output:

       string
index        
1        2008
1        2009
1        2010
2        2008
2        2009
2        2010
3        2008
3        2009
3        2010
4        2008
4        2009
4        2010
5        2008
5        2009
5        2010
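A self-contained sketch of the same idea, with a guard for the case where the number of missing rows is not a whole multiple of the sequence length. The sample frame, the `n_missing` name and the error message are assumptions added for illustration; they are not part of the answer above.

```python
import numpy as np
import pandas as pd

years = [2008, 2009, 2010]

# Sample data modelled on the question; NaN marks the empty cells (an assumption).
df = pd.DataFrame({
    "index": np.repeat([1, 2, 3, 4, 5], 3),
    "string": years + [np.nan] * 3 + years * 2 + [np.nan] * 3,
})

m = df["string"].isna()
n_missing = int(m.sum())

# Tiling the list only lines up if every gap is a complete block of len(years) rows.
if n_missing % len(years) != 0:
    raise ValueError("missing rows are not a multiple of the sequence length")

# Assign the tiled sequence to the NaN rows, in row order.
df.loc[m, "string"] = years * (n_missing // len(years))

# No NaN remains, so the column can be cast back to integers.
df["string"] = df["string"].astype(int)
print(df)
```

The list is assigned positionally, so this relies on each gap starting at the first year of the sequence.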

CodePudding user response:

You can try:

l = [2008, 2009, 2010]

# is the row NaN?
m = df['string'].isna()

# map each row's position within its 'index' group (0, 1, 2) to 2008, 2009, 2010
# and write the result only into the NaN rows
df.loc[m, 'string'] = (df.groupby('index').cumcount()
                         .map(dict(enumerate(l)))
                      )

# convert dtype if needed
df['string'] = df['string'].convert_dtypes()

Alternatively, define only a start year:

start = 2008

m = df['string'].isna()

# position within each 'index' group (0, 1, 2) plus the start year
df.loc[m, 'string'] = df.groupby('index').cumcount().add(start)
                       
df['string'] = df['string'].convert_dtypes()

Output:

    index  string
0       1    2008
1       1    2009
2       1    2010
3       2    2008
4       2    2009
5       2    2010
6       3    2008
7       3    2009
8       3    2010
9       4    2008
10      4    2009
11      4    2010
12      5    2008
13      5    2009
14      5    2010
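Because the group-based approach derives each value from the row's position inside its `index` group, it should also cope with groups that are only partly empty. The sketch below illustrates this on a smaller, made-up frame (group 3 is missing only its middle row); the data and the `pos` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

years = [2008, 2009, 2010]

# Hypothetical data: group 2 is fully empty, group 3 is missing only one row.
df = pd.DataFrame({
    "index": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "string": [2008, 2009, 2010, np.nan, np.nan, np.nan, 2008, np.nan, 2010],
})

# Position of each row within its group: 0, 1, 2, 0, 1, 2, ...
pos = df.groupby("index").cumcount()

# Fill only the NaN rows, translating position -> year; .loc aligns on the row labels.
m = df["string"].isna()
df.loc[m, "string"] = pos[m].map(dict(enumerate(years)))

# No NaN remains, so the column can be cast back to integers.
df["string"] = df["string"].astype(int)
print(df)
```

This assumes every group lists its rows in sequence order and has no more rows than there are years; a longer group would map to NaN for the extra positions.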