I have a dataset that is formatted like this:
index string
1 2008
1 2009
1 2010
2
2
2
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5
5
5
I would like to fill in the missing data with the same sequence like this:
index string
1 2008
1 2009
1 2010
2 <-2008
2 <-2009
2 <-2010
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5 <-2008
5 <-2009
5 <-2010
So the final result looks like this:
index string
1 2008
1 2009
1 2010
2 2008
2 2009
2 2010
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5 2008
5 2009
5 2010
I am currently doing this in Excel, which is impractical given the number of rows that need to be filled.
I tried fillna(method='ffill', limit=2, inplace=True), but that only repeats the value from the previous cell instead of restarting the sequence. Any help is appreciated.
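For reference, here is a minimal sketch of the problem. It assumes the data is loaded into a pandas DataFrame named df with the columns index and string; the construction below only rebuilds the sample shown above. It uses Series.ffill, the method form of fillna(method='ffill'):

```python
import numpy as np
import pandas as pd

years = [2008, 2009, 2010]

# Rebuild the sample: groups 2 and 5 have no years (assumed to be NaN).
df = pd.DataFrame({
    "index": np.repeat([1, 2, 3, 4, 5], 3),
    "string": years + [np.nan] * 3 + years * 2 + [np.nan] * 3,
})

# Forward fill copies the last valid value (2010) into at most two rows
# of each gap, so the third row of groups 2 and 5 stays NaN and the
# 2008, 2009, 2010 sequence is never restarted.
filled = df["string"].ffill(limit=2)
print(pd.concat([df["index"], filled], axis=1))
```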
CodePudding user response:
Try this:
# Boolean mask of the rows where 'string' is NaN
m = df['string'].isna()
# m.sum() is the number of NaN rows
# Repeat [2008, 2009, 2010] (number of NaNs // 3) times -> [2008, 2009, 2010, 2008, 2009, 2010, ...]
# This assumes the NaN count is a multiple of 3
df.loc[m, 'string'] = [2008, 2009, 2010] * (m.sum() // 3)
Output:
string
index
1 2008
1 2009
1 2010
2 2008
2 2009
2 2010
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5 2008
5 2009
5 2010
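As a self-contained sketch of the same idea, the snippet below rebuilds the sample data and checks the block-size assumption before assigning. The DataFrame construction, the years name, and the use of np.tile in place of list repetition are my own choices for illustration, not part of the answer above:

```python
import numpy as np
import pandas as pd

years = [2008, 2009, 2010]

# Assumed reconstruction of the question's data (NaN where years are missing).
df = pd.DataFrame({
    "index": np.repeat([1, 2, 3, 4, 5], 3),
    "string": years + [np.nan] * 3 + years * 2 + [np.nan] * 3,
})

m = df["string"].isna()
n_missing = int(m.sum())

# The tiled sequence only lines up if the gaps come in whole blocks of 3.
if n_missing % len(years) != 0:
    raise ValueError("number of missing rows is not a multiple of the sequence length")

df.loc[m, "string"] = np.tile(years, n_missing // len(years))

# No NaN is left, so the column can be converted to plain integers.
df["string"] = df["string"].astype(int)
print(df)
```

This only works when every gap is a complete, ordered block of three rows. If some groups are missing just one or two years, the values will be misaligned.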
CodePudding user response:
You can try:
l = [2008, 2009, 2010]
# is the row NaN?
m = df['string'].isna()
# update with 2008, 2009, etc. in a defined order
df.loc[m, 'string'] = (df.groupby('index').cumcount()
                         .map(dict(enumerate(l)))
                       )
# convert dtype if needed
df['string'] = df['string'].convert_dtypes()
Alternatively, just define a start year:
start = 2008
m = df['string'].isna()
df.loc[m, 'string'] = df.groupby('index').cumcount().add(start)
df['string'] = df['string'].convert_dtypes()
Output:
index string
0 1 2008
1 1 2009
2 1 2010
3 2 2008
4 2 2009
5 2 2010
6 3 2008
7 3 2009
8 3 2010
9 4 2008
10 4 2009
11 4 2010
12 5 2008
13 5 2009
14 5 2010
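To try this end to end, here is a minimal runnable sketch of the cumcount approach. The DataFrame construction and the names years and position are assumptions added for illustration, and astype('Int64') is used in place of convert_dtypes() to make the target dtype explicit:

```python
import numpy as np
import pandas as pd

years = [2008, 2009, 2010]

# Assumed reconstruction of the question's data (NaN where years are missing).
df = pd.DataFrame({
    "index": np.repeat([1, 2, 3, 4, 5], 3),
    "string": years + [np.nan] * 3 + years * 2 + [np.nan] * 3,
})

m = df["string"].isna()

# Position of each row within its 'index' group: 0, 1, 2, 0, 1, 2, ...
position = df.groupby("index").cumcount()

# Translate positions to years; the assignment aligns on the row labels,
# so only the rows selected by the mask are overwritten.
df.loc[m, "string"] = position.map(dict(enumerate(years)))

# Nullable integer dtype, which would also tolerate any remaining NaN.
df["string"] = df["string"].astype("Int64")
print(df)
```

Because the year is looked up from the row's position inside its group, this should also cope with a group where only some rows are missing, as long as each group lists its rows in year order. A group with more rows than there are entries in years would get NaN for the extra rows.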