I have a dataset with four columns: ID, Step, col1 and col2. For each ID, the Step column has one row with a NaN value, and that row is where the data for col1 and col2 are stored.
I want to fill the missing data of col1 and col2 for each unique ID.
The original data frame looks like:
ID Step col1 col2
7001 Nan 1.0 6.0
7001 0 Nan Nan
7001 1 Nan Nan
6500 Nan 12.0 3.0
6500 0 Nan Nan
6500 1 Nan Nan
I want this result:
ID Step col1 col2
7001 Nan 1.0 6.0
7001 0 1.0 6.0
7001 1 1.0 6.0
6500 Nan 12.0 3.0
6500 0 12.0 3.0
6500 1 12.0 3.0
I can't seem to find a good way to do this that doesn't take too long, as I have a lot of data to process (10 GB).
CodePudding user response:
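If your file is sorted, you can forward-fill with fillna and method="ffill" (newer pandas versions deprecate that keyword in favour of the equivalent ffill() method):
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
Note that you will have to apply it per column, as otherwise the Step values would be (wrongly) propagated too: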
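from io import StringIO
import pandas as pd
# x is the raw whitespace-separated text shown in the question
y = pd.read_csv(StringIO(x), sep=r'\s+', na_values='Nan')
y['col1'] = y['col1'].ffill()  # same effect as fillna(method='ffill')
y['col2'] = y['col2'].ffill()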
gives:
ID Step col1 col2
0 7001 NaN 1.0 6.0
1 7001 0.0 1.0 6.0
2 7001 1.0 1.0 6.0
3 6500 NaN 12.0 3.0
4 6500 0.0 12.0 3.0
5 6500 1.0 12.0 3.0
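If your data is not sorted by ID, you can always sort it first. Sorting on Step as well, with NaN placed first, keeps the row that holds the values at the top of each ID group, which the forward fill relies on:
y_sorted = y.sort_values(["ID", "Step"], na_position="first")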
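If sorting is not desirable, a per-ID fill might be sketched with groupby instead. This is only an illustration, not part of the answer above: the shuffled sample below is the question's data in a different row order, and it assumes each ID has exactly one row carrying the col1 and col2 values.

```python
from io import StringIO

import pandas as pd

# The question's sample, deliberately shuffled (row order is an assumption
# made here to show that no sorting is needed).
raw = """ID Step col1 col2
7001 0 Nan Nan
7001 Nan 1.0 6.0
6500 1 Nan Nan
7001 1 Nan Nan
6500 Nan 12.0 3.0
6500 0 Nan Nan
"""

df = pd.read_csv(StringIO(raw), sep=r"\s+", na_values="Nan")
cols = ["col1", "col2"]

# "first" takes the first non-null value of each column within an ID group,
# and transform broadcasts it back onto every row of that group.
df[cols] = df.groupby("ID")[cols].transform("first")
print(df)
```

Because the values are looked up per ID, the Step column is left untouched and the original row order is preserved.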
CodePudding user response:
First, it doesn't look like those are proper NaN values, so let's make them so:
df.replace('Nan', np.nan, inplace=True)
Then we can ffill() the desired columns:
df[['col1', 'col2']] = df[['col1', 'col2']].ffill()
Output:
ID Step col1 col2
0 7001 NaN 1.0 6.0
1 7001 0 1.0 6.0
2 7001 1 1.0 6.0
3 6500 NaN 12.0 3.0
4 6500 0 12.0 3.0
5 6500 1 12.0 3.0
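Since the question mentions roughly 10 GB of data, the same forward fill could also be applied chunk by chunk rather than on the whole file at once. The following is a minimal sketch under stated assumptions: the helper name ffill_in_chunks and the chunk size are hypothetical, the file is assumed to be whitespace-separated with 'Nan' markers as in the question, and the row holding the values is assumed to come before the other rows of its ID.

```python
from io import StringIO

import pandas as pd


def ffill_in_chunks(source, chunksize, cols=("col1", "col2")):
    """Yield forward-filled chunks, carrying the last values across chunks."""
    cols = list(cols)
    carry = None  # last filled values seen in the previous chunk
    reader = pd.read_csv(source, sep=r"\s+", na_values="Nan", chunksize=chunksize)
    for chunk in reader:
        chunk[cols] = chunk[cols].ffill()
        if carry is not None:
            # Rows at the top of a chunk have nothing above them to fill from,
            # so they take the values carried over from the previous chunk.
            chunk[cols] = chunk[cols].fillna(carry)
        carry = chunk[cols].iloc[-1]
        yield chunk


# Small demo on the question's sample; a real run would pass a file path
# and a much larger chunksize.
raw = """ID Step col1 col2
7001 Nan 1.0 6.0
7001 0 Nan Nan
7001 1 Nan Nan
6500 Nan 12.0 3.0
6500 0 Nan Nan
6500 1 Nan Nan
"""
result = pd.concat(ffill_in_chunks(StringIO(raw), chunksize=2))
print(result)
```

In practice each chunk would probably be written out (for example appended to an output CSV) instead of being concatenated, so that only one chunk is held in memory at a time.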