Copy the value of a specific row and paste it into multiple rows that have the same id

I have a dataset with four columns: ID, Step, col1 and col2. For each ID, the row whose Step is NaN is the only one that holds the values for col1 and col2.

I want to fill the missing data of col1 and col2 for each unique ID.

The original data frame looks like:

ID        Step   col1       col2      
7001      Nan    1.0        6.0
7001      0      Nan        Nan
7001      1      Nan        Nan
6500      Nan    12.0       3.0
6500      0      Nan        Nan
6500      1      Nan        Nan

I want this result:

ID        Step   col1      col2
7001      Nan    1.0        6.0
7001      0      1.0        6.0
7001      1      1.0        6.0
6500      Nan    12.0       3.0
6500      0      12.0       3.0
6500      1      12.0       3.0

I can't seem to find a good way to do this that does not take too long, as I have a lot of data to process (10 GB).

CodePudding user response：

If your file is sorted, you can do a fillna with method="ffill" (newer pandas versions deprecate that argument in favour of calling .ffill() directly): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

Note that you have to apply it only to col1 and col2; otherwise the NaN in Step would (wrongly) be filled as well.

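from io import StringIO
import pandas as pd
# x is the table above as a string; '\s+' splits on runs of whitespace
y = pd.read_csv(StringIO(x), sep=r'\s+', na_values='Nan')
y['col1'] = y['col1'].ffill()  # same effect as fillna(method='ffill')
y['col2'] = y['col2'].ffill()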

gives:

     ID  Step  col1  col2
0  7001   NaN   1.0   6.0
1  7001   0.0   1.0   6.0
2  7001   1.0   1.0   6.0
3  6500   NaN  12.0   3.0
4  6500   0.0  12.0   3.0
5  6500   1.0  12.0   3.0

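If your data is not sorted by ID, you can always sort it first: y_sorted = y.sort_values(["ID"], kind="stable"). A stable sort keeps the rows of each ID in their original relative order; the forward fill still relies on the row holding the values coming first within each ID.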
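
If sorting 10 GB is unattractive, a groupby-based variant might avoid it. This is only a sketch: it assumes each ID has exactly one row carrying col1 and col2 (as in the question), and the interleaved sample frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: same values as the question, but with the IDs interleaved
df = pd.DataFrame({
    "ID":   [7001, 6500, 7001, 6500, 7001, 6500],
    "Step": [np.nan, np.nan, 0, 0, 1, 1],
    "col1": [1.0, 12.0, np.nan, np.nan, np.nan, np.nan],
    "col2": [6.0, 3.0, np.nan, np.nan, np.nan, np.nan],
})

# "first" picks the first non-null value per group and column;
# transform broadcasts it back to every row of that group
df[["col1", "col2"]] = df.groupby("ID")[["col1", "col2"]].transform("first")
print(df)
```

Because the fill is done per ID rather than by row position, the row order does not matter here and Step is left untouched.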

CodePudding user response：

First, it doesn't look like those are proper NaN values, so let's make them so:

df.replace('Nan', np.nan, inplace=True)

Then we can ffill() the desired columns:

df[['col1', 'col2']] = df[['col1', 'col2']].ffill()

Output:

     ID Step  col1 col2
0  7001  NaN   1.0  6.0
1  7001    0   1.0  6.0
2  7001    1   1.0  6.0
3  6500  NaN  12.0  3.0
4  6500    0  12.0  3.0
5  6500    1  12.0  3.0
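
Since the question mentions 10 GB, the same forward fill could probably be applied chunk by chunk instead of loading everything at once. A rough sketch, assuming the data is a comma-separated file whose rows for each ID are contiguous with the value-holding row first; the function name ffill_in_chunks, the chunk size, and the in-memory sample are hypothetical.

```python
import io
import pandas as pd

COLS = ["col1", "col2"]

def ffill_in_chunks(source, sink, chunksize=100_000):
    """Forward-fill COLS while streaming a CSV from source to sink."""
    carry = None  # last filled col1/col2 values of the previous chunk
    reader = pd.read_csv(source, na_values="Nan", chunksize=chunksize)
    for i, chunk in enumerate(reader):
        chunk[COLS] = chunk[COLS].ffill()
        if carry is not None:
            # Rows at the top of a chunk may continue an ID that started
            # in the previous chunk, so fill them from the carried values
            chunk[COLS] = chunk[COLS].fillna(carry)
        carry = chunk[COLS].iloc[-1]
        # Write the header only once, with the first chunk
        chunk.to_csv(sink, index=False, header=(i == 0))

# Tiny in-memory stand-in for the real file (assumed CSV layout)
raw = io.StringIO(
    "ID,Step,col1,col2\n"
    "7001,Nan,1.0,6.0\n"
    "7001,0,Nan,Nan\n"
    "7001,1,Nan,Nan\n"
    "6500,Nan,12.0,3.0\n"
    "6500,0,Nan,Nan\n"
    "6500,1,Nan,Nan\n"
)
out = io.StringIO()
ffill_in_chunks(raw, out, chunksize=2)  # small chunks to exercise the carry-over
result = pd.read_csv(io.StringIO(out.getvalue()))
print(result)
```

For a real file, source and sink would be paths or open file handles, and the chunk size would be tuned to the available memory.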