filter rows based on a categorical column and then fill NA rows in a different column of the filtere

Time:10-07

I have a dataframe with one categorical column and two numerical columns. The categorical column has no missing values, but in some rows the first numerical column is NA. I would like to fill those NA values in the first numerical column with the value from the same row of the second numerical column, but only for rows that belong to certain categories, not for every row with an NA. I want to do this without changing the shape of the original dataframe. Example dataset df below:

Cat_col  num_col1  num_col2
SS       22        54
PP       NA        89
CC       128       34
XX       NA        56
SS       67        56
XX       NA        90
CC       47        10
BB       NA        29

From the above table, I want to fill the NA values of num_col1 with the corresponding row values of num_col2, but only for the PP and XX categories in Cat_col, and without changing the shape of the dataframe.

CodePudding user response:

First of all, you should provide a piece of your code showing your effort to solve the problem. If I understand your question correctly, a solution could look as follows:

import pandas as pd

data = '''Cat_col num_col1 num_col2 SS 22 54 PP NA 89 CC 128 34 XX NA 56 SS 67 56 XX NA 90 CC 47 10 BB NA 29'''.split()

Preparing the data into column/row format:

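n = 3  # number of columns
# Split the flat list into chunks of n values: a header row followed by data rows
result = [data[i:i + n] for i in range(0, len(data), n)]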

Create a dataframe and filter for categories:

df = pd.DataFrame(result[1:], columns=result[0])
cat_filter = ['PP', 'XX']
# Filter for NA values. Here they are the string 'NA'; it would be much easier if
# missing values were stored as np.nan, because then df['num_col1'].isna() could be applied.
na_filter = df['num_col1'] == 'NA'
# Boolean mask selecting the rows that are in the chosen categories and have NA
row_mask = df['Cat_col'].isin(cat_filter) & na_filter

Assign values from num_col2 to num_col1:

df.loc[row_mask,'num_col1'] = df.loc[row_mask, 'num_col2']

Output:

  Cat_col num_col1 num_col2
0      SS       22       54
1      PP       89       89
2      CC      128       34
3      XX       56       56
4      SS       67       56
5      XX       90       90
6      CC       47       10
7      BB       NA       29
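
As noted in the comment above, the filter gets simpler when missing values are real NaN rather than the string 'NA'. Here is a minimal sketch under that assumption. The DataFrame construction is my own restatement of the example data, not the asker's code, so adapt it to however your data is actually loaded:

```python
import numpy as np
import pandas as pd

# Assumption: missing values are stored as np.nan, not as the string 'NA'
df = pd.DataFrame({
    "Cat_col": ["SS", "PP", "CC", "XX", "SS", "XX", "CC", "BB"],
    "num_col1": [22, np.nan, 128, np.nan, 67, np.nan, 47, np.nan],
    "num_col2": [54, 89, 34, 56, 56, 90, 10, 29],
})

cat_filter = ["PP", "XX"]

# Rows that are in the chosen categories AND have a missing num_col1
row_mask = df["Cat_col"].isin(cat_filter) & df["num_col1"].isna()

# Fill only those rows in place; the shape of df stays (8, 3)
df.loc[row_mask, "num_col1"] = df.loc[row_mask, "num_col2"]

print(df)
```

Because of the NaN values, num_col1 is a float column here, so the filled values show up as 89.0, 56.0 and 90.0. The BB row keeps its NaN since BB is not in cat_filter.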