Pandas dataframe add rows with NaN in one column based on values in other columns-CodePudding

For example, there are three columns in a dataframe, x, y, z. x and y have 3 different values with 0.5 intervals. They are coordinates so they map with each other and there will be 3*3=9 rows with some z values. But the actual dataframe has only let say 7 rows. How to add two missing rows with NaN value in z column? Below are the example input and output. Thank you!

Input:

DataFrame:
     x    y   z
0  -0.5   0   5
1  -0.5 -0.5  10
2    0  -0.5  7
3    0   0.5  6
4    0    0   12
5   0.5   0   8
6   0.5  0.5  2

Output:

DataFrame:
     x    y   z
0  -0.5   0   5
1  -0.5 -0.5  10
2    0  -0.5  7
3    0   0.5  6
4    0    0   12
5   0.5   0   8
6   0.5  0.5  2
7  -0.5  0.5  NaN   // missing row
8   0.5 -0.5  NaN   // missing row

CodePudding user response：

One option is with complete from pyjanitor, to add the missing rows, based on a combination of x and y :

# pip install pyjanitor
import pandas as pd
import janitor

df.complete('x', 'y')

     x    y     z
0 -0.5  0.0   5.0
1 -0.5 -0.5  10.0
2 -0.5  0.5   NaN
3  0.0  0.0  12.0
4  0.0 -0.5   7.0
5  0.0  0.5   6.0
6  0.5  0.0   8.0
7  0.5 -0.5   NaN
8  0.5  0.5   2.0

complete is just an efficient helper (wrapper around pandas functions); if your data does not have duplicates that can throw off pivot, then use it:

 df.pivot('x', 'y', 'z').stack(dropna=False).rename('z').reset_index()

     x    y     z
0 -0.5 -0.5  10.0
1 -0.5  0.0   5.0
2 -0.5  0.5   NaN
3  0.0 -0.5   7.0
4  0.0  0.0  12.0
5  0.0  0.5   6.0
6  0.5 -0.5   NaN
7  0.5  0.0   8.0
8  0.5  0.5   2.0

CodePudding user response：

import numpy as np

new_row = {"x": -0.5, "y": 0.5, "z": np.nan}

df.append(new_row, ignore_index=True)

I don't know exactly how you calculate "x" and "y" values, but I guess your question is directed towards the NaN value in the "z" column.

CodePudding user response：

You can find the product of [-0.5,0,0.5] with itertools.product

import itertools

lst = pd.concat([df['x'], df['y']]).unique().tolist()
p = list(itertools.product(lst, repeat=2))

print(p)

[(-0.5, -0.5), (-0.5, 0.0), (-0.5, 0.5), (0.0, -0.5), (0.0, 0.0), (0.0, 0.5), (0.5, -0.5), (0.5, 0.0), (0.5, 0.5)]

Then fill the missing index:

out = df.set_index(['x', 'y']).reindex(p).reset_index()

print(out)

     x    y     z
0 -0.5 -0.5  10.0
1 -0.5  0.0   5.0
2 -0.5  0.5   NaN
3  0.0 -0.5   7.0
4  0.0  0.0  12.0
5  0.0  0.5   6.0
6  0.5 -0.5   NaN
7  0.5  0.0   8.0
8  0.5  0.5   2.0