How to compare pandas data frames and then "fill in" any missing rows with the other data-CodePudding

I have the following two arrays: Data and Baseline

Data
   Phase  ENSO  EJO
0      1    -1    2
1      1     0    2
2      1     1    2
3      2    -1    7
4      2     1    1

Baseline
   Phase  ENSO  EJO
0      1    -1    0.0
1      1     0    0.0
2      1     1    0.0
3      2    -1    0.0
4      2     0    0.0
5      2     1    0.0

I want to alter the 'Data' data frame such that the "missing row" gets filled in by the Baseline data. Final result would look like this

Data
   Phase  ENSO  EJO
0      1    -1    2
1      1     0    2
2      1     1    2
3      2    -1    7
4      2     0    0
5      2     1    1

CodePudding user response：

You could reindex Data by Baseline indices and fill the missing values with Baseline values using fillna.

Data = Data.reindex(Baseline.index).fillna(Baseline)

Another option is to use combine_first:

Data = Data.combine_first(Baseline)

Output:

   Phase  ENSO  EJO
0    1.0  -1.0  2.0
1    1.0   0.0  2.0
2    1.0   1.0  2.0
3    2.0  -1.0  7.0
4    2.0   1.0  1.0
5    2.0   1.0  0.0

CodePudding user response：

enke's method is great - here's a solution that preserves the original datatypes:

Data = pd.concat([Data, Baseline[~Baseline.index.isin(Data.index)]])

Output:

>>> Data
   Phase  ENSO  EJO
0      1    -1  2.0
1      1     0  2.0
2      1     1  2.0
3      2    -1  7.0
4      2     1  1.0
5      2     1  0.0

CodePudding user response：

Here's an ultra-short solution:

>>> tmp = baseline.copy()
>>> tmp.update(data)
>>> tmp
   Phase  ENSO  EJO
0    1.0  -1.0  2.0
1    1.0   0.0  2.0
2    1.0   1.0  2.0
3    2.0  -1.0  7.0
4    2.0   1.0  1.0
5    2.0   1.0  0.0

Or (Python 3.8 )

>>> (tmp := baseline.copy()).update(data)
>>> tmp
   Phase  ENSO  EJO
0    1.0  -1.0  2.0
1    1.0   0.0  2.0
2    1.0   1.0  2.0
3    2.0  -1.0  7.0
4    2.0   1.0  1.0
5    2.0   1.0  0.0

Simplest:

>>> baseline.update(data)
>>> baseline
   Phase  ENSO  EJO
0    1.0  -1.0  2.0
1    1.0   0.0  2.0
2    1.0   1.0  2.0
3    2.0  -1.0  7.0
4    2.0   1.0  1.0
5    2.0   1.0  0.0

CodePudding user response：

I made the assumption that you want to decide which rows need to be filled based on the values of Phase and ENSO columns. Here is my solution based on that.

import os
import pandas as pd
import numpy as np
from io import StringIO

x_str = """Phase  ENSO  EJO
0      1    -1    2
1      1     0    2
2      1     1    2
3      2    -1    7
4      2     1    1"""

y_str = """Phase  ENSO  EJO
0      1    -1    0.0
1      1     0    0.0
2      1     1    0.0
3      2    -1    0.0
4      2     0    0.0
5      2     1    0.0"""

x = pd.read_csv(StringIO(x_str), sep="\s ")
y = pd.read_csv(StringIO(y_str), sep="\s ")

# Outer merge with indicator on
z = pd.merge(x, y, how='outer', on=['Phase', 'ENSO'], indicator=True)
rows_to_be_filled = z['_merge'].isin(['right_only'])
z.loc[rows_to_be_filled, 'EJO_x'] = z.loc[rows_to_be_filled, 'EJO_y']

# Cleanup
z = z.rename({'EJO_x': 'EJO'}, axis=1).drop(['EJO_y', '_merge'], axis=1)
z
#    Phase  ENSO  EJO
# 0      1    -1  2.0
# 1      1     0  2.0
# 2      1     1  2.0
# 3      2    -1  7.0
# 4      2     1  1.0
# 5      2     0  0.0