Home > Enterprise >  How to merge two different size DataFrames in Pandas to update one dataframe depends on matching par
How to merge two different size DataFrames in Pandas to update one dataframe depends on matching par

Time:10-05

I am trying to code a task for work, so I made a simple case that simulates this task. I have two dataframes: data_1 and data_2, and I would like to replace some rows in data_1 with rows from data_2 by condition of matching values in a column named time.

Here is an example:

import numpy as np
import pandas as pd
a = {
    'time':[1,2,3,4,5,6],
    'column_1':[2,2,2,2,2,2],
    'column_2':[3,3,3,3,3,3]   
}
b = {
    'time':[3,4,5],
    'column_1':[0,0,0],
    'column_2':[0,0,0]    
}
data_1 = pd.DataFrame(a)
data_2 = pd.DataFrame(b)

In the result, I would like to get dataframe like this:

time   column_1   column_2
0   1   2   3
1   2   2   3
2   3   0   0
3   4   0   0
4   5   0   0
5   6   2   3

I tried merge and replace methods in Pandas, but it was not successful. I did mask boolean array:

time_1 = list(data_1['time'])
time_2 = list(data_2['time'])
mask_array = np.zeros(len(time_1),dtype = bool)
for i, item in enumerate(time_1):
    if item in time_2:
        mask_array[i] = True

and I received:

array([False, False,  True,  True,  True, False])

But I could not replace data_1 values with data_2 values. What did I do wrong? It does not seem like a difficult task, but I could not find anything useful and just do not know what to do. I do not have a lot of experience with pandas, so maybe I do not understand something.

CodePudding user response:

You can use .update() after setting index on time on both data_1a and data_1b, as follows:

data_1a = data_1.set_index('time')
data_1a.update(data_2.set_index('time'))
data_out = data_1a.reset_index()

.update() modifies in place using non-NA values from another DataFrame. Aligns on indices. Thus, when you set time as index on both data_1a and data_1b, .update() aligns on matching values in column time to perform the update of data_1 by corresponding values of data_2.

Data Setup:

a = {
    'time':[1,2,3,4,5,6],
    'column_1':[2,2,2,2,2,2],
    'column_2':[3,3,3,3,3,3]   
}
b = {
    'time':[3,4,5],
    'column_1':[0,0,0],
    'column_2':[0,0,0]    
}
data_1 = pd.DataFrame(a)
data_2 = pd.DataFrame(b)

Result:

print(data_out)

   time  column_1  column_2
0     1       2.0       3.0
1     2       2.0       3.0
2     3       0.0       0.0
3     4       0.0       0.0
4     5       0.0       0.0
5     6       2.0       3.0
  • Related