Pandas - Merge data frames based on conditions-CodePudding

I would like to merge n data frames based on certain variables (external to the data frame).

Let me clarify the problem referring to an example. We have two dataframes detailing the height and age of certain members of a population. On top, we are given one array per data frame, containing one value per property (so array length = number of columns with numerical value in the data frame). Consider the following two data frames

df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'], 
                    'Age': [3, 8, 4, 2, 5], 'Height': [7, 2, 1, 4, 9]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'D'], 
                    'Age': [4, 6, 4], 'Height': [3,9, 2]})

looking as

(  Name  Age  Height
 0    A    3       7
 1    B    8       2
 2    C    4       1
 3    D    2       4
 4    E    5       9,
   Name  Age  Height
 0    A    4       3
 1    B    6       9
 2    D    4       2)

As mentioned, we also have two arrays, say

array1 = np.array([ 1, 5])
array2 = np.array([2, 3])

To make the example concrete, let us say each array contains the year in which the property was measured.

The output should be constructed as follows:

if an individual appears only in one dataframe, its properties are taken from said dataframe
if an individual appears in more than one data frame, for each property take the values from the data frame whose associated array has the corresponding higher value. So, for property i, compare array1[[i]] and array2[[i]], and take property values from dataframe df1 if array1[[i]] > array2[[i]], and viceversa. In the context of the example, the rules are translated as, take the property which has been measured more recently, if more are available

The output given the example data frames should look like

    Name    Age Height
0   A         4      7
1   B         6      2
2   C         4      1
3   D         4      4
4   E         5      9

Indeed, for the first property "Age", as array1[[0]] < array2[[0]], values are taken from the second dataframe, for the available individuals (A, B, D). Remaining values come from the first dataframe. For the second property "Height", as as array1[[1]] > array2[[1]], values come from the first dataframe, which already describes all the individuals.

At the moment I have some sort of solution based on looping over properties, but it is silly convoluted, I am wondering if any Pandas expert out there could help me towards an elegant solution.

Thanks for your support.

CodePudding user response：

Your question is a bit confusing: array indexes start from 0 so I think in your example it should be [[0]] and [[1]] instead of [[1]] and [[2]].
You can first concatenate your dataframes to have all names listed, then loop over your columns and update the values where the corresponding array is greater (I added a Z row to df2 to show new rows are being added):

df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'], 
                    'Age': [3, 8, 4, 2, 5], 'Height': [7, 2, 1, 4, 9]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'D', 'Z'], 
                    'Age': [4, 6, 4, 8], 'Height': [3,9, 2, 7]})

array1 = np.array([ 1, 5])
array2 = np.array([2, 3])

df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)

df3 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
for i, col in enumerate(df1.columns):
    if array2[[i]] > array1[[i]]:
        df3[col].update(df2[col])

print(df3)

Note: You have to set `Name` as index in order to update the right rows

Output:

      Age  Height
Name             
A       4       7
B       6       2
C       4       1
D       4       4
E       5       9
Z       8       7

I you have more than two dataframes in a list, you'll have to store your arrays in a list as well and iterate over the dataframe list while keeping track of the highest array values in a new array.

Note: You have to set Name as index in order to update the right rows

Note: You have to set `Name` as index in order to update the right rows