I would like to merge n data frames based on certain variables (external to the data frame).
Let me clarify the problem referring to an example. We have two dataframes detailing the height and age of certain members of a population. On top, we are given one array per data frame, containing one value per property (so array length = number of columns with numerical value in the data frame). Consider the following two data frames
df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
'Age': [3, 8, 4, 2, 5], 'Height': [7, 2, 1, 4, 9]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'D'],
'Age': [4, 6, 4], 'Height': [3,9, 2]})
looking as
( Name Age Height
0 A 3 7
1 B 8 2
2 C 4 1
3 D 2 4
4 E 5 9,
Name Age Height
0 A 4 3
1 B 6 9
2 D 4 2)
As mentioned, we also have two arrays, say
array1 = np.array([ 1, 5])
array2 = np.array([2, 3])
To make the example concrete, let us say each array contains the year in which the property was measured.
The output should be constructed as follows:
- if an individual appears only in one dataframe, its properties are taken from said dataframe
- if an individual appears in more than one data frame, for each property take the values from the data frame whose associated array has the corresponding higher value. So, for property i, compare
array1[[i]]
andarray2[[i]]
, and take property values from dataframe df1 ifarray1[[i]] > array2[[i]]
, and viceversa. In the context of the example, the rules are translated as, take the property which has been measured more recently, if more are available
The output given the example data frames should look like
Name Age Height
0 A 4 7
1 B 6 2
2 C 4 1
3 D 4 4
4 E 5 9
Indeed, for the first property "Age", as array1[[0]] < array2[[0]]
, values are taken from the second dataframe, for the available individuals (A, B, D). Remaining values come from the first dataframe.
For the second property "Height", as as array1[[1]] > array2[[1]]
, values come from the first dataframe, which already describes all the individuals.
At the moment I have some sort of solution based on looping over properties, but it is silly convoluted, I am wondering if any Pandas expert out there could help me towards an elegant solution.
Thanks for your support.
CodePudding user response:
Your question is a bit confusing: array indexes start from 0 so I think in your example it should be [[0]]
and [[1]]
instead of [[1]]
and [[2]]
.
You can first concatenate your dataframes to have all names listed, then loop over your columns and update the values where the corresponding array is greater (I added a Z
row to df2
to show new rows are being added):
df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
'Age': [3, 8, 4, 2, 5], 'Height': [7, 2, 1, 4, 9]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'D', 'Z'],
'Age': [4, 6, 4, 8], 'Height': [3,9, 2, 7]})
array1 = np.array([ 1, 5])
array2 = np.array([2, 3])
df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)
df3 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
for i, col in enumerate(df1.columns):
if array2[[i]] > array1[[i]]:
df3[col].update(df2[col])
print(df3)
Note: You have to set Name
as index in order to update the right rows
Output:
Age Height
Name
A 4 7
B 6 2
C 4 1
D 4 4
E 5 9
Z 8 7
I you have more than two dataframes in a list, you'll have to store your arrays in a list as well and iterate over the dataframe list while keeping track of the highest array values in a new array.