I have two DFs, x and y. DF x contains a group
variable and two value variables val_1
and val_2
. Per group in x I want to select either val_1
or val_2
based on criteria in DF y. The result should be a new DF z with a group
column and a single value column which contains either values from val_1
or val_2
.
I have a solution but I am new to Python and would like to know if the below code could be more concise.
import pandas as pd
import numpy as np
# define groups and number of observations
groups = ['a','b','c']
N_group = 5
L = len(groups)*N_group
# define DF x
x = pd.DataFrame({'group': np.repeat(groups, N_group),
'val_1': np.random.uniform(size=L),
'val_2': np.random.uniform(low=10, high=20,size=L)})
# Citerum frame. For a particular group select either val_1 or val_2
y = pd.DataFrame({'group': groups,
'val_col' : ['val_1','val_2','val_1']})
# Using a for loop to subset x based on contents of each row in y.
# Could this be optimized?
z = []
for i, r in y.iterrows():
z.append(pd.DataFrame({'group': r[0],
'val': x.loc[x['group'] == r[0], r[1]]
}))
z = pd.concat(z)
print(z)
>>>
group val
0 a 0.346833
1 a 0.575308
2 a 0.083667
3 a 0.878797
4 a 0.649935
5 b 17.904204
6 b 15.361618
7 b 15.837250
8 b 11.352091
9 b 11.390637
10 c 0.123949
11 c 0.022118
12 c 0.737024
13 c 0.230206
14 c 0.613382
CodePudding user response:
You can use a one liner like the following which returns selected value from dataframe x:
x.apply(lambda row: row[y[y["group"]==row["group"]]["val_col"].values[0]], axis=1)
Complete Solution
import pandas as pd
import numpy as np
# define groups and number of observations
groups = ['a','b','c']
N_group = 5
L = len(groups)*N_group
# define DF x
x = pd.DataFrame({'group': np.repeat(groups, N_group),
'val_1': np.random.uniform(size=L),
'val_2': np.random.uniform(low=10, high=20,size=L)})
# Citerum frame. For a particular group select either val_1 or val_2
y = pd.DataFrame({'group': groups,
'val_col' : ['val_1','val_2','val_1']})
# Using a for loop to subset x based on contents of each row in y.
# Could this be optimized?
z = []
for i, r in y.iterrows():
z.append(pd.DataFrame({'group': r[0],
'val': x.loc[x['group'] == r[0], r[1]]
}))
z = pd.concat(z)
print(z)
x["val"] = x.apply(lambda row: row[y[y["group"]==row["group"]]["val_col"].values[0]], axis=1)
print(x[["group", "val"]])
Explanation
y[y["group"]=="a"]
returns pandas DataFrame:
group val_col
----------------
a val_1
Since there are unique group
values in dataframe y, the above expression always returns one record.
y[y["group"]=="a"]["val_col"]
returns values of column val_col
as pandas Series: (val_1)
. Here there is only one value in the Series.
y[y["group"]=="a"]["val_col"].values
converts above Series to numpy array: ["val_1"]
.
y[y["group"]=="a"]["val_col"].values[0]
returns string value: "val_1"
.
row[y[y["group"]==row["group"]]["val_col"].values[0]]
is eq to row["val_1"]
.
As apply()
iterates through each row of dataframe x, it evaluates above expressions on that row and returns resulting row["val_1"]
or row["val_2"]
or row["val_3"]
value.
x["val"] = x.apply()
appends each of this returned value to a new column val
in dataframe x. Column val
has one-to-one relationship with each row of dataframe x.
Hope that helps!