Home > Software design >  Selecting columns in pd.DataFrame based on group and defined by criteria in second DF
Selecting columns in pd.DataFrame based on group and defined by criteria in second DF

Time:09-03

I have two DFs, x and y. DF x contains a group variable and two value variables val_1 and val_2. Per group in x I want to select either val_1 or val_2 based on criteria in DF y. The result should be a new DF z with a group column and a single value column which contains either values from val_1 or val_2 .

I have a solution but I am new to Python and would like to know if the below code could be more concise.

import pandas as pd
import numpy as np

# define groups and number of observations
groups = ['a','b','c']
N_group = 5
L = len(groups)*N_group

# define DF x 
x = pd.DataFrame({'group': np.repeat(groups, N_group),
                  'val_1': np.random.uniform(size=L),
                  'val_2': np.random.uniform(low=10, high=20,size=L)})

# Citerum frame. For a particular group select either val_1 or val_2
y = pd.DataFrame({'group': groups,
                  'val_col' : ['val_1','val_2','val_1']})

# Using a for loop to subset x based on contents of each row in y.
# Could this be optimized?
z = []
for i, r in y.iterrows():
    z.append(pd.DataFrame({'group': r[0],
                           'val': x.loc[x['group'] == r[0], r[1]]
                           }))
z = pd.concat(z)
print(z)
>>>
   group        val
0      a   0.346833
1      a   0.575308
2      a   0.083667
3      a   0.878797
4      a   0.649935
5      b  17.904204
6      b  15.361618
7      b  15.837250
8      b  11.352091
9      b  11.390637
10     c   0.123949
11     c   0.022118
12     c   0.737024
13     c   0.230206
14     c   0.613382

CodePudding user response:

You can use a one liner like the following which returns selected value from dataframe x:

x.apply(lambda row: row[y[y["group"]==row["group"]]["val_col"].values[0]], axis=1)

Complete Solution

import pandas as pd
import numpy as np

# define groups and number of observations
groups = ['a','b','c']
N_group = 5
L = len(groups)*N_group

# define DF x 
x = pd.DataFrame({'group': np.repeat(groups, N_group),
                  'val_1': np.random.uniform(size=L),
                  'val_2': np.random.uniform(low=10, high=20,size=L)})

# Citerum frame. For a particular group select either val_1 or val_2
y = pd.DataFrame({'group': groups,
                  'val_col' : ['val_1','val_2','val_1']})

# Using a for loop to subset x based on contents of each row in y.
# Could this be optimized?
z = []
for i, r in y.iterrows():
    z.append(pd.DataFrame({'group': r[0],
                           'val': x.loc[x['group'] == r[0], r[1]]
                           }))
z = pd.concat(z)
print(z)

x["val"] = x.apply(lambda row: row[y[y["group"]==row["group"]]["val_col"].values[0]], axis=1)
print(x[["group", "val"]])

Explanation

y[y["group"]=="a"] returns pandas DataFrame:

group   val_col
----------------
a       val_1

Since there are unique group values in dataframe y, the above expression always returns one record.

y[y["group"]=="a"]["val_col"] returns values of column val_col as pandas Series: (val_1). Here there is only one value in the Series.

y[y["group"]=="a"]["val_col"].values converts above Series to numpy array: ["val_1"].

y[y["group"]=="a"]["val_col"].values[0] returns string value: "val_1".

row[y[y["group"]==row["group"]]["val_col"].values[0]] is eq to row["val_1"].

As apply() iterates through each row of dataframe x, it evaluates above expressions on that row and returns resulting row["val_1"] or row["val_2"] or row["val_3"] value.

x["val"] = x.apply() appends each of this returned value to a new column val in dataframe x. Column val has one-to-one relationship with each row of dataframe x.

Hope that helps!

  • Related