Pandas: How to set values from another column based on conditions column-wise-CodePudding

I have a dataframe that has this shape:

col A	col B	col i	col j	col ..	col z
1	2	B	A	A	A
2	3	B	A	B	B

The first two columns have the values I want to populate with the rest of the columns. The rest of the colums (col i...col z) should be populated with values from col A and col B depending on the value of each one. For example, the first row of col i has a B, so it should be populated with the value of col B at the same row (2).

I tried different functions for this, apply, applymap, but none of them are suitable for this use case.

The remaining table in this example should look like this:

col A	col B	col i	col j	col ..	col z
1	2	2	1	1	1
2	3	3	2	3	3

CodePudding user response：

Use indexing and map to replace letters:

df.iloc[:, 2:] = df.apply(lambda x: x[2:].map(x[:2]), axis=1)
print(df)

# Output:
   A  B  i  j  y  z
0  1  2  2  1  1  1
1  2  3  3  2  3  3

Setup:

df = pd.DataFrame({'A': [1, 2], 'B': [2, 3], 'i': ['B', 'B'],
                   'j': ['A', 'A'], 'y': ['A', 'B'], 'z': ['A', 'B']})
print(df)

# Output:
   A  B  i  j  y  z
0  1  2  B  A  A  A
1  2  3  B  A  B  B

Details:

For each row, apply the following function over index axis so x contains the whole row at each iteration:

Map the value from the third column (x[2:] <- i, j, y, z) to the index from the two first columns (x[:2] <- A, B) like a dictionary (a Series can act as dictionary, check the map method)

For the first iteration:

A    1  # <- index A
B    2  # <- index B
i    B  # <- value B
j    A  # <- value A
y    A  # <- value A
z    A  # <- value A
Name: 0, dtype: object

CodePudding user response：

Assuming that the values can be only one of two choices ('A' and 'B') a vectorized approach is to use numpy.where and take advantage of numpy broadcasting. This is should be faster than using apply for large DataFrames.

import numpy as np

df.iloc[:, 2:] = np.where(df.iloc[:, 2:] == 'A', df[['col A']], df[['col B']])

df.iloc[:,2:] = updates all the columns other than the first two. Note that the double brackets in df[['col A']] and df[['col B']] are needed so the results can be correctly broadcasted.

Output:

>>> df.iloc[:, 2:] = np.where(df.iloc[:, 2:] == 'A', df[['col A']], df[['col B']])
>>> df 

   col A  col B  col i  col j  col ..  col z
0      1      2      2      1       1      1
1      2      3      3      2       3      3