I have a dataframe that has this shape:
col A | col B | col i | col j | col .. | col z |
---|---|---|---|---|---|
1 | 2 | B | A | A | A |
2 | 3 | B | A | B | B |
The first two columns have the values I want to populate with the rest of the columns. The rest of the colums (col i...col z) should be populated with values from col A and col B depending on the value of each one. For example, the first row of col i has a B, so it should be populated with the value of col B at the same row (2).
I tried different functions for this, apply, applymap, but none of them are suitable for this use case.
The remaining table in this example should look like this:
col A | col B | col i | col j | col .. | col z |
---|---|---|---|---|---|
1 | 2 | 2 | 1 | 1 | 1 |
2 | 3 | 3 | 2 | 3 | 3 |
CodePudding user response:
Use indexing and map
to replace letters:
df.iloc[:, 2:] = df.apply(lambda x: x[2:].map(x[:2]), axis=1)
print(df)
# Output:
A B i j y z
0 1 2 2 1 1 1
1 2 3 3 2 3 3
Setup:
df = pd.DataFrame({'A': [1, 2], 'B': [2, 3], 'i': ['B', 'B'],
'j': ['A', 'A'], 'y': ['A', 'B'], 'z': ['A', 'B']})
print(df)
# Output:
A B i j y z
0 1 2 B A A A
1 2 3 B A B B
Details:
For each row, apply the following function over index axis so x
contains the whole row at each iteration:
Map the value from the third column (x[2:]
<- i, j, y, z) to the index from the two first columns (x[:2]
<- A, B) like a dictionary (a Series
can act as dictionary, check the map
method)
For the first iteration:
A 1 # <- index A
B 2 # <- index B
i B # <- value B
j A # <- value A
y A # <- value A
z A # <- value A
Name: 0, dtype: object
CodePudding user response:
Assuming that the values can be only one of two choices ('A' and 'B') a vectorized approach is to use numpy.where
and take advantage of numpy broadcasting. This is should be faster than using apply
for large DataFrames.
import numpy as np
df.iloc[:, 2:] = np.where(df.iloc[:, 2:] == 'A', df[['col A']], df[['col B']])
df.iloc[:,2:] =
updates all the columns other than the first two. Note that the double brackets in df[['col A']]
and df[['col B']]
are needed so the results can be correctly broadcasted.
Output:
>>> df.iloc[:, 2:] = np.where(df.iloc[:, 2:] == 'A', df[['col A']], df[['col B']])
>>> df
col A col B col i col j col .. col z
0 1 2 2 1 1 1
1 2 3 3 2 3 3