Consider the following dataframe:
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3, 4, 5],
                   'Col2': [6, 7, 8, 9, 10],
                   'Select': ["Col1", "Col1", "Col2", "Col2", "Col1"]})
   Col1  Col2 Select
0     1     6   Col1
1     2     7   Col1
2     3     8   Col2
3     4     9   Col2
4     5    10   Col1
The values in the column "Select" are names of other existing columns: it is a column of column names. I would like to create a new column whose values are taken from whichever column each row's "Select" entry names. So the output I would like is:
   Col1  Col2 Select  Values
0     1     6   Col1       1
1     2     7   Col1       2
2     3     8   Col2       8
3     4     9   Col2       9
4     5    10   Col1       5
So it maps a column of column names to a column of values.
CodePudding user response:
Note that df[df.Select]
will give you:
   Col1  Col1  Col2  Col2  Col1
0     1     1     6     6     1
1     2     2     7     7     2
2     3     3     8     8     3
3     4     4     9     9     4
4     5     5    10    10     5
The desired column is the diagonal of the dataframe above:
import numpy as np
df['Values'] = np.diag(df[df.Select])
Output:
   Col1  Col2 Select  Values
0     1     6   Col1       1
1     2     7   Col1       2
2     3     8   Col2       8
3     4     9   Col2       9
4     5    10   Col1       5
CodePudding user response:
You can first get the integer positions of the column names in Select, then index in the NumPy domain:
df["Values"] = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(df.Select)]
to get
>>> df
   Col1  Col2 Select  Values
0     1     6   Col1       1
1     2     7   Col1       2
2     3     8   Col2       8
3     4     9   Col2       9
4     5    10   Col1       5
The indexers were:
In [144]: np.arange(len(df))
Out[144]: array([0, 1, 2, 3, 4])
In [145]: df.columns.get_indexer(df.Select)
Out[145]: array([0, 0, 1, 1, 0], dtype=int64)
So we paired these and selected the [0, 0], [1, 0], [2, 1], ..., [4, 0] entries from the underlying NumPy array of the frame.
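Pairing a row-index array with a column-index array this way is NumPy's integer (fancy) indexing; a minimal standalone illustration with a small made-up array:

```python
import numpy as np

a = np.array([[1, 6],
              [2, 7],
              [3, 8]])
rows = np.array([0, 1, 2])   # one row index per picked entry
cols = np.array([0, 0, 1])   # one column index per picked entry

# a[rows, cols] selects a[0, 0], a[1, 0], a[2, 1]
print(a[rows, cols])  # [1 2 8]
```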
The now-deprecated df.lookup could also have been used:
In [143]: df.lookup(df.index, df.Select)
FutureWarning: The 'lookup' method is deprecated and will be
removed in a future version. You can use DataFrame.melt and DataFrame.loc
as a substitute.
Out[143]: array([1, 2, 8, 9, 5], dtype=int64)
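For completeness, a sketch of the melt/loc substitute that the warning points to. Note the ignore_index=False and sort_index steps, which are needed to keep the looked-up values aligned with the original rows (melt stacks the value columns block by block, so a plain filter would reorder them):

```python
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3, 4, 5],
                   'Col2': [6, 7, 8, 9, 10],
                   'Select': ["Col1", "Col1", "Col2", "Col2", "Col1"]})

# keep the original index through the melt so we can realign afterwards
m = df.melt('Select', ignore_index=False)
df['Values'] = m.loc[m['Select'] == m['variable'], 'value'].sort_index()
print(df['Values'].tolist())  # [1, 2, 8, 9, 5]
```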
Some shameless timings...
DF in question; it's 5 x 3:
In [146]: %timeit np.diag(df[df.Select])
425 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [147]: %timeit df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(df.Select)]
208 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
DF but repeated 1_000 times, i.e., it's 5_000 x 3:
In [157]: df = pd.concat([df] * 1_000, ignore_index=True)
In [158]: %timeit np.diag(df[df.Select])
67.8 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [159]: %timeit df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(df.Select)]
542 µs ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
DF but repeated 100_000 times, i.e., it's 500_000 x 3:
In [159]: df = pd.concat([df] * 100_000, ignore_index=True)
In [160]: %timeit np.diag(df[df.Select])
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
...
MemoryError: Unable to allocate 1.82 TiB for an array with shape (500000, 500000) and data type int64
In [155]: %timeit df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(df.Select)]
41.5 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is because the diag-based approach constructs a large intermediate n x n frame only to take its diagonal...
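The 1.82 TiB figure checks out: df[df.Select] materializes one column per row, i.e. an n x n block of int64 values, before np.diag discards everything off the diagonal:

```python
n = 500_000                 # number of rows, and thus selected columns
bytes_needed = n * n * 8    # n x n entries, 8 bytes per int64
print(bytes_needed / 2**40) # → ~1.82 (TiB)
```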
CodePudding user response:
The "official" way to do this indexing lookup is:
import numpy as np
idx, cols = pd.factorize(df['Select'])
df['Values'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
This accounts for missing columns.
Output:
   Col1  Col2 Select  Values
0     1     6   Col1       1
1     2     7   Col1       2
2     3     8   Col2       8
3     4     9   Col2       9
4     5    10   Col1       5
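To see what "accounts for missing columns" means, here is a small sketch with a hypothetical 'Col3' label that names no real column: reindex inserts a NaN column for it instead of raising a KeyError, so the lookup yields NaN for that row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3],
                   'Col2': [6, 7, 8],
                   'Select': ['Col1', 'Col3', 'Col2']})  # 'Col3' is hypothetical

idx, cols = pd.factorize(df['Select'])
# reindex adds an all-NaN 'Col3' column rather than failing
df['Values'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df['Values'].tolist())  # [1.0, nan, 8.0]
```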
CodePudding user response:
Use apply, looking each row's Select value up in that same row:
data="""Col1 Col2 Select
0 1 6 Col1
1 2 7 Col1
2 3 8 Col2
3 4 9 Col2
4 5 10 Col1"""
import io
import pandas as pd

df = pd.read_csv(io.StringIO(data), sep='\s+')
df["value"] = df.apply(lambda row: row[row["Select"]], axis=1)
print(df)
Output:
   Col1  Col2 Select  value
0     1     6   Col1      1
1     2     7   Col1      2
2     3     8   Col2      8
3     4     9   Col2      9
4     5    10   Col1      5
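Worth noting: apply with axis=1 is a Python-level loop over rows, so on large frames it will be far slower than the NumPy-indexing answers above. The same lookup spelled out as an explicit loop makes that clear:

```python
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3, 4, 5],
                   'Col2': [6, 7, 8, 9, 10],
                   'Select': ["Col1", "Col1", "Col2", "Col2", "Col1"]})

# equivalent to df.apply(lambda row: row[row["Select"]], axis=1):
# iterate row by row, indexing each row Series by its own Select label
df['value'] = [row[row['Select']] for _, row in df.iterrows()]
print(df['value'].tolist())  # [1, 2, 8, 9, 5]
```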