Consider the following dataframe:
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3, 4, 5],
                   'Col2': [6, 7, 8, 9, 10],
                   'Select': ["Col1", "Col1", "Col2", "Col2", "Col1"]})
   Col1  Col2 Select
0     1     6   Col1
1     2     7   Col1
2     3     8   Col2
3     4     9   Col2
4     5    10   Col1
The values in the column "Select" are names of other existing columns: it is a column of column names. I would like to create a new column whose values are taken from whichever column each row's "Select" entry names. So the output I would like is:
   Col1  Col2 Select  Values
0     1     6   Col1       1
1     2     7   Col1       2
2     3     8   Col2       8
3     4     9   Col2       9
4     5    10   Col1       5
So it maps a column of column names to a column of values.
CodePudding user response:
Note that df[df.Select]
will give you:
   Col1  Col1  Col2  Col2  Col1
0     1     1     6     6     1
1     2     2     7     7     2
2     3     3     8     8     3
3     4     4     9     9     4
4     5     5    10    10     5
The desired column is the diagonal of the dataframe above:
import numpy as np
df['Values'] = np.diag(df[df.Select])
Output:
   Col1  Col2 Select  Values
0     1     6   Col1       1
1     2     7   Col1       2
2     3     8   Col2       8
3     4     9   Col2       9
4     5    10   Col1       5
CodePudding user response:
You can first get the integer positions of the column names in Select, then index in the NumPy domain:
df["Values"] = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(df.Select)]
to get
>>> df
   Col1  Col2 Select  Values
0     1     6   Col1       1
1     2     7   Col1       2
2     3     8   Col2       8
3     4     9   Col2       9
4     5    10   Col1       5
The indexers were:
In [144]: np.arange(len(df))
Out[144]: array([0, 1, 2, 3, 4])
In [145]: df.columns.get_indexer(df.Select)
Out[145]: array([0, 0, 1, 1, 0], dtype=int64)
So we paired these and selected the [0, 0], [1, 0], [2, 1], ..., [4, 0] entries from the underlying NumPy array of the frame.
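Pairing a row-index array with a column-index array this way is NumPy's integer (fancy) indexing; a minimal standalone illustration with a small made-up array:

```python
import numpy as np

a = np.array([[1, 6],
              [2, 7],
              [3, 8]])
rows = np.array([0, 1, 2])   # one row index per picked entry
cols = np.array([0, 0, 1])   # one column index per picked entry

# a[rows, cols] selects a[0, 0], a[1, 0], a[2, 1]
print(a[rows, cols])  # [1 2 8]
```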
The now-deprecated df.lookup could also have been used:
In [143]: df.lookup(df.index, df.Select)
FutureWarning: The 'lookup' method is deprecated and will be
removed in a future version. You can use DataFrame.melt and DataFrame.loc
as a substitute.
Out[143]: array([1, 2, 8, 9, 5], dtype=int64)
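For completeness, a sketch of the melt/loc substitute that the warning points to. Note the ignore_index=False and sort_index steps, which are needed to keep the looked-up values aligned with the original rows (melt stacks the value columns block by block, so a plain filter would reorder them):

```python
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3, 4, 5],
                   'Col2': [6, 7, 8, 9, 10],
                   'Select': ["Col1", "Col1", "Col2", "Col2", "Col1"]})

# keep the original index through the melt so we can realign afterwards
m = df.melt('Select', ignore_index=False)
df['Values'] = m.loc[m['Select'] == m['variable'], 'value'].sort_index()
print(df['Values'].tolist())  # [1, 2, 8, 9, 5]
```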
Some shameless timings...
DF in question; it's 5 x 3:
In [146]: %timeit np.diag(df[df.Select])
425 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [147]: %timeit df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(df.Select)]
208 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
DF but repeated 1_000 times, i.e., it's 5_000 x 3:
In [157]: df = pd.concat([df] * 1_000, ignore_index=True)
In [158]: %timeit np.diag(df[df.Select])
67.8 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [159]: %timeit df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(df.Select)]
542 µs ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
DF but repeated 100_000 times, i.e., it's 500_000 x 3:
In [159]: df = pd.concat([df] * 100_000, ignore_index=True)
In [160]: %timeit np.diag(df[df.Select])
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
...
MemoryError: Unable to allocate 1.82 TiB for an array with shape (500000, 500000) and data type int64
In [155]: %timeit df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(df.Select)]
41.5 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is because the diag-based approach constructs a large intermediate n x n frame only to take its diagonal...
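The 1.82 TiB figure checks out: df[df.Select] materializes one column per row, i.e. an n x n block of int64 values, before np.diag discards everything off the diagonal:

```python
n = 500_000                 # number of rows, and thus selected columns
bytes_needed = n * n * 8    # n x n entries, 8 bytes per int64
print(bytes_needed / 2**40) # → ~1.82 (TiB)
```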
CodePudding user response:
The "official" way to do this indexing lookup is:
import numpy as np
idx, cols = pd.factorize(df['Select'])
df['Values'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
This accounts for missing columns.
Output:
   Col1  Col2 Select  Values
0     1     6   Col1       1
1     2     7   Col1       2
2     3     8   Col2       8
3     4     9   Col2       9
4     5    10   Col1       5
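To see what "accounts for missing columns" means, here is a small sketch with a hypothetical 'Col3' label that names no real column: reindex inserts a NaN column for it instead of raising a KeyError, so the lookup yields NaN for that row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3],
                   'Col2': [6, 7, 8],
                   'Select': ['Col1', 'Col3', 'Col2']})  # 'Col3' is hypothetical

idx, cols = pd.factorize(df['Select'])
# reindex adds an all-NaN 'Col3' column rather than failing
df['Values'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df['Values'].tolist())  # [1.0, nan, 8.0]
```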
CodePudding user response:
Use apply, looking each row's Select value up in that same row:
data="""Col1 Col2 Select
0 1 6 Col1
1 2 7 Col1
2 3 8 Col2
3 4 9 Col2
4 5 10 Col1"""
import io
import pandas as pd

df = pd.read_csv(io.StringIO(data), sep='\s+')
df["value"] = df.apply(lambda row: row[row["Select"]], axis=1)
print(df)
Output:
   Col1  Col2 Select  value
0     1     6   Col1      1
1     2     7   Col1      2
2     3     8   Col2      8
3     4     9   Col2      9
4     5    10   Col1      5
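Worth noting: apply with axis=1 is a Python-level loop over rows, so on large frames it will be far slower than the NumPy-indexing answers above. The same lookup spelled out as an explicit loop makes that clear:

```python
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3, 4, 5],
                   'Col2': [6, 7, 8, 9, 10],
                   'Select': ["Col1", "Col1", "Col2", "Col2", "Col1"]})

# equivalent to df.apply(lambda row: row[row["Select"]], axis=1):
# iterate row by row, indexing each row Series by its own Select label
df['value'] = [row[row['Select']] for _, row in df.iterrows()]
print(df['value'].tolist())  # [1, 2, 8, 9, 5]
```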