Pick a column value based on column index stored in another column (Pandas)-CodePudding

Let's say we have four columns: Column1, Column2, Column3, ind

import pandas as pd


tbl = {
        'Column1':['Spark',10000,'Python','35days'],
        'Column2' :[500,'PySpark',22000,30000],
        'Column3':['30days','40days','35days','pandas'],
        'ind':[1,2,1,3]
        }
df = pd.DataFrame(tbl)

Does anyone know is there a way to add a new column without loop that will gather values from first 3 columns based on index stored in 'ind' column?

'Course':['Spark','PySpark','Python','pandas']

I've tried some combinations with iloc, lambda and apply but failed.

Expected output:

  Column1  Column2 Column3  ind   Course
0   Spark      500  30days    1    Spark
1   10000  PySpark  40days    2  PySpark
2  Python    22000  35days    1   Python
3  35days    30000  pandas    3   pandas

CodePudding user response：

IIUC, you can try apply on rows

df['Course'] = df.apply(lambda row: row.iloc[row['ind']-1], axis=1)

Or you can try

df['Course'] = df.values[np.arange(len(df['ind'])), df['ind'].sub(1)]

print(df)

  Column1  Column2 Column3  ind   Course
0   Spark      500  30days    1    Spark
1   10000  PySpark  40days    2  PySpark
2  Python    22000  35days    1   Python
3  35days    30000  pandas    3   pandas

CodePudding user response：

For efficiency, use the official lookup by index/column labels:

import numpy as np
idx, cols = pd.factorize('Column' df['ind'].astype(str))
df['Course'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

output:

  Column1  Column2 Column3  ind   Course
0   Spark      500  30days    1    Spark
1   10000  PySpark  40days    2  PySpark
2  Python    22000  35days    1   Python
3  35days    30000  pandas    3   pandas

timings

performed on 40k rows

# lookup (this answer)
25.6 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# apply
722 ms ± 48.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)