Let's say we have four columns: Column1, Column2, Column3, ind
import pandas as pd
tbl = {
'Column1':['Spark',10000,'Python','35days'],
'Column2' :[500,'PySpark',22000,30000],
'Column3':['30days','40days','35days','pandas'],
'ind':[1,2,1,3]
}
df = pd.DataFrame(tbl)
Does anyone know is there a way to add a new column without loop that will gather values from first 3 columns based on index stored in 'ind' column?
'Course':['Spark','PySpark','Python','pandas']
I've tried some combinations with iloc, lambda and apply but failed.
Expected output:
Column1 Column2 Column3 ind Course
0 Spark 500 30days 1 Spark
1 10000 PySpark 40days 2 PySpark
2 Python 22000 35days 1 Python
3 35days 30000 pandas 3 pandas
CodePudding user response:
IIUC, you can try apply
on rows
df['Course'] = df.apply(lambda row: row.iloc[row['ind']-1], axis=1)
Or you can try
df['Course'] = df.values[np.arange(len(df['ind'])), df['ind'].sub(1)]
print(df)
Column1 Column2 Column3 ind Course
0 Spark 500 30days 1 Spark
1 10000 PySpark 40days 2 PySpark
2 Python 22000 35days 1 Python
3 35days 30000 pandas 3 pandas
CodePudding user response:
For efficiency, use the official lookup by index/column labels:
import numpy as np
idx, cols = pd.factorize('Column' df['ind'].astype(str))
df['Course'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
output:
Column1 Column2 Column3 ind Course
0 Spark 500 30days 1 Spark
1 10000 PySpark 40days 2 PySpark
2 Python 22000 35days 1 Python
3 35days 30000 pandas 3 pandas
timings
performed on 40k rows
# lookup (this answer)
25.6 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# apply
722 ms ± 48.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)