Use dataframe column containing "column name strings", to return values from dataframe bas-CodePudding

I have a dataframe as follows:

df=pandas.DataFrame()
df['A'] = numpy.random.random(10)
df['B'] = numpy.random.random(10)
df['C'] = numpy.random.random(10)
df['Col_name'] = numpy.random.choice(['A','B','C'],size=10)

I want to obtain an output that uses 'Col_name' and the respective index of the dataframe row to lookup the value in the dataframe. I can get the desired output this with .apply() follows:

df['output'] = df.apply(lambda x: x[ x['Col_name'] ], axis=1)

.apply() is slow over a large dataframe with it iterating row by row. Is there an obvious solution in pandas that is faster/vectorised?

CodePudding user response：

You can also pick each column name (or give list of possible names) and then apply it as mask to filter your dataframe then pick values from desired column and assign them to all rows matching the mask. Then repeat this for another coulmn.

for column_name in df: #or: for column_name in ['A', 'B', 'C']
    df.loc[df['Col_name']==column_name, 'output'] = df[column_name]

Rows that will not match any mask will have NaN values.

PS. Accodring to my test with 10000000 random rows - method with .apply() takes 2min 24s to finish while my method takes only 4,3s.

CodePudding user response：

Use melt to flatten your dataframe and keep rows where Col_name equals to variable column:

df['output'] = df.melt('Col_name', ignore_index=False).query('Col_name == variable')['value']
print(df)

# Output
          A         B         C Col_name    output
0  0.202197  0.430735  0.093551        B  0.430735
1  0.344753  0.979453  0.999160        C  0.999160
2  0.500904  0.778715  0.074786        A  0.500904
3  0.050951  0.317732  0.363027        B  0.317732
4  0.722624  0.026065  0.424639        C  0.424639
5  0.578185  0.626698  0.376692        C  0.376692
6  0.540849  0.805722  0.528886        A  0.540849
7  0.918618  0.869893  0.825991        C  0.825991
8  0.688967  0.203809  0.734467        B  0.203809
9  0.811571  0.010081  0.372657        B  0.010081

Transformation after melt:

>>> df.melt('Col_name', ignore_index=False)
  Col_name variable     value
0        B        A  0.202197
1        C        A  0.344753
2        A        A  0.500904  # keep
3        B        A  0.050951
4        C        A  0.722624
5        C        A  0.578185
6        A        A  0.540849  # keep
7        C        A  0.918618
8        B        A  0.688967
9        B        A  0.811571
0        B        B  0.430735  # keep
1        C        B  0.979453
2        A        B  0.778715
3        B        B  0.317732  # keep
4        C        B  0.026065
5        C        B  0.626698
6        A        B  0.805722
7        C        B  0.869893
8        B        B  0.203809  # keep
9        B        B  0.010081  # keep
0        B        C  0.093551
1        C        C  0.999160  # keep
2        A        C  0.074786
3        B        C  0.363027
4        C        C  0.424639  # keep
5        C        C  0.376692  # keep
6        A        C  0.528886
7        C        C  0.825991  # keep
8        B        C  0.734467
9        B        C  0.372657

Update

Alternative with set_index and stack for @Rabinzel:

df['output'] = (
  df.set_index('Col_name', append=True).stack()
    .loc[lambda x: x.index.get_level_values(1) == x.index.get_level_values(2)]
    .droplevel([1, 2])
)
print(df)

# Output
          A         B         C Col_name    output
0  0.209953  0.332294  0.812476        C  0.812476
1  0.284225  0.566939  0.087084        A  0.284225
2  0.815874  0.185154  0.155454        A  0.815874
3  0.017548  0.733474  0.766972        A  0.017548
4  0.494323  0.433719  0.979399        C  0.979399
5  0.875071  0.789891  0.319870        B  0.789891
6  0.475554  0.229837  0.338032        B  0.229837
7  0.123904  0.397463  0.288614        C  0.288614
8  0.288249  0.631578  0.393521        A  0.288249
9  0.107245  0.006969  0.367748        C  0.367748

CodePudding user response：

import pandas as pd
import numpy as np

df=pd.DataFrame()
df['A'] = np.random.random(10)
df['B'] = np.random.random(10)
df['C'] = np.random.random(10)
df['Col_name'] = np.random.choice(['A','B','C'],size=10)
df["output"] = np.nan

Even though you do not like going row per row, I still routinely use loops to go through each row just to know where it breaks when it breaks. Here are two loops just to satisfy myself. The column is created ahead with na values becausethe loops needs it to be.

# each rows by index
for i in range(len(df)):
    df['output'][i] = df[df['Col_name'][i]][i]

# each rows but by column name
for col in list(df["Col_name"]):
    df.loc[:,'output'] = df.loc[:,col]

Here are some "non-loop" ways to do so.

df["output"] = df.lookup(df.index, df.Col_name)

df['output'] = np.where(np.isnan(df['output']), df[df['Col_name']], np.nan)