Python/Pandas: use one column's value as the suffix of the column name from which I want a value


I have a pandas DataFrame. From multiple columns in it, I need to select the value of just one into a single new column, chosen according to the ID (bar in this example) of that row. I need the fastest way to do this.

The DataFrame looks like this:

foo bar ID_A    ID_B    ID_C    ID_D    ID_E    ...
1   B   1.5     2.3     4.1     0.5     6.6     ...
2   E   3       4       5       6       7       ...
3   A   9       6       3       8       1       ...
4   C   13      5       88      9       0       ...
5   B   6       4       6       9       4       ...
...

An example of a way to do it (my fastest so far) is shown below; however, it is too slow for my purposes.

df.loc[df.bar=='A', 'baz'] = df.ID_A
df.loc[df.bar=='B', 'baz'] = df.ID_B
df.loc[df.bar=='C', 'baz'] = df.ID_C
df.loc[df.bar=='D', 'baz'] = df.ID_D
df.loc[df.bar=='E', 'baz'] = df.ID_E
df.loc[df.bar=='F', 'baz'] = df.ID_F
df.loc[df.bar=='G', 'baz'] = df.ID_G

The result should look like this (after dropping the used columns):

foo baz
1   2.3
2   7
3   9
4   88
5   4
...

I have tried .apply(), and it was very slow.
I also tried np.where(), which was still much slower than the example shown above (the example was about 1000% faster than np.where()).
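For reference, the kind of condition-by-condition mapping described above can also be written with np.select, which takes a list of conditions and a matching list of choices. This is a sketch on the example data, not the asker's actual np.where() attempt:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3, 4, 5],
                   'bar': ['B', 'E', 'A', 'C', 'B'],
                   'ID_A': [1.5, 3, 9, 13, 6],
                   'ID_B': [2.3, 4, 6, 5, 4],
                   'ID_C': [4.1, 5, 3, 88, 6],
                   'ID_D': [0.5, 6, 9, 9, 9],
                   'ID_E': [6.6, 7, 1, 0, 4]})

ids = ['A', 'B', 'C', 'D', 'E']
# one boolean mask per ID, and the column to take values from when it matches
conditions = [df['bar'].eq(i) for i in ids]
choices = [df[f'ID_{i}'] for i in ids]
df['baz'] = np.select(conditions, choices)
```

Like the .loc chain, this still builds one mask per ID, so it is unlikely to beat the factorize-based answers below on many columns.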

Would appreciate recommendations! Many thanks

EDIT: after the first few answers, I think I need to add this:
"While I would appreciate runtime estimates relative to the example, I know it's a small example, so that may be tricky. My actual data has 280,000 rows and an extra 50 columns (which I need to keep along with foo and baz). I have to reduce 13 columns to a single column as in the example. Speed is the only reason for asking, and no response so far has mentioned speed. Thanks again!"

CodePudding user response:

You can use a variant of the indexing lookup:

import numpy as np
import pandas as pd

idx, cols = pd.factorize('ID_' + df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})

output:

   foo   baz
0    1   2.3
1    2   7.0
2    3   9.0
3    4  88.0
4    5   4.0
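To see why this works: pd.factorize encodes each 'ID_' + bar label as an integer code into the array of unique labels, and the NumPy fancy-indexing step then picks exactly one cell per row. A minimal sketch on the example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3, 4, 5],
                   'bar': ['B', 'E', 'A', 'C', 'B'],
                   'ID_A': [1.5, 3, 9, 13, 6],
                   'ID_B': [2.3, 4, 6, 5, 4],
                   'ID_C': [4.1, 5, 3, 88, 6],
                   'ID_D': [0.5, 6, 9, 9, 9],
                   'ID_E': [6.6, 7, 1, 0, 4]})

# factorize returns integer codes plus the unique labels, in order of
# first appearance: here idx is [0, 1, 2, 3, 0] and cols is
# ['ID_B', 'ID_E', 'ID_A', 'ID_C']
idx, cols = pd.factorize('ID_' + df['bar'])

# reindex reorders the columns to match `cols`; indexing the resulting
# 2-D array with (row number, code) selects one cell per row
baz = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
```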

testing speed

Setting up a test dataset (280k rows, 52 ID columns):

from string import ascii_uppercase, ascii_lowercase

letters = list(ascii_lowercase + ascii_uppercase)
N = 280_000

np.random.seed(0)

df = (pd.DataFrame({'foo': np.arange(1, N + 1),
                    'bar': np.random.choice(letters, size=N)})
        .join(pd.DataFrame(np.random.random(size=(N, len(letters))),
                           columns=[f'ID_{l}' for l in letters]
                          ))
     )

speed testing:

%%timeit
idx, cols = pd.factorize('ID_' + df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})

timing:

54.4 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response:

You can try this. It generalizes to an arbitrary number of columns.

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 'B', 1.5, 2.3, 4.1, 0.5, 6.6],
      [2, 'E', 3, 4, 5, 6, 7],
      [3, 'A', 9, 6, 3, 8, 1],
      [4, 'C', 13, 5, 88, 9, 0],
      [5, 'B', 6, 4, 6, 9, 4]])

df.columns = ['foo', 'bar', 'ID_A', 'ID_B', 'ID_C', 'ID_D', 'ID_E']

for val in np.unique(df['bar'].values):
    df.loc[df.bar==val, 'baz'] = df[f'ID_{val}']
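Since the question wants the result after dropping the used columns, the loop can be followed by a simple column selection. A self-contained sketch of the same approach on the example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 'B', 1.5, 2.3, 4.1, 0.5, 6.6],
                   [2, 'E', 3, 4, 5, 6, 7],
                   [3, 'A', 9, 6, 3, 8, 1],
                   [4, 'C', 13, 5, 88, 9, 0],
                   [5, 'B', 6, 4, 6, 9, 4]],
                  columns=['foo', 'bar', 'ID_A', 'ID_B', 'ID_C', 'ID_D', 'ID_E'])

# one masked assignment per distinct ID actually present in `bar`
for val in np.unique(df['bar'].values):
    df.loc[df.bar == val, 'baz'] = df[f'ID_{val}']

# keep only the columns the question asks for
out = df[['foo', 'baz']]
```

Note that this is still one masked assignment per distinct ID, so it scales with the number of IDs rather than being a single vectorized lookup.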

CodePudding user response:

To show an alternative approach, you can perform a combination of melting your data and reindexing. In this case I used wide_to_long (instead of melt/stack) because of the patterned nature of your column names:

out = (
    pd.wide_to_long(
        df, stubnames=['ID'], i=['foo', 'bar'], j='', sep='_', suffix=r'\w+'
    )
    .loc[lambda d: 
        d.index.get_level_values('bar') == d.index.get_level_values(level=-1),
        'ID'
    ]
    .droplevel(-1)
    .rename('baz')
    .reset_index()
)

print(out)
   foo bar   baz
0    1   B   2.3
1    2   E   7.0
2    3   A   9.0
3    4   C  88.0
4    5   B   4.0

An alternative to the above leverages .melt and .query for a shorter syntax.

out = (
    df.melt(id_vars=['foo', 'bar'], var_name='id', value_name='baz')
    .assign(id=lambda d: d['id'].str.get(-1))
    .query('bar == id')
)

print(out)
    foo bar id   baz
2     3   A  A   9.0
5     1   B  B   2.3
9     5   B  B   4.0
13    4   C  C  88.0
21    2   E  E   7.0
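As the printout shows, the .melt result keeps the bar and id columns and is ordered by melted column rather than by row. If the original row order and only foo/baz are wanted, a sort and column selection restore the target shape. A sketch, repeating the example data, and assuming foo reflects the original row order:

```python
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3, 4, 5],
                   'bar': ['B', 'E', 'A', 'C', 'B'],
                   'ID_A': [1.5, 3, 9, 13, 6],
                   'ID_B': [2.3, 4, 6, 5, 4],
                   'ID_C': [4.1, 5, 3, 88, 6],
                   'ID_D': [0.5, 6, 9, 9, 9],
                   'ID_E': [6.6, 7, 1, 0, 4]})

out = (
    df.melt(id_vars=['foo', 'bar'], var_name='id', value_name='baz')
    .assign(id=lambda d: d['id'].str.get(-1))   # 'ID_A' -> 'A'
    .query('bar == id')                         # keep the matching row/column pairs
    .sort_values('foo')[['foo', 'baz']]         # restore row order, drop helpers
    .reset_index(drop=True)
)
```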
