I have a pandas DataFrame. For each row, I need to pick the value from just one of several columns and put it into a single new column, with the column chosen according to that row's ID (bar in this example).
I need the fastest way to do this.
Dataframe for application is like this:
foo  bar  ID_A  ID_B  ID_C  ID_D  ID_E  ...
  1    B   1.5   2.3   4.1   0.5   6.6  ...
  2    E   3     4     5     6     7    ...
  3    A   9     6     3     8     1    ...
  4    C  13     5    88     9     0    ...
  5    B   6     4     6     9     4    ...
...
The fastest way I have found so far is the following, but it is still too slow for my purposes:
df.loc[df.bar=='A', 'baz'] = df.ID_A
df.loc[df.bar=='B', 'baz'] = df.ID_B
df.loc[df.bar=='C', 'baz'] = df.ID_C
df.loc[df.bar=='D', 'baz'] = df.ID_D
df.loc[df.bar=='E', 'baz'] = df.ID_E
df.loc[df.bar=='F', 'baz'] = df.ID_F
df.loc[df.bar=='G', 'baz'] = df.ID_G
Result will be like this (after dropping used columns):
foo   baz
  1   2.3
  2   7
  3   9
  4  88
  5   4
...
I have tried .apply(), and it was very slow. I also tried np.where(), which was still much slower than the example shown above (the .loc version was about 1000% faster than np.where()).
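Roughly, those attempts looked like this (illustrative reconstructions, not the exact code):
# row-wise .apply(): one Python-level function call per row, hence slow
df['baz'] = df.apply(lambda row: row[f"ID_{row['bar']}"], axis=1)

# chained np.where(): evaluates every condition and column for all rows
df['baz'] = np.where(df.bar == 'A', df.ID_A,
            np.where(df.bar == 'B', df.ID_B,
            np.where(df.bar == 'C', df.ID_C,  # ... and so on for the rest
            np.nan)))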
Would appreciate recommendations! Many thanks
EDIT: after the first few answers, I think I need to add this:
"Whilst I would appreciate a runtime estimate relative to the example, I know it's a small example, so that may be tricky. My actual data has 280,000 rows and an extra 50 columns (which I need to keep along with foo and baz); I have to reduce 13 columns to the single column per the example. Speed is the only reason for asking, and there has been no mention of speed so far in the first few responses. Thanks again!"
CodePudding user response:
You can use a variant of the indexing lookup:
idx, cols = pd.factorize('ID_' + df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
   foo   baz
0    1   2.3
1    2   7.0
2    3   9.0
3    4  88.0
4    5   4.0
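To unpack how the lookup works on the sample data (the commented values are what the 5-row example produces):
# 'ID_' + df['bar'] gives ['ID_B', 'ID_E', 'ID_A', 'ID_C', 'ID_B']
idx, cols = pd.factorize('ID_' + df['bar'])
# idx  -> array([0, 1, 2, 3, 0])                   one integer code per row
# cols -> Index(['ID_B', 'ID_E', 'ID_A', 'ID_C'])  the unique column labels

# reindex(cols, axis=1) arranges the columns in cols order; the NumPy fancy
# indexing then picks, for each row i, the value in column idx[i]
vals = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]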
testing speed
Setting up a test dataset (280k rows, 52 ID columns):
import numpy as np
import pandas as pd
from string import ascii_uppercase, ascii_lowercase

letters = list(ascii_lowercase + ascii_uppercase)

N = 280_000
np.random.seed(0)
df = (pd.DataFrame({'foo': np.arange(1, N + 1),
                    'bar': np.random.choice(letters, size=N)})
      .join(pd.DataFrame(np.random.random(size=(N, len(letters))),
                         columns=[f'ID_{l}' for l in letters]
                        ))
     )
speed testing:
%%timeit
idx, cols = pd.factorize('ID_' + df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
54.4 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
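For comparison on the same test frame, you can time the question's per-column .loc approach yourself (a sketch; here it loops over all 52 letters):
%%timeit
tmp = df.copy()  # work on a copy so repeated timeit runs start from the same frame
for l in letters:
    tmp.loc[tmp.bar == l, 'baz'] = tmp[f'ID_{l}']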
CodePudding user response:
You can try this. It should generalize to an arbitrary number of columns.
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 'B', 1.5, 2.3, 4.1, 0.5, 6.6],
[2, 'E', 3, 4, 5, 6, 7],
[3, 'A', 9, 6, 3, 8, 1],
[4, 'C', 13, 5, 88, 9, 0],
[5, 'B', 6, 4, 6, 9, 4]])
df.columns = ['foo', 'bar', 'ID_A', 'ID_B', 'ID_C', 'ID_D', 'ID_E']
for val in np.unique(df['bar'].values):
    df.loc[df.bar==val, 'baz'] = df[f'ID_{val}']
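This is essentially the question's approach written as a loop, so its speed should be comparable. If some values of bar could lack a matching ID_ column, a small guard avoids a KeyError (a defensive variant, not needed for the sample data):
for val in df['bar'].unique():
    col = f'ID_{val}'
    if col in df.columns:  # skip bar values with no matching ID_ column
        df.loc[df.bar == val, 'baz'] = df[col]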
CodePudding user response:
To show an alternative approach, you can perform a combination of melting your data and reindexing. In this case I used wide_to_long (instead of melt/stack) because of the patterned nature of your column names:
out = (
    pd.wide_to_long(
        df, stubnames=['ID'], i=['foo', 'bar'], j='', sep='_', suffix=r'\w+'
    )
    .loc[lambda d:
        d.index.get_level_values('bar') == d.index.get_level_values(level=-1),
        'ID'
    ]
    .droplevel(-1)
    .rename('baz')
    .reset_index()
)
print(out)
   foo bar   baz
0    1   B   2.3
1    2   E   7.0
2    3   A   9.0
3    4   C  88.0
4    5   B   4.0
An alternative to the above leverages .melt and .query to shorten the code.
out = (
    df.melt(id_vars=['foo', 'bar'], var_name='id', value_name='baz')
      .assign(id=lambda d: d['id'].str.get(-1))
      .query('bar == id')
)
print(out)
    foo bar id   baz
2     3   A  A   9.0
5     1   B  B   2.3
9     5   B  B   4.0
13    4   C  C  88.0
21    2   E  E   7.0
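The melted result keeps the helper columns and the melted row order; to recover the requested foo/baz layout, a final cleanup could look like this (a sketch, assuming rows should be ordered by foo):
out = (out.sort_values('foo')       # restore the original row order
          .loc[:, ['foo', 'baz']]   # keep only the requested columns
          .reset_index(drop=True))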