Generate an example dataframe
import random
import string
import numpy as np
df = pd.DataFrame(
columns=[random.choice(string.ascii_uppercase) for i in range(5)],
data=np.random.rand(10,5))
df
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
1 0.933778 0.393021 0.547383 0.469255 0.053089
2 0.994518 0.156547 0.917894 0.070152 0.201373
3 0.077694 0.685540 0.865004 0.830740 0.605135
4 0.760294 0.838441 0.905885 0.146982 0.157439
5 0.116676 0.340967 0.400340 0.293894 0.220995
6 0.632182 0.663218 0.479900 0.931314 0.003180
7 0.726736 0.276703 0.057806 0.624106 0.719631
8 0.677492 0.200079 0.374410 0.962232 0.915361
9 0.061653 0.984166 0.959516 0.261374 0.361677
Now I want to filter a dataframe using the values in the first column, but since I make heavy use of chaining (e.g. df.T.replace(0, np.nan).pipe(np.log2).mean(axis=1).fillna(0).pipe(func)
) I need a much more compact notation for the operation. Normally you'd do something like
df[df.iloc[:, 0] < 0.5]
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
3 0.077694 0.685540 0.865004 0.830740 0.605135
5 0.116676 0.340967 0.400340 0.293894 0.220995
9 0.061653 0.984166 0.959516 0.261374 0.361677
but the awkwardly redundant syntax is horrible for chaining. I want to replace it with a .query()
, and normally you'd use the column name like df.query('V < 0.5')
, but here I want to be able to query the table by column index number instead of by name. So in the example, I've deliberately randomized the column names. I also can not use the table name in the query like df.query('@df[0] < 0.5')
since in a long chain, the intermediate result has no name.
I'm hoping there is some syntax such as df.query('_[0] < 0.05')
where I can refer to the source table as some symbol _
.
CodePudding user response:
You can use lambda functions in loc
, which passes in the dataframe. You can then use iloc
for your positional indexing. So you could do:
df.loc[lambda x: x.iloc[:, 0] > 0.5]
This should work in a method chain.
CodePudding user response:
For a single column with index:
df.query(f"{df.columns[0]}<0.5")
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
3 0.077694 0.685540 0.865004 0.830740 0.605135
5 0.116676 0.340967 0.400340 0.293894 0.220995
9 0.061653 0.984166 0.959516 0.261374 0.361677
For multiple columns with index:
idx = [0,1]
col = df.columns[np.r_[idx]]
val = 0.5
query = ' and '.join([f"{i} < {val}" for i in col])
# V < 0.5 and O < 0.5
print(df.query(query))
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
5 0.116676 0.340967 0.400340 0.293894 0.220995
CodePudding user response:
You can using f-string notation in df.query:
df.query(f'{df.columns[0]} < .5')
Output:
J M O R N
3 0.114554 0.131948 0.650307 0.672486 0.688872
4 0.272368 0.745900 0.544068 0.504299 0.434122
6 0.418988 0.023691 0.450398 0.488476 0.787383
7 0.040440 0.220282 0.263902 0.660016 0.955950
Update using "walrus" operator in python 3.8
Let's try this:
((dfout := df.T.replace(0, np.nan).pipe(np.log2).mean(axis=1).fillna(0).to_frame(name='values'))
.query(f'{dfout.columns[0]} > -2'))
output:
values
N -1.356779
O -1.202353
M -1.591623
T -1.557801