Given a dataframe df
with a simple Index
(not a MultiIndex
) - that corresponds to a 2-D real matrix with names for rows and columns - and a boolean expression e
over the elements in df
, I would like to get:
- the name and the integer-based index of the rows
- the name and the integer-based index of the columns
of all the elements satisfying the expression e
. The expression e
is nothing special: I am interested in the rows/columns of the elements greater than a threshold.
After reading the documentation and plenty of questions and answers here, I wrote the code given below. It contains two solutions:
- one based on
numpy
. Basically, I extract the numbers from the dataframe and treat them as anumpy
array. This solution seems reasonable: given the basic nature of the task, the code is simple enough. - one based on methods provided by
pandas
. Even ifpandas
is designed for more complex scenarios than a simple matrix with numbers, this solution seems way too complex for what I am trying to accomplish.
set up the data
import numpy as np
import pandas as pd
n_rows, n_cols, v = 4, 5, 3
rows = [ "r" str(i) for i in range(n_rows) ]
columns = [ "c" str(i) for i in range(n_cols) ]
values = np.zeros( (n_rows, n_cols), dtype=int)
ii = np.random.randint(n_rows, size=(2,))
jj = np.random.randint(n_cols, size=(2,))
poss = zip(ii, jj)
for pos in poss:
print(f"target set at {pos} -> ({rows[pos[0]]}, {columns[pos[1]]})")
values[pos] = v 1
print(" === values ===")
print(values)
df = pd.DataFrame(values, index=rows, columns=columns)
print(" === df === ")
print(df)
with output:
target set at (2, 4) -> (r2, c4)
target set at (1, 0) -> (r1, c0)
=== values ===
[[0 0 0 0 0]
[4 0 0 0 0]
[0 0 0 0 4]
[0 0 0 0 0]]
=== df ===
c0 c1 c2 c3 c4
r0 0 0 0 0 0
r1 4 0 0 0 0
r2 0 0 0 0 4
r3 0 0 0 0 0
solution with numpy
print("\n === USING NUMPY ===")
data = df.to_numpy()
indexes = np.argwhere(data > v)
for ind in indexes:
print(f"(numpy) target found at {ind} -> ({rows[ind[0]]}, {columns[ind[1]]})")
with output:
=== USING NUMPY ===
(numpy) target found at [1 0] -> (r1, c0)
(numpy) target found at [2 4] -> (r2, c4)
solution with pandas
print("\n === WITH PANDAS ===")
# select the rows with at least one column satisfying the condition
cond = (df > v).any(1)
df2 = df[cond]
print(df2, "\n")
# stack
stacked = df2.stack()
print(stacked, "\n")
# filter (again!)
stacked2 = stacked.loc[stacked>v]
print("indexes in stacked:", stacked2.index.to_list(), "\n")
# get index (it is a MultiIndex at this point)
target_rows = [a for (a, _) in stacked2.index.to_list()]
target_cols = [b for (_, b) in stacked2.index.to_list()]
target_rows_idx = [df.index.get_loc(row_name) for row_name in target_rows]
target_cols_idx = [columns.index(col_name) for col_name in target_cols]
for pos in zip(target_rows_idx, target_cols_idx):
print(f"(pandas) target found at {pos} -> ({rows[pos[0]]}, {columns[pos[1]]})")
with output:
=== WITH PANDAS ===
c0 c1 c2 c3 c4
r1 4 0 0 0 0
r2 0 0 0 0 4
r1 c0 4
c1 0
c2 0
c3 0
c4 0
r2 c0 0
c1 0
c2 0
c3 0
c4 4
dtype: int64
indexes in stacked: [('r1', 'c0'), ('r2', 'c4')]
(pandas) target found at (1, 0) -> (r1, c0)
(pandas) target found at (2, 4) -> (r2, c4)
Is there a simpler way to write the code using only pandas
?
CodePudding user response:
Since stack
drops NaN
values by default, we could mask out the values first then stack
(this avoids the need to filter twice). Then just grab the index
and use get_loc
on both index
and columns
to convert the labels to integer values:
stacked = df[df > v].stack()
label_idx = stacked.index.tolist()
integer_idx = [(df.index.get_loc(r), df.columns.get_loc(c))
for r, c in label_idx]
for i, j in zip(integer_idx, label_idx):
print(f'(pandas 2) target found at {i} -> {j}')
Output:
(pandas 2) target found at (0, 0) -> ('r0', 'c0')
(pandas 2) target found at (1, 4) -> ('r1', 'c4')
stacked
:
r0 c0 4.0
r1 c4 4.0
dtype: float64
label_idx
:
[('r0', 'c0'), ('r1', 'c4')]
integer_index
:
[(0, 0), (1, 4)]
Reproducible with:
np.random.seed(22)
CodePudding user response:
I'd use pd.Series.iteritems()
:
>>> [x for x, y in df.gt(3).stack().iteritems() if y]
[('r1', 'c3'), ('r2', 'c3')]
For index:
>>> [(df.index.get_loc(a), df.columns.get_loc(b)) for (a, b), y in df.gt(3).stack().iteritems() if y]
[(1, 3), (2, 3)]
>>>
df
in this case:
>>> df
c0 c1 c2 c3 c4
r0 0 0 0 0 0
r1 0 0 0 4 0
r2 0 0 0 4 0
r3 0 0 0 0 0
>>>