Home > database >  Getting row/column name and integer-based index in a dataframe where a condition holds
Getting row/column name and integer-based index in a dataframe where a condition holds

Time:10-31

Given a dataframe df with a simple Index (not a MultiIndex) - that corresponds to a 2-D real matrix with names for rows and columns - and a boolean expression e over the elements in df, I would like to get:

  • the name and the integer-based index of the rows
  • the name and the integer-based index of the columns

of all the elements satisfying the expression e. The expression e is nothing special: I am interested in the rows/columns of the elements greater than a threshold.

After reading the documentation and plenty of questions and answers here, I wrote the code given below. It contains two solutions:

  • one based on numpy. Basically, I extract the numbers from the dataframe and treat them as a numpy array. This solution seems reasonable: given the basic nature of the task, the code is simple enough.
  • one based on methods provided by pandas. Even if pandas is designed for more complex scenarios than a simple matrix with numbers, this solution seems way too complex for what I am trying to accomplish.

set up the data

import numpy as np
import pandas as pd

n_rows, n_cols, v = 4, 5, 3

rows = [ "r"   str(i) for i in range(n_rows) ]
columns = [ "c"    str(i) for i in range(n_cols) ]
values = np.zeros( (n_rows, n_cols), dtype=int)

ii = np.random.randint(n_rows, size=(2,))
jj = np.random.randint(n_cols, size=(2,))

poss = zip(ii, jj)
for pos in poss:
    print(f"target set at {pos} -> ({rows[pos[0]]}, {columns[pos[1]]})")
    values[pos] = v   1

print(" === values ===")
print(values)

df = pd.DataFrame(values, index=rows, columns=columns)
print(" === df === ")
print(df)

with output:

target set at (2, 4) -> (r2, c4)
target set at (1, 0) -> (r1, c0)
 === values ===
[[0 0 0 0 0]
 [4 0 0 0 0]
 [0 0 0 0 4]
 [0 0 0 0 0]]
 === df === 
    c0  c1  c2  c3  c4
r0   0   0   0   0   0
r1   4   0   0   0   0
r2   0   0   0   0   4
r3   0   0   0   0   0

solution with numpy

print("\n === USING NUMPY ===")
data = df.to_numpy()
indexes = np.argwhere(data > v)
for ind in indexes:
    print(f"(numpy) target found at {ind} -> ({rows[ind[0]]}, {columns[ind[1]]})")

with output:

 === USING NUMPY ===
(numpy) target found at [1 0] -> (r1, c0)
(numpy) target found at [2 4] -> (r2, c4)

solution with pandas

print("\n === WITH PANDAS ===")

# select the rows with at least one column satisfying the condition
cond = (df > v).any(1)
df2 = df[cond]
print(df2, "\n")

# stack 
stacked = df2.stack()
print(stacked, "\n")

# filter (again!)
stacked2 = stacked.loc[stacked>v]
print("indexes in stacked:", stacked2.index.to_list(), "\n")

# get index (it is a MultiIndex at this point)
target_rows = [a for (a, _) in stacked2.index.to_list()]
target_cols = [b for (_, b) in stacked2.index.to_list()]

target_rows_idx = [df.index.get_loc(row_name) for row_name in target_rows]
target_cols_idx = [columns.index(col_name) for col_name in target_cols]

for pos in zip(target_rows_idx, target_cols_idx):
    print(f"(pandas) target found at {pos} -> ({rows[pos[0]]}, {columns[pos[1]]})")

with output:

 === WITH PANDAS ===
    c0  c1  c2  c3  c4
r1   4   0   0   0   0
r2   0   0   0   0   4 

r1  c0    4
    c1    0
    c2    0
    c3    0
    c4    0
r2  c0    0
    c1    0
    c2    0
    c3    0
    c4    4
dtype: int64 

indexes in stacked: [('r1', 'c0'), ('r2', 'c4')] 

(pandas) target found at (1, 0) -> (r1, c0)
(pandas) target found at (2, 4) -> (r2, c4)

Is there a simpler way to write the code using only pandas?

CodePudding user response:

Since stack drops NaN values by default, we could mask out the values first then stack (this avoids the need to filter twice). Then just grab the index and use get_loc on both index and columns to convert the labels to integer values:

stacked = df[df > v].stack()
label_idx = stacked.index.tolist()
integer_idx = [(df.index.get_loc(r), df.columns.get_loc(c))
               for r, c in label_idx]

for i, j in zip(integer_idx, label_idx):
    print(f'(pandas 2) target found at {i} -> {j}')

Output:

(pandas 2) target found at (0, 0) -> ('r0', 'c0')
(pandas 2) target found at (1, 4) -> ('r1', 'c4')

stacked:

r0  c0    4.0
r1  c4    4.0
dtype: float64

label_idx:

[('r0', 'c0'), ('r1', 'c4')]

integer_index:

[(0, 0), (1, 4)]

Reproducible with:

np.random.seed(22)

CodePudding user response:

I'd use pd.Series.iteritems():

>>> [x for x, y in df.gt(3).stack().iteritems() if y]
[('r1', 'c3'), ('r2', 'c3')]

For index:

>>> [(df.index.get_loc(a), df.columns.get_loc(b)) for (a, b), y in df.gt(3).stack().iteritems() if y]
[(1, 3), (2, 3)]
>>> 

df in this case:

>>> df
    c0  c1  c2  c3  c4
r0   0   0   0   0   0
r1   0   0   0   4   0
r2   0   0   0   4   0
r3   0   0   0   0   0
>>> 
  • Related