Cell reference of pandas data frame, filtering based on specific value

I need to get the cell reference (row#, col#) of every cell in my pandas DataFrame that contains the value 1.

import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [np.nan, 1, np.nan, np.nan, 1],
                   'y': [np.nan, np.nan, np.nan, np.nan, 1],
                   'z': [1, np.nan, 1, np.nan, np.nan]})

I'm looking to get a final DataFrame with two columns, row and col, that lists the position of each of those cells.

CodePudding user response：

You can use numpy.argwhere.

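This should be considerably faster than solutions that use a for loop or df.stack. Note that notnull() flags every non-missing cell, which coincides with the cells equal to 1 here only because the frame holds nothing but 1 and NaN. Please see the timings below: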

In [145]: import numpy as np

In [146]: res = pd.DataFrame(np.argwhere(df.notnull().values).tolist(), columns=['row', 'col'])

In [147]: res.col = res.col + 1

In [148]: res
Out[148]: 
   row  col
0    0    3
1    1    1
2    2    3
3    4    1
4    4    2
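
The session above tests for non-missing cells rather than for the value 1 itself. As a minimal sketch, assuming the goal is an explicit equality test (the name mask is only illustrative), the same np.argwhere idea can be written as:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [np.nan, 1, np.nan, np.nan, 1],
                   'y': [np.nan, np.nan, np.nan, np.nan, 1],
                   'z': [1, np.nan, 1, np.nan, np.nan]})

# Boolean array: True only where a cell equals 1 (NaN compares as False).
mask = df.eq(1).to_numpy()

# np.argwhere returns one (row, col) pair per True cell, in row-major order.
res = pd.DataFrame(np.argwhere(mask), columns=['row', 'col'])
res['col'] += 1  # switch to 1-based column numbers, as in the answer above

print(res)
```

On this sample frame the result should match the output shown above, and it would stay correct if the frame also held values other than 1.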

Timings:

np.argwhere:

In [149]: %timeit pd.DataFrame(np.argwhere(df.notnull().values).tolist(), columns=['row', 'col'])
437 µs ± 4.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@sophocles's solution using df.stack:

In [151]: %timeit pd.DataFrame(df[df.notna()].stack().index.tolist(),columns=['row','col'])
1.33 ms ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
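
The timings above were taken on the 5x3 sample frame, where fixed overhead dominates. A rough sketch of how the comparison could be repeated on a larger frame follows; the frame size, the 10% fill rate, and the names big, with_argwhere, and with_stack are all assumptions made for illustration, and no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger test frame: 10,000 x 20, roughly 10% of cells set to 1.
rng = np.random.default_rng(0)
values = np.where(rng.random((10_000, 20)) < 0.1, 1.0, np.nan)
big = pd.DataFrame(values)


def with_argwhere(frame):
    # Positions of non-missing cells via NumPy (0-based row and column).
    return pd.DataFrame(np.argwhere(frame.notna().to_numpy()),
                        columns=['row', 'col'])


def with_stack(frame):
    # Same positions via stack(); dropna() makes the NaN removal explicit.
    stacked = frame.stack().dropna()
    return pd.DataFrame(stacked.index.tolist(), columns=['row', 'col'])


for fn in (with_argwhere, with_stack):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.2f} ms per call")
```

Because big uses the default integer column labels, both helpers report 0-based column positions, so their outputs can be compared directly.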

CodePudding user response：

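You can use notna() to get a boolean DataFrame back. Indexing df with it keeps the non-missing cells and leaves NaN everywhere else; stack() then drops those NaN entries (its default behavior in pandas 1.x and 2.x), leaving a Series indexed by (row, column) pairs. Grab that index and convert it to a list so that you can easily turn it into a DataFrame.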

Wrapping that in a pd.DataFrame() with correct column names gives you what you need:

df.columns=[1,2,3]
pd.DataFrame(df[df.notna()].stack().index.tolist(),columns=['row','col'])

   row  col
0    0    3
1    1    1
2    2    3
3    4    1
4    4    2
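
The snippet above renames the columns of df in place and relies on stack() discarding NaN. A hedged variant that leaves df untouched and stacks a boolean mask instead might look like this; the names mask and pairs are illustrative, and translating labels with df.columns.get_indexer is my substitution rather than part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [np.nan, 1, np.nan, np.nan, 1],
                   'y': [np.nan, np.nan, np.nan, np.nan, 1],
                   'z': [1, np.nan, 1, np.nan, np.nan]})

# Stack a boolean mask: a Series indexed by (row label, column label).
mask = df.eq(1).stack()

# Keep only the True entries; what remains is a MultiIndex of matching cells.
pairs = mask[mask].index

res = pd.DataFrame({
    'row': pairs.get_level_values(0),
    # Translate column labels ('x', 'y', 'z') into 1-based positions.
    'col': df.columns.get_indexer(pairs.get_level_values(1)) + 1,
})

print(res)
```

Since the mask contains no missing values, this does not depend on how stack() treats NaN.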

CodePudding user response：

You can just iterate over rows and columns:

rows = []

for i in range(len(df)):
  for j in range(len(df.columns)):
    if df.iloc[i, j] == 1:
      # DataFrame.append was removed in pandas 2.0, so collect the
      # matches in a list and build the DataFrame once at the end.
      rows.append({'row': i, 'col': j + 1})

res_df = pd.DataFrame(rows, columns=['row', 'col'])
print(res_df.sort_values(by='col').reset_index(drop=True))

CodePudding user response：

You can try this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [np.nan, 1, np.nan, np.nan, 1],
                   'y': [np.nan, np.nan, np.nan, np.nan, 1],
                   'z': [1, np.nan, 1, np.nan, np.nan]})

list_indexes = []
for idx in range(len(df.columns)):
    rows = df.index[df.iloc[:, idx] == 1].tolist()
    for row in rows:
        list_indexes.append((row, idx + 1))

final = pd.DataFrame(list_indexes, columns=['row', 'column'])
print(final)

CodePudding user response：

You can do this:

df.columns=list(range(1,len(df.columns) + 1))

     1    2    3
0  NaN  NaN  1.0
1  1.0  NaN  NaN
2  NaN  NaN  1.0
3  NaN  NaN  NaN
4  1.0  1.0  NaN

new_df = df.stack().reset_index().rename(columns = {'level_0':'row', 'level_1':'col'})[['row', 'col']]

   row  col
0    0    3
1    1    1
2    2    3
3    4    1
4    4    2