How can I get the column based on given value of a pandas Dataframe faster?


I have a df that looks like this:

Column A  Column B  Column C  Column D
Cell 1    Cell 6    Cell 8    Cell 4
Cell 2    Cell 7    Cell 9    NaN
Cell 3    NaN       Cell 10   NaN
Cell 4    NaN       NaN       NaN
Cell 5    NaN       NaN       NaN

I'm trying to build a function such that the input is any value in the df, and the output is the column that value belongs to.

My idea would be to do something like

df[df == val].stack().index[0][1]

It does work, but it is a little too slow after a lot of iterations. If you know any other, faster way to get the same result, it would be much appreciated.

The df I am working with isn't much larger than this one; its shape is (21, 16). Maybe if I stored it as a dict or something it would be better.
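Something along these lines is what I have in mind (just a sketch; the dict would need rebuilding whenever the df changes, and if a value appears in more than one column, the last column scanned wins):

# Precompute value -> column once, then every lookup is O(1).
lookup = {val: col
          for col in df.columns
          for val in df[col].dropna()}

lookup.get('Cell 1', 'not exist')  # -> 'Column A'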

Please, any input would be appreciated.

CodePudding user response:

Use numpy.where to get the row and column indices of the matches, then pick the matching column names. The next with iter trick returns the first matched value, or a custom string as default if nothing matched - here 'not exist':

import numpy as np

r, c = np.where(df.to_numpy() == val)

# first matched column name, or 'not exist' if nothing matched
first_matched_col = next(iter(df.columns[c]), 'not exist')
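A quick self-contained check of my own (the frame here just recreates the sample df from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': ['Cell 1', 'Cell 2', 'Cell 3', 'Cell 4', 'Cell 5'],
    'Column B': ['Cell 6', 'Cell 7', None, None, None],
    'Column C': ['Cell 8', 'Cell 9', 'Cell 10', None, None],
    'Column D': ['Cell 4', None, None, None, None],
})

r, c = np.where(df.to_numpy() == 'Cell 9')
print(next(iter(df.columns[c]), 'not exist'))   # Column C

r, c = np.where(df.to_numpy() == 'Cell 99')
print(next(iter(df.columns[c]), 'not exist'))   # not exist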

Comparison on the small data sample - it is about 53 times faster than the original solution (the to_numpy() call matters, because comparing against the DataFrame directly goes through pandas' slower elementwise machinery):

In [138]: %%timeit
     ...: df[df == val].stack().index[0][1]
     ...: 
1.23 ms ± 88.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [139]: %%timeit 
     ...: r, c = np.where(df == val)
     ...: 
     ...: first_matched_col = next(iter(df.columns[c]), 'not exist')
     ...: 
128 µs ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [140]: %%timeit 
     ...: r, c = np.where(df.to_numpy() == val)
     ...: 
     ...: first_matched_col = next(iter(df.columns[c]), 'not exist')
     ...: 
     ...: 
23.1 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

To get all matched values, join the matched column names:

r, c = np.where(df.to_numpy() == val)
all_matched_col = ','.join(df.columns[c])
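With the sample df, for a value that appears in two columns (a quick check of my own; note that np.where scans in row-major order, so 'Cell 4' in Column D, row 0 is found before the one in Column A, row 3):

r, c = np.where(df.to_numpy() == 'Cell 4')
print(','.join(df.columns[c]))   # Column D,Column A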

CodePudding user response:

There is another alternative, but it is not as quick as the solution provided by jezrael.
You can use the df.eq() function in combination with any() to find the columns that contain the value: eq() compares every cell against the value, and any() reduces each column to True if it contains at least one match. You can then filter the column index to keep only those columns.

Here is an example of how that would work:

Code:

import pandas as pd

df = pd.DataFrame({
    'Column A': ['Cell 1', 'Cell 2', 'Cell 3', 'Cell 4', 'Cell 5'],
    'Column B': ['Cell 6', 'Cell 7',  None, None, None],
    'Column C': ['Cell 8', 'Cell 9', 'Cell 10', None, None],
    'Column D': ['Cell 4', None, None, None, None]
})

def get_Columns(df, val):
    # Boolean Series indexed by column name:
    # True where the column contains val at least once
    columns = df.eq(val).any()
    # Keep only the column names where the mask is True
    return list(columns[columns].index)
    
x = get_Columns(df, 'Cell 4')
print(x)

Output:

['Column A', 'Column D']