Home > Net >  Python: pandas count row-wise string matches across specific variables
Python: pandas count row-wise string matches across specific variables

Time:02-03

I have a dataset that I am manipulating with Python's pandas library. The data frame has one string variable that is the ID of interest. I have a set of other variables with a shared prefix (i.e. name_). For each row, I want to count the number of variables in this set that have the ID as the value.

Question: Is there a pandas 1 liner that I can use?

Here is an example dataset

import pandas as pd

df = pd.DataFrame({
        'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
        'name_01': ['def55', 'abc',  'ab?c', 'def'],
        'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
        'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
        'not_name': [0, 1, 2, 3],
        'other_str': ['str1', 'str2', 'str3', 'str'],    
    })

and, for each row, I want to count the number of times variables with the prefix name_ equal ID_var. So the desired output is:

import pandas as pd
df_final = pd.DataFrame({
        'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
        'name_01': ['def55', 'abc',  'ab?c', 'def'],
        'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
        'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
        'not_name': [0, 1, 2, 3],
        'other_x': ['str1', 'str2', 'str3', 'str'],    
        'total_name':[1,0,3,2]
    })

I haven't been able to find this elsewhere on SO. I suspect that there is a way I can use pd.str.contains but I am not sure how. Thank you in advance

CodePudding user response:

Here's a pandas one-liner, a bit faster on my computer than @Timeless. The filter subsets to just columns that start wtih name_, then the .eq compares the values to the first column by-column (axis=0) then sums by row (axis=1)

df['total_name'] = df.filter(regex='^name_').eq(df['ID_var'],axis=0).sum(axis=1)

Timing: 196 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
df['total_name'] = df.filter(regex='^name_').eq(df['ID_var'],axis=0).sum(axis=1)

Compared to: 353 µs ± 6.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
df["total_name"] = [(df.iloc[i, 0]==df.iloc[i, 1:]).sum() for i in range(len(df))]

CodePudding user response:

Question: Is there a pandas 1 liner that I can use?

One option, is to use a listcomp with iloc :

df["total_name"] = [(df.iloc[i, 0]==df.iloc[i, 1:]).sum() for i in range(len(df))]

Output :

print(df)

  ID_var name_01 name_02 name_03  not_name other_str  total_name
0   ab?c   def55    ab?c   ghi55         0      str1           1
1  xyzyy     abc  jkl123    mn_o         1      str2           0
2   ab?c    ab?c    ab?c    ab?c         2      str3           3
3  ghi55     def   ghi55   ghi55         3       str           2

#%%timeit : 547 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  • Related