I have a dataset that I am manipulating with Python's pandas
library. The data frame has one string variable that is the ID of interest. I have a set of other variables with a shared prefix (i.e. name_
). For each row, I want to count the number of variables in this set that have the ID as the value.
Question: Is there a pandas 1 liner that I can use?
Here is an example dataset
import pandas as pd
df = pd.DataFrame({
'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
'name_01': ['def55', 'abc', 'ab?c', 'def'],
'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
'not_name': [0, 1, 2, 3],
'other_str': ['str1', 'str2', 'str3', 'str'],
})
and, for each row, I want to count the number of times variables with the prefix name_
equal ID_var
. So the desired output is:
import pandas as pd
df_final = pd.DataFrame({
'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
'name_01': ['def55', 'abc', 'ab?c', 'def'],
'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
'not_name': [0, 1, 2, 3],
'other_x': ['str1', 'str2', 'str3', 'str'],
'total_name':[1,0,3,2]
})
I haven't been able to find this elsewhere on SO. I suspect that there is a way I can use pd.str.contains
but I am not sure how. Thank you in advance
CodePudding user response:
Here's a pandas one-liner, a bit faster on my computer than @Timeless. The filter
subsets to just columns that start wtih name_
, then the .eq
compares the values to the first column by-column (axis=0
) then sums by row (axis=1
)
df['total_name'] = df.filter(regex='^name_').eq(df['ID_var'],axis=0).sum(axis=1)
Timing: 196 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df['total_name'] = df.filter(regex='^name_').eq(df['ID_var'],axis=0).sum(axis=1)
Compared to: 353 µs ± 6.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df["total_name"] = [(df.iloc[i, 0]==df.iloc[i, 1:]).sum() for i in range(len(df))]
CodePudding user response:
Question: Is there a pandas 1 liner that I can use?
One option, is to use a listcomp with iloc
:
df["total_name"] = [(df.iloc[i, 0]==df.iloc[i, 1:]).sum() for i in range(len(df))]
Output :
print(df)
ID_var name_01 name_02 name_03 not_name other_str total_name
0 ab?c def55 ab?c ghi55 0 str1 1
1 xyzyy abc jkl123 mn_o 1 str2 0
2 ab?c ab?c ab?c ab?c 2 str3 3
3 ghi55 def ghi55 ghi55 3 str 2
#%%timeit : 547 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)