I have a dataframe with 23 columns. One of the evaluation criteria is represented by letters [A,B,C,D], they are sometimes alone and sometimes mixed (example on a sample).
For example with a value_counts():
result =[‘Letter’ : {‘A’, ‘C’, ‘B’,’D’,
’A , B’,’D , C’,’A , B , D’,
’D , C , A , B’,’A , C’,’A , C , B , D’,
’B , A‘,’A , C , D’,’D , A’,’A , C , D , B’},
‘Occurence’ : {29,11,6,5,2,1,1,1,1,1,1,1,1,1}]
I'd like to be able to count the total number of times the letter (A,B,C or D) appears, the formulas I've tested aren't very pretty and end up acting like a value_counts().
column_abcd = dataframe['ABCD_result']
a_sum = 0
b_sum = 0
c_sum = 0
d_sum = 0
nan_sum = 0
data = {'Letter':['A','B','C','D','NaN'],
'Occurence':[a_sum,b_sum,c_sum,d_sum,nan_sum]}
for value in column_abcd:
if value == 'A':
a_sum = a_sum 1
elif value == 'B':
b_sum = b_sum 1
elif value == 'C':
c_sum = c_sum 1
elif value == 'D':
d_sum = d_sum 1
else:
nan_sum = nan_sum 1
df_data_result = pd.DataFrame(data_result)
df_data_result.loc['total'] = df_data_result['Occurence'].sum()
df_data_result
The expected result is :
expected_result = {'Letter':['A','B','C','D','NaN'],
'Occurence':[39,13,17,12,49]}
Thank you in advance for your help.
CodePudding user response:
Use:
df['Letter'].str.split(r'\s*,\s*').explode().value_counts()
example:
df = pd.DataFrame({'Letter': ['A', 'A , B,C', 'D', 'A']})
out = df['Letter'].str.split(r'\s*,\s*').explode().value_counts()
output:
A 3
B 1
C 1
D 1
Name: Letter, dtype: int64
CodePudding user response:
Your given example does not work, since the definition of result
is wrong.
Assuming, that result
should be a dict()
you can do the following:
result={"Letter" : ["A", "C", "B","D",
"A , B","D , C","A , B , D",
"D , C , A , B","A , C","A , C , B , D",
"B , A","A , C , D","D , A","A , C , D , B"],
"Occurence" : [29,11,6,5,2,1,1,1,1,1,1,1,1,1] }
expected_result={"A":0, "B":0,"C":0, "D":0, "NaN":0} # With just 4 Letters initialization was simpler than checking if letter was already added to dict
for idx,(Letters,Occurence) in enumerate(zip(result["Letter"], result["Occurence"])): # Iterating in parallel over both lists
for Letter in [Letter.replace(" ","") for Letter in Letters.split(",")]: #Eliminate spaces
if Letter in expected_result.keys():
expected_result[Letter]=expected_result[Letter] Occurence
else:
expected_result["NaN"]=expected_result["NaN"] Occurence
print(expected_result)
This gives
{'A': 39, 'B': 13, 'C': 17, 'D': 12, 'NaN': 0}
which matches your given expected_result
. However, in your example there are no Letters except A,B,C,D thus "NaN" is empty. If unfamiliar with zip()
look here