Home > Mobile >  Count the total number of times the letter (A,B,C or D) appear in the column incl. in a list
Count the total number of times the letter (A,B,C or D) appear in the column incl. in a list

Time:07-26

I have a dataframe with 23 columns. One of the evaluation criteria is represented by letters [A,B,C,D], they are sometimes alone and sometimes mixed (example on a sample).

For example with a value_counts():

result =[‘Letter’ : {‘A’, ‘C’, ‘B’,’D’,
                     ’A , B’,’D , C’,’A , B , D’,
                     ’D , C , A , B’,’A , C’,’A , C , B , D’,
                     ’B , A‘,’A , C , D’,’D , A’,’A , C , D , B’}, 
        ‘Occurence’ : {29,11,6,5,2,1,1,1,1,1,1,1,1,1}]

I'd like to be able to count the total number of times the letter (A,B,C or D) appears, the formulas I've tested aren't very pretty and end up acting like a value_counts().

column_abcd = dataframe['ABCD_result']

a_sum = 0
b_sum = 0
c_sum = 0
d_sum = 0
nan_sum = 0
data = {'Letter':['A','B','C','D','NaN'],
        'Occurence':[a_sum,b_sum,c_sum,d_sum,nan_sum]}

for value in column_abcd:
    if value == 'A':
        a_sum = a_sum   1
    elif value == 'B':
        b_sum = b_sum   1
    elif value == 'C':
        c_sum = c_sum   1
    elif value == 'D':
        d_sum = d_sum   1   
    else:
        nan_sum = nan_sum   1

df_data_result = pd.DataFrame(data_result)
df_data_result.loc['total'] = df_data_result['Occurence'].sum()

df_data_result

The expected result is :

expected_result = {'Letter':['A','B','C','D','NaN'],
        'Occurence':[39,13,17,12,49]}

Thank you in advance for your help.

CodePudding user response:

Use:

df['Letter'].str.split(r'\s*,\s*').explode().value_counts()

example:

df = pd.DataFrame({'Letter': ['A', 'A , B,C', 'D', 'A']})

out = df['Letter'].str.split(r'\s*,\s*').explode().value_counts()

output:

A    3
B    1
C    1
D    1
Name: Letter, dtype: int64

CodePudding user response:

Your given example does not work, since the definition of result is wrong. Assuming, that result should be a dict()you can do the following:

result={"Letter" : ["A", "C", "B","D",
                     "A , B","D , C","A , B , D",
                     "D , C , A , B","A , C","A , C , B , D",
                     "B , A","A , C , D","D , A","A , C , D , B"], 
        "Occurence" : [29,11,6,5,2,1,1,1,1,1,1,1,1,1] }

expected_result={"A":0, "B":0,"C":0, "D":0, "NaN":0} # With just 4 Letters initialization was simpler than checking if letter was already added to dict
for idx,(Letters,Occurence) in enumerate(zip(result["Letter"], result["Occurence"])): # Iterating in parallel over both lists 
    for Letter in [Letter.replace(" ","") for Letter in Letters.split(",")]: #Eliminate spaces 
        if Letter in expected_result.keys():
            expected_result[Letter]=expected_result[Letter] Occurence
        else:
            expected_result["NaN"]=expected_result["NaN"] Occurence
print(expected_result)

This gives

{'A': 39, 'B': 13, 'C': 17, 'D': 12, 'NaN': 0}

which matches your given expected_result. However, in your example there are no Letters except A,B,C,D thus "NaN" is empty. If unfamiliar with zip() look here

  • Related