Home > Back-end >  How to count first occurence of a value in a dataframe of lists in Python?
How to count first occurence of a value in a dataframe of lists in Python?

Time:03-16

I have a dataframe which has one column of lists and one column with the number of objects in that list and is sorted by its descending order. For example:

df=pd.DataFrame({'value':[['AB','BC','CD','DE','EF','FG','GH','HI'],
                          ['BC','CD','DE','IJ','JK','KL','LM'],
                          ['AB','CD','DE','MN'],
                          ['C', 'D', 'M'],
                          ['MN','NO'],
                          ['APQ']],
                 'no_of_values': [8,7,4,3,2,1]})

I would like to have a 3rd column with number of values that occur for the first time (from top to bottom). For example:

df_goal=pd.DataFrame({'value':[['AB','BC','CD','DE','EF','FG','GH','HI'],
                               ['BC','CD','DE','IJ','JK','KL','LM'],
                               ['AB','CD','DE','MN'],
                               ['C', 'D', 'M'],
                               ['MN','NO'],
                               ['APQ']],
                      'no_of_values': [8,7,4,3,2,1],
                      'no_of_1st_occurence': [8,4,1,3,1,1]})

My approach was to go through each row of the 'value' column and add each value, that is not already part of it, to a 'non_redundant_list' . If the letter is not already part of it, the column 'no_of_1st_occurence' should also go 1 in the respective row. I tried it as follows:

df['no_of_1st_occurence'] = 0
non_redundant_list = []

for index in df.index:
    for list in df['value'][index]:
        for value in list:
            if not value in non_redundant_list:
                non_redundant_list.append(value)
                df['no_of_1st_occurence'][index]  = 1  

However this somehow only checks for first occurence of every single letter and not of the objects in the lists. How do I have to adapt my code to work or is there any more simple solution?

CodePudding user response:

You can do it in a much more efficient way using explode:

df['no_of_1st_occurence'] = (~df['value'].explode().duplicated()).groupby(level=0).sum()

Output:

>>> df
                              value  no_of_values  no_of_1st_occurence
0  [AB, BC, CD, DE, EF, FG, GH, HI]             8                    8
1      [BC, CD, DE, IJ, JK, KL, LM]             7                    4
2                  [AB, CD, DE, MN]             4                    1
3                         [C, D, M]             3                    3
4                          [MN, NO]             2                    1
5                             [APQ]             1                    1
  • Related