I have a dataframe which has one column of lists and one column with the number of objects in that list and is sorted by its descending order. For example:
df=pd.DataFrame({'value':[['AB','BC','CD','DE','EF','FG','GH','HI'],
['BC','CD','DE','IJ','JK','KL','LM'],
['AB','CD','DE','MN'],
['C', 'D', 'M'],
['MN','NO'],
['APQ']],
'no_of_values': [8,7,4,3,2,1]})
I would like to have a 3rd column with number of values that occur for the first time (from top to bottom). For example:
df_goal=pd.DataFrame({'value':[['AB','BC','CD','DE','EF','FG','GH','HI'],
['BC','CD','DE','IJ','JK','KL','LM'],
['AB','CD','DE','MN'],
['C', 'D', 'M'],
['MN','NO'],
['APQ']],
'no_of_values': [8,7,4,3,2,1],
'no_of_1st_occurence': [8,4,1,3,1,1]})
My approach was to go through each row of the 'value' column and add each value, that is not already part of it, to a 'non_redundant_list' . If the letter is not already part of it, the column 'no_of_1st_occurence' should also go 1 in the respective row. I tried it as follows:
df['no_of_1st_occurence'] = 0
non_redundant_list = []
for index in df.index:
for list in df['value'][index]:
for value in list:
if not value in non_redundant_list:
non_redundant_list.append(value)
df['no_of_1st_occurence'][index] = 1
However this somehow only checks for first occurence of every single letter and not of the objects in the lists. How do I have to adapt my code to work or is there any more simple solution?
CodePudding user response:
You can do it in a much more efficient way using explode
:
df['no_of_1st_occurence'] = (~df['value'].explode().duplicated()).groupby(level=0).sum()
Output:
>>> df
value no_of_values no_of_1st_occurence
0 [AB, BC, CD, DE, EF, FG, GH, HI] 8 8
1 [BC, CD, DE, IJ, JK, KL, LM] 7 4
2 [AB, CD, DE, MN] 4 1
3 [C, D, M] 3 3
4 [MN, NO] 2 1
5 [APQ] 1 1