Pandas Avoiding Summation of Duplicates


I am working with the pandas library and learning more about data manipulation and analysis with it. I have this DataFrame, ITEM:

  ItemNo1 ItemNo2 ItemNo3 ItemNo4 
0   Paper   Paper    Rock   Paper
1 Scissor Scissor    Rock     NaN
2    Rock    Rock    Rock   Paper
3 Scissor   Paper Scissor   Paper
4    Rock   Paper Scissor    Rock
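
For reference, the frame can be rebuilt like this (a sketch; values taken from the table above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ItemNo1': ['Paper', 'Scissor', 'Rock', 'Scissor', 'Rock'],
    'ItemNo2': ['Paper', 'Scissor', 'Rock', 'Paper', 'Paper'],
    'ItemNo3': ['Rock', 'Rock', 'Rock', 'Scissor', 'Scissor'],
    'ItemNo4': ['Paper', np.nan, 'Paper', 'Paper', 'Rock'],
})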

I'd like to create a DataFrame with two columns, Item and UniqueItemRowCount. The count should only consider the unique values in each row: if the same item appears three times in a row, only the first occurrence counts. This would mean the following output:


      Item  UniqueItemRowCount
0    Paper                   4
1     Rock                   4
2  Scissor                   3

Rock gets 4 because rows 0 and 1 each contain one Rock, and in rows 2 and 4 only the first Rock found is counted; the remaining Rocks in those rows are ignored and not added to the sum.

My main issue is writing a lambda function for this, as I cannot get it to compute the sum correctly. I have also tried setting values to NaN (null) when a duplicate is found in the row, but that does not work either.
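
For illustration, a sketch of that row-wise NaN idea using Series.duplicated could look like this (assuming df is the frame above):

# Sketch: mark second and later occurrences within each row, blank them out, then count
row_dups = df.apply(pd.Series.duplicated, axis=1)    # True for repeated values within a row
unique_per_row = df.mask(row_dups)                   # repeats become NaN
counts = (unique_per_row.stack()                     # stack() drops the NaN values
            .value_counts()
            .rename_axis('Item')
            .reset_index(name='UniqueItemRowCount'))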

CodePudding user response:

Use DataFrame.melt with DataFrame.drop_duplicates and count the values with Series.value_counts. Because drop_duplicates here removes duplicate (column, value) pairs of the melted frame, this counts unique values per column:

Notice: the output is different from the expected output, because duplicates are removed per column here, not per row:

Rock gets 4 because rows 0 and 1 each contain one Rock, and in rows 2 and 4 only the first Rock found is counted; the remaining Rocks in those rows are ignored and not added to the sum.

df1 = (df.melt()                          # long format: one (original column, value) pair per row
         .drop_duplicates()['value']      # keep each (column, value) pair once
         .value_counts()
         .rename_axis('Item')
         .reset_index(name='UniqueItemColCount'))
print (df1)
      Item  UniqueItemColCount
0     Rock                   4
1  Scissor                   3
2    Paper                   3
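
To see why this counts per column, inspect the intermediate step; after melt() every value is paired with the original column name, so drop_duplicates() keeps one row per (column, value) pair:

# Sketch: the surviving rows of the melted frame, one per (original column, value) pair
print (df.melt().drop_duplicates())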

If you reshape with DataFrame.stack and DataFrame.reset_index instead, DataFrame.drop_duplicates lets you choose the grouping: deduplicating on level_0 (the original row index) keeps unique values per row, while deduplicating on level_1 (the original column name) keeps unique values per column. The first variant below reproduces the expected output from the question; the second matches the melt result above:

df2 = (df.stack()
        .reset_index()
        .drop_duplicates(['level_0',0])[0]   # keep each value once per original row (level_0)
        .value_counts()
        .rename_axis('Item')
        .reset_index(name='UniqueItemRowCount'))
print (df2)
      Item  UniqueItemRowCount
0     Rock                   4
1    Paper                   4
2  Scissor                   3

df1 = (df.stack()
        .reset_index()
        .drop_duplicates(['level_1',0])[0]   # keep each value once per original column (level_1)
        .value_counts()
        .rename_axis('Item')
        .reset_index(name='UniqueItemColCount'))
print (df1)
      Item  UniqueItemColCount
0     Rock                   4
1  Scissor                   3
2    Paper                   3
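
For comparison, a groupby-based sketch of the same per-row count avoids the generated level_0/level_1 column names entirely:

df3 = (df.stack()                  # drops NaN and yields a (row, column) MultiIndex
         .groupby(level=0)         # group by the original row index
         .unique()                 # distinct values per row, as arrays
         .explode()
         .value_counts()
         .rename_axis('Item')
         .reset_index(name='UniqueItemRowCount'))
print (df3)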

CodePudding user response:

You can remove the duplicate values per row by aggregating each row into a set before value_counts:

(df.agg(set, axis=1)     # collapse each row to its set of unique values
   .explode()
   .value_counts()       # NaN from the incomplete row is dropped by default
   #.rename_axis('Item')
   #.reset_index(name='UniqueItemRowCount')
 )

Alternative with reshaping:

(df.stack().reset_index(name='Item')
   .drop_duplicates(subset=['level_0', 'Item'])   # one occurrence per original row
   ['Item'].value_counts()
  #.reset_index(name='UniqueItemRowCount')
 )

Output as Series:

Paper      4
Rock       4
Scissor    3
dtype: int64

Output as DataFrame (with the commented lines enabled):

      Item  UniqueItemRowCount
0    Paper                   4
1     Rock                   4
2  Scissor                   3
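
For reference, the DataFrame output corresponds to the first alternative with its commented lines enabled; roughly:

out = (df.agg(set, axis=1)
         .explode()
         .value_counts()
         .rename_axis('Item')                        # index name becomes the 'Item' column
         .reset_index(name='UniqueItemRowCount'))    # values column gets the requested name
print (out)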