I am working with the pandas library, learning more about data manipulation and analysis. I have this DataFrame, ITEM:
  ItemNo1  ItemNo2  ItemNo3  ItemNo4
0   Paper    Paper     Rock    Paper
1 Scissor  Scissor     Rock      NaN
2    Rock     Rock     Rock    Paper
3 Scissor    Paper  Scissor    Paper
4    Rock    Paper  Scissor     Rock
I'd like to create a DataFrame with two columns, Item and UniqueItemRowCount, where the count considers each item at most once per row. That is, if the same item appears three times in one row, only the first occurrence is counted. This would give the following output:
      Item  UniqueItemRowCount
0    Paper                   4
1     Rock                   4
2  Scissor                   3
Rock gets 4 because rows 0, 1, 2 and 4 each contain at least one Rock; within rows 2 and 4, once the first Rock is found, the remaining Rocks in that row are ignored and not added to the sum.
My main issue is writing a lambda function for this: I cannot make it compute the sum properly. I have also tried setting values to NaN (null) when a duplicate is found within a row, but that does not work either.
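To make the rule concrete, here is the behaviour I'm after written as a plain loop (just the spec, not what I want to ship; the frame is my reconstruction of the table above):

```python
import pandas as pd
from collections import Counter

# Reconstruction of the example frame above (NaN entered as None).
df = pd.DataFrame({
    "ItemNo1": ["Paper", "Scissor", "Rock", "Scissor", "Rock"],
    "ItemNo2": ["Paper", "Scissor", "Rock", "Paper", "Paper"],
    "ItemNo3": ["Rock", "Rock", "Rock", "Scissor", "Scissor"],
    "ItemNo4": ["Paper", None, "Paper", "Paper", "Rock"],
})

# Each item counts at most once per row: reduce every row to the set of
# its non-null values, then tally across rows.
counts = Counter()
for row in df.itertuples(index=False):
    counts.update({v for v in row if pd.notna(v)})

print(counts["Paper"], counts["Rock"], counts["Scissor"])  # 4 4 3
```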
CodePudding user response:
Use DataFrame.melt with DataFrame.drop_duplicates to drop duplicate (column, value) pairs, then count the values with Series.value_counts:
Notice: the expected output is different here, because this removes duplicates per column rather than per row.
df1 = (df.melt()
         .drop_duplicates()['value']
         .value_counts()
         .rename_axis('Item')
         .reset_index(name='UniqueItemColCount'))
print(df1)

      Item  UniqueItemColCount
0     Rock                   4
1  Scissor                   3
2    Paper                   3
If you need to choose the axis along which duplicates are removed, use DataFrame.stack with DataFrame.reset_index and then DataFrame.drop_duplicates: deduplicate on level_0 (the original row index) for unique values per row, or on level_1 (the column name) for unique values per column:
df2 = (df.stack()
         .reset_index()
         .drop_duplicates(['level_0', 0])[0]
         .value_counts()
         .rename_axis('Item')
         .reset_index(name='UniqueItemRowCount'))
print(df2)

      Item  UniqueItemRowCount
0     Rock                   4
1    Paper                   4
2  Scissor                   3
df1 = (df.stack()
         .reset_index()
         .drop_duplicates(['level_1', 0])[0]
         .value_counts()
         .rename_axis('Item')
         .reset_index(name='UniqueItemColCount'))
print(df1)

      Item  UniqueItemColCount
0     Rock                   4
1  Scissor                   3
2    Paper                   3
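The same per-row deduplication can also be written without reset_index, by grouping the stacked Series on the row index. A sketch (the frame is reconstructed from the question):

```python
import pandas as pd

# Reconstruction of the question's frame (NaN entered as None).
df = pd.DataFrame({
    "ItemNo1": ["Paper", "Scissor", "Rock", "Scissor", "Rock"],
    "ItemNo2": ["Paper", "Scissor", "Rock", "Paper", "Paper"],
    "ItemNo3": ["Rock", "Rock", "Rock", "Scissor", "Scissor"],
    "ItemNo4": ["Paper", None, "Paper", "Paper", "Rock"],
})

# stack() drops NaN; group by the original row index (level 0), keep each
# row's unique values, flatten, and count across rows.
out = df.stack().groupby(level=0).unique().explode().value_counts()
print(out)
```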
CodePudding user response:
You can remove the duplicate values by aggregating each row into a set before value_counts:
(df.agg(set, axis=1)
.explode()
.value_counts()
#.rename_axis('Item')
#.reset_index(name='UniqueItemRowCount')
)
Alternative with reshaping:
(df.stack().reset_index(name='Item')
.drop_duplicates(subset=['level_0', 'Item'])
['Item'].value_counts()
#.reset_index(name='UniqueItemRowCount')
)
Output as Series:

Paper      4
Rock       4
Scissor    3
dtype: int64
Output as DataFrame:

      Item  UniqueItemRowCount
0    Paper                   4
1     Rock                   4
2  Scissor                   3
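As a quick cross-check, both variants above agree with the expected counts (the frame is reconstructed from the question):

```python
import pandas as pd

# Reconstruction of the question's frame (NaN entered as None).
df = pd.DataFrame({
    "ItemNo1": ["Paper", "Scissor", "Rock", "Scissor", "Rock"],
    "ItemNo2": ["Paper", "Scissor", "Rock", "Paper", "Paper"],
    "ItemNo3": ["Rock", "Rock", "Rock", "Scissor", "Scissor"],
    "ItemNo4": ["Paper", None, "Paper", "Paper", "Rock"],
})

# Variant 1: aggregate each row to a set, explode, count
# (NaN is dropped by value_counts).
s1 = df.agg(set, axis=1).explode().value_counts()

# Variant 2: stack, drop duplicate (row, value) pairs, count.
s2 = (df.stack().reset_index(name='Item')
        .drop_duplicates(subset=['level_0', 'Item'])
        ['Item'].value_counts())

# Both match the expected per-row counts.
assert s1.sort_index().to_dict() == s2.sort_index().to_dict() == {
    'Paper': 4, 'Rock': 4, 'Scissor': 3}
```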