With the following data, I would like to remove the rows with low frequency. Suppose I pick 2
as a threshold, how can I only keep rows whose frequency is greater than 2, or threshold x
.
data = [[1, 0, 0, 0, 1 ], [0, 1, 1, 0, 2],
[3, 1, 0, 1, 0], [0, 1, 1, 0, 0],
[2, 0,0 , 0, 0], [1, 0, 2, 2, 0],
[0, 3, 0, 1, 3], [0, 0, 1, 0, 0],
[0, 0, 3, 0, 1], [0, 0, 1, 0, 0]]
df = pd.DataFrame(data, columns =[1608, 1772, 1790, 1791, 1799],
index = [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009],
dtype = int)
df.index.name = 'user'
print(df)
user 1608 1772 1790 1791 1799
1000 1 0 0 0 1
1001 0 1 1 0 2
1002 3 1 0 1 0
1003 0 1 1 0 0
1004 2 0 0 0 0
1005 1 0 2 2 0
1006 0 3 0 1 3
1007 0 0 1 0 0
1008 0 0 3 0 1
1009 0 0 1 0 0
Desired output:
I will only keep rows where the user (row id) has bought from 3 or more retailers (columns). Note that this is not the total of each row. For example, user 1000
will be dropped since he only bought from two retailers, but user 1001
will be kept, he bought from 3
retailers.
Actual data has 2 Million rows and 30K columns
user 1608 1772 1790 1791 1799
1001 0 1 1 0 2
1002 3 1 0 1 0
1005 1 0 2 2 0
1006 0 3 0 1 3
I have done some reading on the site and don't seem to come across an example much similar to my problem. Much appreciate your help.
CodePudding user response:
We can count the non zero values along the columns axis, then filter the rows which have count >= 3
df[df.ne(0).sum(1).ge(3)]
1608 1772 1790 1791 1799
user
1001 0 1 1 0 2
1002 3 1 0 1 0
1005 1 0 2 2 0
1006 0 3 0 1 3
CodePudding user response:
You can count not 0
values by compare for not equal with sum
in boolean indexing
:
df = df[df.ne(0).sum(axis=1) >= 3]
print (df)
1608 1772 1790 1791 1799
user
1001 0 1 1 0 2
1002 3 1 0 1 0
1005 1 0 2 2 0
1006 0 3 0 1 3
Details:
print(df.ne(0))
1608 1772 1790 1791 1799
user
1000 True False False False True
1001 False True True False True
1002 True True False True False
1003 False True True False False
1004 True False False False False
1005 True False True True False
1006 False True False True True
1007 False False True False False
1008 False False True False True
1009 False False True False False
print (df.ne(0).sum(axis=1))
user
1000 2
1001 3
1002 3
1003 2
1004 1
1005 3
1006 3
1007 1
1008 2
1009 1
dtype: int64
CodePudding user response:
You can achieve this by a temporary column like max of all sales.
First create a max sales column:
df['maxsales'] = df.max(axis=1)
Then filter through max sales columns where x is your threshold:
df2 = df.loc[df.maxsales>x,:]
Finally remove/drop maxsales column:
df2 = df2.drop('maxsales', axis=1)