Is there a way to drop rows with low frequency given a threshold value-CodePudding

With the following data, I would like to remove the rows with low frequency. Suppose I pick 2 as a threshold, how can I only keep rows whose frequency is greater than 2, or threshold x.

data = [[1, 0, 0, 0, 1 ], [0, 1, 1, 0, 2], 
       [3, 1, 0, 1, 0], [0, 1, 1, 0, 0],
      [2, 0,0 , 0, 0], [1, 0, 2, 2, 0], 
      [0, 3, 0, 1, 3], [0, 0, 1, 0, 0], 
      [0, 0, 3, 0, 1], [0, 0, 1, 0, 0]] 
      
df = pd.DataFrame(data, columns =[1608, 1772, 1790, 1791, 1799], 
                   index = [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009],
                                           dtype = int) 
df.index.name = 'user'

print(df)

user    1608  1772  1790  1791  1799
1000     1     0     0     0     1
1001     0     1     1     0     2
1002     3     1     0     1     0
1003     0     1     1     0     0
1004     2     0     0     0     0
1005     1     0     2     2     0
1006     0     3     0     1     3
1007     0     0     1     0     0
1008     0     0     3     0     1
1009     0     0     1     0     0

Desired output: I will only keep rows where the user (row id) has bought from 3 or more retailers (columns). Note that this is not the total of each row. For example, user 1000 will be dropped since he only bought from two retailers, but user 1001 will be kept, he bought from 3 retailers.

Actual data has 2 Million rows and 30K columns

user   1608  1772  1790  1791  1799
1001     0     1     1     0     2
1002     3     1     0     1     0
1005     1     0     2     2     0
1006     0     3     0     1     3

I have done some reading on the site and don't seem to come across an example much similar to my problem. Much appreciate your help.

CodePudding user response：

We can count the non zero values along the columns axis, then filter the rows which have count >= 3

df[df.ne(0).sum(1).ge(3)]

      1608  1772  1790  1791  1799
user                              
1001     0     1     1     0     2
1002     3     1     0     1     0
1005     1     0     2     2     0
1006     0     3     0     1     3

CodePudding user response：

You can count not 0 values by compare for not equal with sum in boolean indexing:

df = df[df.ne(0).sum(axis=1) >= 3]
print (df)
      1608  1772  1790  1791  1799
user                              
1001     0     1     1     0     2
1002     3     1     0     1     0
1005     1     0     2     2     0
1006     0     3     0     1     3

Details:

print(df.ne(0))
       1608   1772   1790   1791   1799
user                                   
1000   True  False  False  False   True
1001  False   True   True  False   True
1002   True   True  False   True  False
1003  False   True   True  False  False
1004   True  False  False  False  False
1005   True  False   True   True  False
1006  False   True  False   True   True
1007  False  False   True  False  False
1008  False  False   True  False   True
1009  False  False   True  False  False

print (df.ne(0).sum(axis=1))
user
1000    2
1001    3
1002    3
1003    2
1004    1
1005    3
1006    3
1007    1
1008    2
1009    1
dtype: int64

CodePudding user response：

You can achieve this by a temporary column like max of all sales.

First create a max sales column:

df['maxsales'] = df.max(axis=1)

Then filter through max sales columns where x is your threshold:

df2 = df.loc[df.maxsales>x,:]

Finally remove/drop maxsales column:

df2 = df2.drop('maxsales', axis=1)