I am analyzing a consumer survey and doing the data cleaning with Pandas. I have a question where participants could answer how frequently they are confronted with ads ('Daily': 1, 'Multiple times a week': 2, 'Once a week': 3, 'Once a year': 4, 'Never': 5).
If a participant answered that they see ads at least weekly (1, 2 or 3), they got a whole new set of questions based on the product categories: whether they were confronted with ads for them and how often. The survey system would not ask about all of the categories but would randomize the survey so that 4 categories were asked about. The DataFrame with the answers looks like this:
Respondent ID | Question Frequency Ads | Question Product 1 | Question Product 2 | Question Product 3 | ... | Question Product 19 | Question Product 20 |
---|---|---|---|---|---|---|---|
1 | 5 | | | | | | |
2 | 4 | | | | | | |
3 | 2 | 1 | 2 | 3 | 5 | | |
4 | 1 | 1 | 3 | 5 | 2 | | |
5 | 1 | 5 | 3 | 5 | 2 | | |
5 | 1 | 4 | 5 | 5 | 2 | | |
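For reference, something like the following rebuilds a toy version of this structure (only 5 of the 20 product columns are included, blank cells are taken to be NaN, and since the table does not show which later category holds the fourth answer, it is placed in 'Question Product 19' arbitrarily):

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Respondent ID':          [1, 2, 3, 4, 5, 5],
    'Question Frequency Ads': [5, 4, 2, 1, 1, 1],
    'Question Product 1':     [np.nan, np.nan, 1, 1, 5, 4],
    'Question Product 2':     [np.nan, np.nan, 2, 3, 3, 5],
    'Question Product 3':     [np.nan, np.nan, 3, 5, 5, 5],
    'Question Product 19':    [np.nan, np.nan, 5, 2, 2, 2],  # assumed placement of the 4th answer
    'Question Product 20':    [np.nan] * 6,
})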
I therefore want to filter for the respondents that filled in the questionnaire as intended. The respondents that answered that they are confronted at least weekly with ads should have an answer for at least 4 product questions. I already tried the following code:
data = data[((data['Question Frequency Ads'].isin([1,2,3])) & (data['Question Product 1'].isnull() + data['Question Product 2'].isnull() + data['Question Product 3'].isnull() + ... + data['Question Product 20'].isnull())) == (20-4)]
I realize this code will not work because when adding these boolean expressions, you just get another True/False value per row and not an integer that shows how many of them are True. Can someone help me with the right expression for this problem?
CodePudding user response:
You could use:
m1 = df['Question Frequency Ads'].between(1, 3)                 # sees ads at least weekly (1, 2 or 3)
m2 = df.filter(like='Question Product').notnull().sum(1).ge(4)  # at least 4 answered product questions
df.loc[m1 & m2, 'valid'] = 'valid'                              # flag the correctly filled-in rows
Output:

   Respondent ID  Question Frequency Ads  Question Product 1  \
0              1                       5                 NaN
1              2                       4                 NaN
2              3                       2                 1.0
3              4                       1                 1.0
4              5                       1                 NaN
5              5                       1                 4.0

   Question Product 2  Question Product 3  ...  Question Product 19  \
0                 NaN                 NaN  NaN                  NaN
1                 NaN                 NaN  NaN                  NaN
2                 2.0                 3.0  NaN                  5.0
3                 NaN                 3.0  NaN                  5.0
4                 5.0                 3.0  NaN                  5.0
5                 5.0                 NaN  NaN                  5.0

   Question Product 20  valid
0                  NaN    NaN
1                  NaN    NaN
2                  NaN  valid
3                  2.0  valid
4                  2.0  valid
5                  2.0  valid
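As a small follow-up, if the goal is to end up with only the correctly filled-in rows rather than a flag column, the same masks can be reused, for example:

df_valid = df[m1 & m2]   # or equivalently: df[df['valid'].eq('valid')]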
CodePudding user response:
IIUC, try:
- Keep all rows where Question Frequency Ads is in [1, 2, 3], and
- where the number of product questions that have a response is exactly 4.
>>> data[data["Question Frequency Ads"].isin([1,2,3]) & data.filter(like="Question Product").count(1).eq(4)]
Or:
>>> data[data["Question Frequency Ads"].isin([1,2,3]) & data.drop(["Respondent ID", "Question Frequency Ads"], axis=1).count(1).eq(4)]
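Note that .eq(4) keeps rows with exactly 4 answered product questions; since the survey asks exactly 4 categories this matches the requirement, but if "at least 4" is the intended rule, .ge(4) would be the drop-in replacement:

>>> data[data["Question Frequency Ads"].isin([1,2,3]) & data.filter(like="Question Product").count(1).ge(4)]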
CodePudding user response:
You can use:
out = df[df['Question Frequency Ads'].le(3) & df.iloc[:, 2:].notna().sum(axis=1).ge(4)]
Output:
Respondent ID | Question Frequency Ads | Question Product 1 | Question Product 2 | Question Product 3 | Question Product 19 | Question Product 20 |
---|---|---|---|---|---|---|
3 | 2 | 1 | 2 | 3 | 5 | |
4 | 1 | 1 | 3 | 5 | 2 | |
5 | 1 | 5 | 3 | 5 | 2 | |
5 | 1 | 4 | 5 | 5 | 2 | |
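One note on the design: df.iloc[:, 2:] assumes the product columns are everything after the first two; if the column order might ever change, selecting them by name is a little safer, e.g.:

out = df[df['Question Frequency Ads'].le(3) & df.filter(like='Question Product').notna().sum(axis=1).ge(4)]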
CodePudding user response:
Here is a different approach, using a row-wise check with apply:

def is_int(x):
    # an answered question holds a number; NaN or empty cells fail the conversion
    try:
        int(x)
    except (TypeError, ValueError):
        return False
    return True

def is_valid(line):
    # count answered product questions (everything after the first two columns)
    count = 0
    for el in line[2:]:
        if is_int(el):
            count += 1
    return count == 4

weekly = df[df['Question Frequency Ads'].between(1, 3)]
weekly[weekly.apply(is_valid, axis=1)]
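Roughly the same row-wise check can also be written without the helper, assuming the unanswered cells really are NaN (a sketch of the idea, not a drop-in tested against the real data):

def is_valid(line):
    # non-null entries after the first two columns are answered product questions
    return line[2:].notna().sum() == 4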