I am analyzing a consumer survey and doing the data cleaning with Pandas. I have a question where participants could answer how frequently they are confronted with ads ('Daily': 1, 'Multiple times a week': 2, 'Once a week': 3, 'Once a year': 4, 'Never': 5).
If a participant answered that they see ads at least weekly (1, 2 or 3), they got a whole new set of questions based on the product categories: whether they were confronted with ads for them and how often. The survey system would not ask about all of the categories but would randomize the survey so that 4 categories were asked about. The DataFrame with the answers looks like this:
Respondent ID | Question Frequency Ads | Question Product 1 | Question Product 2 | Question Product 3 | ... | Question Product 19 | Question Product 20 |
---|---|---|---|---|---|---|---|
1 | 5 | | | | | | |
2 | 4 | | | | | | |
3 | 2 | 1 | 2 | 3 | 5 | | |
4 | 1 | 1 | 3 | 5 | 2 | | |
5 | 1 | 5 | 3 | 5 | 2 | | |
5 | 1 | 4 | 5 | 5 | 2 | | |
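For reference, something like the following rebuilds a toy version of this structure (only 5 of the 20 product columns are included, blank cells are taken to be NaN, and since the table does not show which later category holds the fourth answer, it is placed in 'Question Product 19' arbitrarily):

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Respondent ID':          [1, 2, 3, 4, 5, 5],
    'Question Frequency Ads': [5, 4, 2, 1, 1, 1],
    'Question Product 1':     [np.nan, np.nan, 1, 1, 5, 4],
    'Question Product 2':     [np.nan, np.nan, 2, 3, 3, 5],
    'Question Product 3':     [np.nan, np.nan, 3, 5, 5, 5],
    'Question Product 19':    [np.nan, np.nan, 5, 2, 2, 2],  # assumed placement of the 4th answer
    'Question Product 20':    [np.nan] * 6,
})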
I therefore want to filter for the respondents that filled in the questionnaire as intended. The respondents that answered that they are confronted at least weekly with ads should have an answer for at least 4 product questions. I already tried the following code:
data = data[((data['Question Frequency Ads'].isin([1,2,3])) & (data['Question Product 1'].isnull() + data['Question Product 2'].isnull() + data['Question Product 3'].isnull() + ... + data['Question Product 20'].isnull())) == (20-4)]
I realize this code will not work because when adding these boolean expressions, you just get another True/False value per row and not an integer that shows how many of them are True. Can someone help me with the right expression for this problem?
CodePudding user response:
You could use:
m1 = df['Question Frequency Ads'].between(1, 3)                 # sees ads at least weekly (1, 2 or 3)
m2 = df.filter(like='Question Product').notnull().sum(1).ge(4)  # at least 4 answered product questions
df.loc[m1 & m2, 'valid'] = 'valid'                              # flag the correctly filled-in rows
Output:

   Respondent ID  Question Frequency Ads  Question Product 1  \
0              1                       5                 NaN
1              2                       4                 NaN
2              3                       2                 1.0
3              4                       1                 1.0
4              5                       1                 NaN
5              5                       1                 4.0

   Question Product 2  Question Product 3  ...  Question Product 19  \
0                 NaN                 NaN  NaN                  NaN
1                 NaN                 NaN  NaN                  NaN
2                 2.0                 3.0  NaN                  5.0
3                 NaN                 3.0  NaN                  5.0
4                 5.0                 3.0  NaN                  5.0
5                 5.0                 NaN  NaN                  5.0

   Question Product 20  valid
0                  NaN    NaN
1                  NaN    NaN
2                  NaN  valid
3                  2.0  valid
4                  2.0  valid
5                  2.0  valid
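As a small follow-up, if the goal is to end up with only the correctly filled-in rows rather than a flag column, the same masks can be reused, for example:

df_valid = df[m1 & m2]   # or equivalently: df[df['valid'].eq('valid')]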
CodePudding user response:
IIUC, try:
- Keep all rows where Question Frequency Ads is in [1, 2, 3], and
- where the number of product questions that have a response is exactly 4.
>>> data[data["Question Frequency Ads"].isin([1,2,3]) & data.filter(like="Question Product").count(1).eq(4)]
Or:
>>> data[data["Question Frequency Ads"].isin([1,2,3]) & data.drop(["Respondent ID", "Question Frequency Ads"], axis=1).count(1).eq(4)]
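Note that .eq(4) keeps rows with exactly 4 answered product questions; since the survey asks exactly 4 categories this matches the requirement, but if "at least 4" is the intended rule, .ge(4) would be the drop-in replacement:

>>> data[data["Question Frequency Ads"].isin([1,2,3]) & data.filter(like="Question Product").count(1).ge(4)]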
CodePudding user response:
You can use:
out = df[df['Question Frequency Ads'].le(3) & df.iloc[:, 2:].notna().sum(axis=1).ge(4)]
Output:
Respondent ID | Question Frequency Ads | Question Product 1 | Question Product 2 | Question Product 3 | Question Product 19 | Question Product 20 |
---|---|---|---|---|---|---|
3 | 2 | 1 | 2 | 3 | 5 | |
4 | 1 | 1 | 3 | 5 | 2 | |
5 | 1 | 5 | 3 | 5 | 2 | |
5 | 1 | 4 | 5 | 5 | 2 | |
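One note on the design: df.iloc[:, 2:] assumes the product columns are everything after the first two; if the column order might ever change, selecting them by name is a little safer, e.g.:

out = df[df['Question Frequency Ads'].le(3) & df.filter(like='Question Product').notna().sum(axis=1).ge(4)]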
CodePudding user response:
Here is a different approach, using a row-wise check with apply:

def is_int(x):
    # an answered question holds a number; NaN or empty cells fail the conversion
    try:
        int(x)
    except (TypeError, ValueError):
        return False
    return True

def is_valid(line):
    # count answered product questions (everything after the first two columns)
    count = 0
    for el in line[2:]:
        if is_int(el):
            count += 1
    return count == 4

weekly = df[df['Question Frequency Ads'].between(1, 3)]
weekly[weekly.apply(is_valid, axis=1)]
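Roughly the same row-wise check can also be written without the helper, assuming the unanswered cells really are NaN (a sketch of the idea, not a drop-in tested against the real data):

def is_valid(line):
    # non-null entries after the first two columns are answered product questions
    return line[2:].notna().sum() == 4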