So basically, I am looking at survey data, and one of the questions has an answer where people can select multiple options. All these options will now be under one column, but based on the order in which the respondent has selected the options. So basically, when I perform value_counts() on the column, it looks like this:
A 10
B 15
C 6
D 19
E 23
A,B 2
A,C 5
A,B,E 7
E,A,C 4
B,C 6
..
So now I want to select a combination where respondents have selected at least one of A,B,C or D to this question, but not just A, just B, just C or just D. So in essence, I want the combinations where the selected option is more than one, and the options have at least A/B/C/D. Ex: (A,D,E), (A,B,F), (B,F) and so on.
I have tried splitting this with a simple delimiter and making a column for each option, but the problem is: not all the rows are of same length and also, the order is not always the same and all the first elements go under the first column, which again makes it useless. I have tried manually selecting the options from value counts, like:
df = df[(df['variable'] == 'A,D,E') | (df['variable'] == 'B,F')]
But I want to be able to automate this, but just don't know how.
I have also tried looking for a sub-string, creating a column, and adding all the numbers - but the issue with this is, it also takes respondents who answered just A, just B, etc. into consideration, which is not something I want.
x ='A'
df["A_column"]= df["variable"].str.find(x)
Can someone please help me with the logic of this?
CodePudding user response:
(first I'm creating a dataframe I think is similar to yours)
df = pd.DataFrame([{'answer': 'A,B,C'}, {'answer': 'A,D'}, {'answer': 'B,C,D'}, {'answer': 'D,E'}, {'answer': 'E'}, {'answer': 'B,C'}])
df
answer
0 A,B,C
1 A,D
2 B,C,D
3 D,E
4 E
5 B,C
first step is to turn your csv strings into an explodable column type
df2 = df.reset_index().rename(
columns={'index': 'original_index'}
).pipe(lambda x: x.assign(
split_answer=x.answer.apply(lambda a: a.split(',')),
selected=True
))
df2
original_index answer split_answer selected
0 0 A,B,C [A, B, C] True
1 1 A,D [A, D] True
2 2 B,C,D [B, C, D] True
3 3 D,E [D, E] True
4 4 E [E] True
5 5 B,C [B, C] True
second you can explode the new column into multiple rows per answer and pivot the options out into columns
df2.explode('split_answer').pivot(columns='split_answer', index='original_index', values='selected').fillna(False)
split_answer A B C D E
original_index
0 True True True False False
1 True False False True False
2 False True True True False
3 False False False True True
4 False False False False True
5 False True True False False
the selected=True
and .fillna(False)
is just to get the final result to look nice with booleans
CodePudding user response:
One solution would be to check for the presence of any of the individual characters in the column name but also that the column name is more than one character long. This is brittle in the sense that it only works for your particular problem, and if your selection criteria changes at all, then this won't work anymore.
For example if you have the following df:
>>> df
variable value
0 A 0
1 B 1
2 C 2
3 D 3
4 A,B 4
5 A,B,C 5
6 A,D 6
7 A,B,C,D 7
Then running the following:
df[(df['variable'].str.contains("A|B|C|D")) & (df['variable'].str.len() > 1)]
... will return this:
variable value
4 A,B 4
5 A,B,C 5
6 A,D 6
7 A,B,C,D 7