How to split a column of items - which are not in order, into their own columns in pandas?-CodePudding

So basically, I am looking at survey data, and one of the questions has an answer where people can select multiple options. All these options will now be under one column, but based on the order in which the respondent has selected the options. So basically, when I perform value_counts() on the column, it looks like this:

A       10
B       15
C       6
D       19
E       23
A,B     2
A,C     5
A,B,E   7
E,A,C   4
B,C     6
..

So now I want to select a combination where respondents have selected at least one of A,B,C or D to this question, but not just A, just B, just C or just D. So in essence, I want the combinations where the selected option is more than one, and the options have at least A/B/C/D. Ex: (A,D,E), (A,B,F), (B,F) and so on.

I have tried splitting this with a simple delimiter and making a column for each option, but the problem is: not all the rows are of same length and also, the order is not always the same and all the first elements go under the first column, which again makes it useless. I have tried manually selecting the options from value counts, like:

df = df[(df['variable'] == 'A,D,E') | (df['variable'] == 'B,F')]

But I want to be able to automate this, but just don't know how.

I have also tried looking for a sub-string, creating a column, and adding all the numbers - but the issue with this is, it also takes respondents who answered just A, just B, etc. into consideration, which is not something I want.

x ='A'
df["A_column"]= df["variable"].str.find(x)

Can someone please help me with the logic of this?

CodePudding user response：

(first I'm creating a dataframe I think is similar to yours)

df = pd.DataFrame([{'answer': 'A,B,C'}, {'answer': 'A,D'}, {'answer': 'B,C,D'}, {'answer': 'D,E'}, {'answer': 'E'}, {'answer': 'B,C'}])
df
  answer
0  A,B,C
1    A,D
2  B,C,D
3    D,E
4      E
5    B,C

first step is to turn your csv strings into an explodable column type

df2 = df.reset_index().rename(
    columns={'index': 'original_index'}
).pipe(lambda x: x.assign(
    split_answer=x.answer.apply(lambda a: a.split(',')),
    selected=True
))
df2
   original_index answer split_answer  selected
0               0  A,B,C    [A, B, C]      True
1               1    A,D       [A, D]      True
2               2  B,C,D    [B, C, D]      True
3               3    D,E       [D, E]      True
4               4      E          [E]      True
5               5    B,C       [B, C]      True

second you can explode the new column into multiple rows per answer and pivot the options out into columns

df2.explode('split_answer').pivot(columns='split_answer', index='original_index', values='selected').fillna(False)
split_answer        A      B      C      D      E
original_index                                   
0                True   True   True  False  False
1                True  False  False   True  False
2               False   True   True   True  False
3               False  False  False   True   True
4               False  False  False  False   True
5               False   True   True  False  False

the selected=True and .fillna(False) is just to get the final result to look nice with booleans

CodePudding user response：

One solution would be to check for the presence of any of the individual characters in the column name but also that the column name is more than one character long. This is brittle in the sense that it only works for your particular problem, and if your selection criteria changes at all, then this won't work anymore.

For example if you have the following df:

>>> df
  variable  value
0        A      0
1        B      1
2        C      2
3        D      3
4      A,B      4
5    A,B,C      5
6      A,D      6
7  A,B,C,D      7

Then running the following:

df[(df['variable'].str.contains("A|B|C|D")) & (df['variable'].str.len() > 1)]

... will return this:

  variable  value
4      A,B      4
5    A,B,C      5
6      A,D      6
7  A,B,C,D      7