I have a unique situation where my dataset contains multiple survey responses that were asked on two different scales primarily - a 5 point scale and then a 10 point scale and I have consolidated all of these responses in one dataframe. Now I would like to split and create a new column in my dataframe that can tell by looking into the responses and correspondinlgy identify it it's a 5 point scale or a 10 point scale question. For a response where there are no numbers mentioned such as 1-5 scale or 1-10 scale, the output should be blank. My dataframe looks like:
Question_Text
on a scale of 1 – 10 how well would you rate the following statements.
on a scale of 1 to 10 how well would you rate the following statements.
on a scale of 1-10 how well would you rate the following statements.
on a scale of 1 10 how well would you rate the following statements.
on a scale of 1 – 5 how well would you rate the following statements.
on a scale of 1 to 5 how well would you rate the following statements.
on a scale of 1-5 how well would you rate the following statements.
on a scale of 1 5 how well would you rate the following statements.
please tell us how ready you feel for this (0 - 6 not ready, 6-8 somewhat ready, and 9-10 ready)
how useful did you find the today’s webinar?
and what I would like to achieve looks like:
Question_Text Type_of_Question
on a scale of 1 – 10 how well would you rate the following 10 point scale
on a scale of 1 to 10 how well would you rate the following 10 point scale
on a scale of 1 to 5 how well would you rate the following 5 point scale
please tell us how ready you feel for this (0 - 6 not ready)... 10 point scale
how useful did you find the today’s webinar?
...
Is there any possible way to achieve this? Can a pattern be identified using regex that can take care of different sorts of inputs as I have shown above?
CodePudding user response:
You can check to see if 5
or 10
exists in your column and create the new column based on that:
df["Type_of_Question"] = df.Question_Text.apply(lambda x: "10 point scale" if "10" in x else "5 point scale")
print(df)
Question_Text Type_of_Question
0 on a scale of 1 – 10 how well would you rate... 10 point scale
1 on a scale of 1 to 10 how well would you rate ... 10 point scale
2 on a scale of 1-10 how well would you rate the... 10 point scale
3 on a scale of 1 10 how well would you rate the... 10 point scale
4 on a scale of 1 – 5 how well would you rate ... 5 point scale
5 on a scale of 1 to 5 how well would you rate t... 5 point scale
6 on a scale of 1-5 how well would you rate the ... 5 point scale
7 on a scale of 1 5 how well would you rate the ... 5 point scale
8 please tell us how ready you feel for this (0 ... 10 point scale
CodePudding user response:
Assuming "Question_Text" the column of interest and that you only have numbers that describe the scale, you could extractall
and get the max number to compare it to 5
. If greater, we assume it's a 10 points scale:
import numpy as np
df['Type_of_Question'] = np.where(df['Question_Text'].str.extractall('(\d )')[0]
.astype(int).groupby(level=0).max().gt(5),
'10 point scale', '5 point scale')
Output:
Question_Text Type_of_Question
0 on a scale of 1 – 10 how well would you rate the following statements. 10 point scale
1 on a scale of 1 to 10 how well would you rate the following statements. 10 point scale
2 on a scale of 1-10 how well would you rate the following statements. 10 point scale
3 on a scale of 1 10 how well would you rate the following statements. 10 point scale
4 on a scale of 1 – 5 how well would you rate the following statements. 5 point scale
5 on a scale of 1 to 5 how well would you rate the following statements. 5 point scale
6 on a scale of 1-5 how well would you rate the following statements. 5 point scale
7 on a scale of 1 5 how well would you rate the following statements. 5 point scale
8 please tell us how ready you feel for this (0 - 6 not ready, 6-8 somewhat ready, and 9-10 ready) 10 point scale