I am analyzing a consumer survey and there are both Dutch (NL) and French (FR) respondents. Depending on the answer they gave when we asked about their mother language they got the same questionnaire but translated in Dutch or French. The problem is that the output of Qualtrics (the survey software) gave us the following output:
User_Language | Q1_NL | Q2_NL | ... | Q1_FR | Q_FR | ... |
---|---|---|---|---|---|---|
NL | 1 | 3 | ... | ... | ||
NL | 4 | 4 | .. | ... | ||
NL | 1 | 3 | ... | ... | ||
NL | 2 | 5 | ... | ... | ||
... | ... | ... | ... | ... | ... | ... |
FR | ... | 3 | 2 | ... | ||
FR | .. | 4 | 3 | ... | ||
FR | ... | 2 | 5 | ... | ||
FR | ... | 1 | 2 | ... |
As you can see the answers for the Dutch-speaking participants were recorded in the first n columns, while the French answers were recorded in the following n columns.
How can I cut the French answers from the last n columns and append them on the bottom of the DataFrame as those are answers to the exact same questions?
Thanks!
EDIT: Solutions that make use of grouping by the strings "Q1" or "Q2" are unfortunately not viable as the column names are actually the questions, this was just an example value. I do know the exact range of the French answers and the Dutch answers.
CodePudding user response:
If there aren't any overlapping values you could simply split the data frame into two data frames based on where the dutch values "stop" and the french "start" rename the value columns of those two data frames Q2_NL and so on to simply Q2 ... and then concatenate those frames again into one.
CodePudding user response:
IIUC, you can add a helper column, rename the columns to something common, and stack
/unstack
(df
.assign(index=df.groupby('User_Language').cumcount())
.set_index(['User_Language', 'index'])
.rename(columns=lambda c: c.split('_')[0])
.stack().unstack()
.droplevel('index').reset_index()
)
output:
User_Language Q1 Q2
0 FR 3 2
1 FR 4 3
2 FR 2 5
3 FR 1 2
4 NL 1 3
5 NL 4 4
6 NL 1 3
7 NL 2 5
CodePudding user response:
One approach would be to group
the dataframe along column axis and select the first non null value per row
s = df.set_index('User_Language')
s.groupby(s.columns.str.split('_').str[0], axis=1).first()
Q1 Q2
User_Language
NL 1 3
NL 4 4
NL 1 3
NL 2 5
FR 3 2
FR 4 3
FR 2 5
FR 1 2