given df a dataframe containing the answers to a questionnaire
import pandas as pd
import numpy as np
df = pd.DataFrame(data=[[10,np.nan,'Answer 1','Answer 2'],
[40,"Answer 4","Answer 3",'Answer 2'],
[20,"Answer 2", np.nan,'Answer 1']],
columns = ['Age','Question 1','Question 2', 'Question 3'])
df
Age Question 1 Question 2 Question 3
0 10 NaN Answer 1 Answer 2
1 40 Answer 4 Answer 3 Answer 2
2 20 Answer 2 NaN Answer 1
I would like to create a second dataframe containing for each question which was the minimum age of the respondent
Disclaimer: very ugly code ahead!
df2 = pd.DataFrame(data=df.columns.tolist(),columns=["Questions"])
for col in df2["Questions"]:
if col != "Age":
df2.loc[df2["Questions"]==col,"min_age"] = df.loc[:,["Age",col]].dropna()["Age"].min()
df2
Question min_age
0 Age NaN
1 Question 1 20.0
2 Question 2 10.0
3 Question 3 10.0
CodePudding user response:
You can find the index of the first non-NAN value with Age as ordered index:
df2 = df.set_index('Age').sort_index().notna()
out = df2.idxmax().where(df2.any())
output:
Question 1 20
Question 2 10
Question 3 10
dtype: int64
CodePudding user response:
Another possible solution, based on pandas.melt
and pandas.pivot_table
:
(df.melt(id_vars='Age').dropna()
.pivot_table(index='variable', values='Age', aggfunc='min')
.reset_index(names='Question'))
Output:
Question Age
0 Question 1 20
1 Question 2 10
2 Question 3 10
CodePudding user response:
This could be another solution:
(pd.melt(df, id_vars='Age', value_vars=df.columns[1:])
.dropna()
.groupby('variable')
.agg({'Age' : np.min}))
Age
variable
Question 1 20
Question 2 10
Question 3 10