Given this DataFrame:
data = [['Tom', 'run', '2022-01-26', 'run', '2027-01-26'], ['Max', 'stop', '2020-11-16', 'run', '2022-04-26'], ['Bob', 'run', '2021-10-03', 'stop', '2022-01-26'], ['Ben', 'run', '2020-03-11', 'stop', '2013-01-26'], ['Eva', 'stop', '2017-11-16', 'run', '2015-01-26']]
df = pd.DataFrame(data, columns=['Patient', '1_Action', '1_Time', '2_Action', '2_Time'])
Patient 1_Action 1_Time 2_Action 2_Time
0 Tom run 2022-01-26 run 2027-01-26
1 Max stop 2020-11-16 run 2022-04-26
2 Bob run 2021-10-03 stop 2022-01-26
3 Ben run 2020-03-11 stop 2013-01-26
4 Eva stop 2017-11-16 run 2015-01-26
I want it to look like:
Patient action time
0 Tom run 2022-01-26
1 Max stop 2020-11-16
2 Bob run 2021-10-03
3 Ben run 2020-03-11
4 Eva stop 2017-11-16
5 Tom run 2027-01-26
6 Max run 2022-04-26
7 Bob stop 2022-01-26
8 Ben stop 2013-01-26
9 Eva run 2015-01-26
There is a really good answer from @BeRT2me, including the wide_to_long format:
df = pd.wide_to_long(df,
stubnames=['action', 'time'],
i='patient',
j='num',
sep='_').reset_index()
But my column names don't match the solution, they should be: action_1, but they are 1_Action.
The Error message I get is: Value error: the id variables need to be uniquely identify each row.
Is there a way to rename my column names, that they are matching the solution, (Dataframe has about 45 columns so the manual renaming is not an option), or can I change the solution code to work with my given column names?
CodePudding user response:
We need to edit the column name in order to pass them as stubnames
in wide_to_long
df.columns = df.columns.str.split('_').map(lambda x : '_'.join(x[::-1]))
#Index(['Patient', 'Action_1', 'Time_1', 'Action_2', 'Time_2'], dtype='object')
df = pd.wide_to_long(df,
stubnames=['Action', 'Time'],
i='Patient',
j='num',
sep='_',suffix = '\w ').reset_index()
CodePudding user response:
one option to transform the second dataframe is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(index = 'Patient',
names_to = '.value',
names_pattern = r"\d_(. )")
Patient Action Time
0 Tom run 2022-01-26
1 Max stop 2020-11-16
2 Bob run 2021-10-03
3 Ben run 2020-03-11
4 Eva stop 2017-11-16
5 Tom run 2027-01-26
6 Max run 2022-04-26
7 Bob stop 2022-01-26
8 Ben stop 2013-01-26
9 Eva run 2015-01-26
The .value
determines which parts of the columns remain as headers. .value
is aligned with the parenthesised group in the regex
You could use names_sep
instead of names_pattern
:
(df
.pivot_longer(
index = 'Patient',
names_to = ('num', '.value'),
names_sep='_')
.drop(columns='num')
)
Patient Action Time
0 Tom run 2022-01-26
1 Max stop 2020-11-16
2 Bob run 2021-10-03
3 Ben run 2020-03-11
4 Eva stop 2017-11-16
5 Tom run 2027-01-26
6 Max run 2022-04-26
7 Bob stop 2022-01-26
8 Ben stop 2013-01-26
9 Eva run 2015-01-26
Another option is via pd.stack
, which is equally efficient, and usually more efficient than pd.wide_to_long
:
temp = df.set_index('Patient')
temp.columns = temp.columns.str.split('_', expand = True)
temp.stack(0).droplevel(-1).reset_index()
Patient Action Time
0 Tom run 2022-01-26
1 Tom run 2027-01-26
2 Max stop 2020-11-16
3 Max run 2022-04-26
4 Bob run 2021-10-03
5 Bob stop 2022-01-26
6 Ben run 2020-03-11
7 Ben stop 2013-01-26
8 Eva stop 2017-11-16
9 Eva run 2015-01-26