Merge multiple time columns into a single column using wide_to_long


I have a dataframe that has multiple time columns and an assigned value.

import pandas as pd
import numpy as np

df = pd.DataFrame({'time': {0: 0.2, 1: 0.3, 2: 0.4, 3: np.nan}, 'val1': {0: 1.0, 1: 3.0, 2: 1.0, 3: np.nan}, 'time2': {0: 0.1, 1: 0.4, 2: 0.8, 3: 1.0}, 'val2': {0: 2, 1: 2, 2: 9, 3: 2}})

Which looks like this:

   time  val1  time2  val2
0   0.2   1.0    0.1     2
1   0.3   3.0    0.4     2
2   0.4   1.0    0.8     9
3   NaN   NaN    1.0     2

There can be many more time and value columns (but they always come in pairs). I would like to merge all the time columns into ONE column, while keeping the vals and filling them in at their corresponding times.

Example output:

   time  val1   val2
0   0.1   1.0    2.0     
1   0.2   1.0    2.0     
2   0.3   3.0    2.0     
3   0.4   1.0    2.0   
4   0.8   1.0    9.0   
5   1.0   1.0    2.0     

I have asked this question previously, and one answer got very close! Answer and output below:

df1 = (pd.wide_to_long(df.rename(columns={'time':'time1'}).reset_index(),
                      'time', i='index', j='t')
        .sort_values(['time','val2'])
        .drop_duplicates('time')
        .dropna(subset=['time'])
        .reset_index(drop=True))

output:

   val1  val2  time
0   1.0     2   0.1
1   1.0     2   0.2
2   3.0     2   0.3
3   3.0     2   0.4 <- val1 incorrect
4   1.0     9   0.8
5   NaN     2   1.0

CodePudding user response:

IIUC, you can't achieve this with wide_to_long.

There is no canonical reshaping here: some time values appear in more than one time column (e.g. 0.4), so you have to choose which row supplies each value.

So, I guess you need to perform two merges and combine in the desired order:

# (time, val1) pairs outer-joined with the (time2, val2) pairs
m1 = (
 df[['time', 'val1']]
 .merge(df[['time2', 'val2']]
        .rename(columns={'time2': 'time'}),
        on='time', how='outer')
 .sort_values(by='time')
)

# the symmetric merge: (time, val2) pairs joined with the (time2, val1) pairs
m2 = (
 df[['time', 'val2']]
 .merge(df[['time2', 'val1']]
        .rename(columns={'time2': 'time'}),
        on='time', how='outer')
 .sort_values(by='time')
)

# m1 has priority: combine_first only fills its NaNs from m2, so at the
# duplicated time 0.4, val1 comes from the original 'time' column
out = m1.combine_first(m2).dropna(subset='time')

output:

   time  val1  val2
4   0.1   1.0   2.0
0   0.2   1.0   2.0
1   0.3   3.0   2.0
2   0.4   1.0   2.0
5   0.8   1.0   9.0
6   1.0   NaN   2.0
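
The rows already come out in time order; the scrambled index labels are just left over from the intermediate merges. If you also want a clean 0..n index like in your example output, drop it afterwards:

out = out.reset_index(drop=True)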

CodePudding user response:

Here is another simple approach: melt the data, sort so that time1 rows come before time2 rows, and in case of duplicate times take the first row for val1 and the last row for val2.

cols = ['val1', 'val2']
(df
 .rename(columns={'time': 'time1'})
 .melt(id_vars=cols, value_name='time')   # one row per (time column, original row)
 .dropna(subset='time')
 .sort_values(by=['time', 'variable'])    # time1 rows sort before time2 rows
 .groupby('time').agg({'val1': 'first', 'val2': 'last'})
 .reset_index()
)

output:

   time  val1  val2
0   0.1   1.0     2
1   0.2   1.0     2
2   0.3   3.0     2
3   0.4   1.0     2
4   0.8   1.0     9
5   1.0   NaN     2
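
If there can be more than two time/value pairs, neither 'first'/'last' nor a pair of merges scales directly. Below is a rough sketch (not from the original answer) of how the same melt idea could be generalized, assuming the columns follow a time{k}/val{k} naming convention once the first column is renamed to time1: for each val column, prefer the row whose time came from its own time column, and fall back to any other row with the same time.

import pandas as pd
import numpy as np

# sketch only: assumes pairs are named time1/val1, time2/val2, ...
val_cols = [c for c in df.columns if c.startswith('val')]

long = (
    df.rename(columns={'time': 'time1'})
      .melt(id_vars=val_cols, value_name='time')  # 'variable' records which time column each row came from
      .dropna(subset=['time'])
)

def pick(group):
    # prefer the row whose 'variable' matches the val column's own time column,
    # otherwise fall back to any non-null value observed at that time
    result = {}
    for col in val_cols:
        own = group.loc[group['variable'] == col.replace('val', 'time'), col]
        fallback = group[col].dropna()
        result[col] = (own.iloc[0] if not own.empty
                       else fallback.iloc[0] if not fallback.empty
                       else np.nan)
    return pd.Series(result)

out = (
    long.groupby('time')[['variable'] + val_cols]
        .apply(pick)
        .reset_index()
)

For the two-pair example this reproduces the output above; with more pairs it keeps the same "own time column wins" rule.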