unpacking a column of a list of pairs into two columns

I have dataframe df:

ID   li
ld1  [1205369-1205491, 1206384-1206570]
ld2  [111-112,113-114,117-119]
ld3  [444-445,765-785,777-779]
ld5  [1203843-1203967, 1204033-1204235, 1204398-1204485]
ld28 [666-777,756-788,896-909]
ld37 [999-1000,1001-1111,1112-1119,1234-1278,1999-2007]

I want to split this column of lists (li) into two columns start and end positions of pairs where the start column will have all start positions and the end column will have all end positions separated by','

Desired result:

ID     start                    end
ld1   1205369,1206384           1205491,1206570
ld2   111,113,117               112,114,119
ld3   444,765,777               445,785,779
ld5   1203843,1204033,1204398   1203967,1204235,1204485
ld28  666,756,896               777,788,909
ld37  999,1001,1112,1234,1999   1000,1111,1119,1278,2007

CodePudding user response：

Another way using .explode and .str.split followed by join

assuming = was a typo? but we can handle it with a str.replace

df1 = df.join(df['li'].str.replace('=','-').str.split(',')\
             .explode().str.split('-',expand=True)\
             .groupby(level=0).agg(','.join)\
             .rename(columns={0 : 'start', 1 : 'end'})).drop('li',axis=1)


print(df1)

     ID                      start                       end
0   ld1           1205369, 1206384           1205491,1206570
1   ld2                111,113,117               112,114,119
2   ld3                444,765,777               445,785,779
3   ld5  1203843, 1204033, 1204398   1203967,1204235,1204485
4  ld28                666,756,896               777,788,909
5  ld37    999,1001,1112,1234,1999  1000,1111,1119,1278,2007

CodePudding user response：

Use DataFrame.explode with split and then join back is first idea:

df[['start', 'end']] = (df.pop('li')
                          .explode()
                          .str.split('-', expand=True)
                          .groupby(level=0)
                          .agg(','.join))
print (df)
     ID                    start                       end
0   ld1          1205369,1206384           1205491,1206570
1   ld2              111,113,117               112,114,119
2   ld3              444,765,777               445,785,779
3   ld5  1203843,1204033,1204398   1203967,1204235,1204485
4  ld28              666,756,896               777,788,909
5  ld37  999,1001,1112,1234,1999  1000,1111,1119,1278,2007

Or use list comprehension with split and join with zip:

L = [[','.join(z) for z in zip(*[y.split('-') for y in x])] for x in df.pop('li')]
df = df.join(pd.DataFrame(L, columns=['start', 'end'], index=df.index))
print (df)
     ID                    start                       end
0   ld1          1205369,1206384           1205491,1206570
1   ld2              111,113,117               112,114,119
2   ld3              444,765,777               445,785,779
3   ld5  1203843,1204033,1204398   1203967,1204235,1204485
4  ld28              666,756,896               777,788,909
5  ld37  999,1001,1112,1234,1999  1000,1111,1119,1278,2007

CodePudding user response：

You can use a list comprehension:

import re

df2 = pd.DataFrame([list(map(list, zip(*(re.split(r'[-=]', x) for x in l))))
                    for l in df['li']],
                   columns=['start', 'end'], index=df.index)

df.drop(columns='li').join(df2)

Output:

     ID                          start                             end
0   ld1            [1205369,  1206384]              [1205491, 1206570]
1   ld2                [111, 113, 117]                 [112, 114, 119]
2   ld3                [444, 765, 777]                 [445, 785, 779]
3   ld5  [1203843,  1204033,  1204398]     [1203967, 1204235, 1204485]
4  ld28                [666, 756, 896]                 [777, 788, 909]
5  ld37  [999, 1001, 1112, 1234, 1999]  [1000, 1111, 1119, 1278, 2007]

For in place modification:

import re

df[['start', 'end']] = [list(map(list, zip(*(re.split(r'[-=]', x) for x in l))))
                        for l in df.pop('li')]