I have dataframe df:
ID li
ld1 [1205369-1205491, 1206384-1206570]
ld2 [111-112,113-114,117-119]
ld3 [444-445,765-785,777-779]
ld5 [1203843-1203967, 1204033-1204235, 1204398-1204485]
ld28 [666-777,756-788,896-909]
ld37 [999-1000,1001-1111,1112-1119,1234-1278,1999-2007]
I want to split this column of lists (li) into two columns start and end positions of pairs where the start column will have all start positions and the end column will have all end positions separated by','
Desired result:
ID start end
ld1 1205369,1206384 1205491,1206570
ld2 111,113,117 112,114,119
ld3 444,765,777 445,785,779
ld5 1203843,1204033,1204398 1203967,1204235,1204485
ld28 666,756,896 777,788,909
ld37 999,1001,1112,1234,1999 1000,1111,1119,1278,2007
CodePudding user response:
Another way using .explode
and .str.split
followed by join
assuming =
was a typo? but we can handle it with a str.replace
df1 = df.join(df['li'].str.replace('=','-').str.split(',')\
.explode().str.split('-',expand=True)\
.groupby(level=0).agg(','.join)\
.rename(columns={0 : 'start', 1 : 'end'})).drop('li',axis=1)
print(df1)
ID start end
0 ld1 1205369, 1206384 1205491,1206570
1 ld2 111,113,117 112,114,119
2 ld3 444,765,777 445,785,779
3 ld5 1203843, 1204033, 1204398 1203967,1204235,1204485
4 ld28 666,756,896 777,788,909
5 ld37 999,1001,1112,1234,1999 1000,1111,1119,1278,2007
CodePudding user response:
Use DataFrame.explode
with split
and then join back is first idea:
df[['start', 'end']] = (df.pop('li')
.explode()
.str.split('-', expand=True)
.groupby(level=0)
.agg(','.join))
print (df)
ID start end
0 ld1 1205369,1206384 1205491,1206570
1 ld2 111,113,117 112,114,119
2 ld3 444,765,777 445,785,779
3 ld5 1203843,1204033,1204398 1203967,1204235,1204485
4 ld28 666,756,896 777,788,909
5 ld37 999,1001,1112,1234,1999 1000,1111,1119,1278,2007
Or use list comprehension with split and join
with zip
:
L = [[','.join(z) for z in zip(*[y.split('-') for y in x])] for x in df.pop('li')]
df = df.join(pd.DataFrame(L, columns=['start', 'end'], index=df.index))
print (df)
ID start end
0 ld1 1205369,1206384 1205491,1206570
1 ld2 111,113,117 112,114,119
2 ld3 444,765,777 445,785,779
3 ld5 1203843,1204033,1204398 1203967,1204235,1204485
4 ld28 666,756,896 777,788,909
5 ld37 999,1001,1112,1234,1999 1000,1111,1119,1278,2007
CodePudding user response:
You can use a list comprehension:
import re
df2 = pd.DataFrame([list(map(list, zip(*(re.split(r'[-=]', x) for x in l))))
for l in df['li']],
columns=['start', 'end'], index=df.index)
df.drop(columns='li').join(df2)
Output:
ID start end
0 ld1 [1205369, 1206384] [1205491, 1206570]
1 ld2 [111, 113, 117] [112, 114, 119]
2 ld3 [444, 765, 777] [445, 785, 779]
3 ld5 [1203843, 1204033, 1204398] [1203967, 1204235, 1204485]
4 ld28 [666, 756, 896] [777, 788, 909]
5 ld37 [999, 1001, 1112, 1234, 1999] [1000, 1111, 1119, 1278, 2007]
For in place modification:
import re
df[['start', 'end']] = [list(map(list, zip(*(re.split(r'[-=]', x) for x in l))))
for l in df.pop('li')]