I have a panda series/column where values are like -
Values |
---|
101;1001 |
130;125 |
113;99 |
1001;101 |
I need to sort the values within the cell with an expected outcome like below using python as the dataframe is large with more than 5 million values so any faster way would be appreciated.
Values |
---|
101;1001 |
125;130 |
99;113 |
101;1001 |
CodePudding user response:
Here's one way using str.split
explode
sort_values
to sort the values; then groupby.agg
to join them back.
Note that I assumed no integer is more than 10 digits long, so I used str.zfill(10)
to pad '0's for sorting.
df['Values_new'] = df['Values'].str.split(';').explode().sort_values(key=lambda x: x.str.zfill(10)).groupby(level=0).agg(';'.join)
Output:
Values Values_new
0 101;1001 101;1001
1 130;125 125;130
2 113;99 99;113
3 1001;101 101;1001
CodePudding user response:
Convert splitted values to integers, sorting, convert back to strings and join:
df['Values'] = df['Values'].apply(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
Performance:
#10k rows
df = pd.concat([df] * 10000, ignore_index=True)
#enke solution
In [52]: %timeit df['Values'].str.split(';').explode().sort_values(key=lambda x: x.str.zfill(10)).groupby(level=0).agg(';'.join)
616 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [53]: %timeit df['Values'].apply(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
70.7 ms ± 420 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution for 2 columns:
df1 = df['Values'].str.split(';', expand=True).astype(int)
df1 = pd.DataFrame(np.sort(df1, axis=1), index=df1.index, columns=df1.columns)
print (df1)
0 1
0 101 1001
1 125 130
2 99 113
3 101 1001
CodePudding user response:
You can use a list comprehension, this will be faster on small datasets:
df['Values'] = [';'.join(map(str, sorted(map(int, x.split(';'))))) for x in df['Values']]
output:
Values
0 101;1001
1 125;130
2 99;113
3 101;1001
timings:
For two columns:
df2 = pd.DataFrame([sorted(map(int, x.split(';'))) for x in df['Values']])
output:
0 1
0 101 1001
1 125 130
2 99 113
3 101 1001