I am trying to join multiple columns using pandas, where the separator is defined in another column. The problem I am facing is how to avoid adding the separator for cells that are blank.
My sample code and its output, which illustrate the problem, are given below:
import pandas as pd
df = pd.DataFrame({'col_1': ['', '1', '1', '2', '2', '3', '3', '4', '', '4', '5', '5', '5', '5', '5', '5'],
                   'col_2': ['A', '', 'C', 'A', '', 'C', 'D', 'D', 'A', 'A', 'B', 'E', 'F', 'G', 'H', 'I'],
                   'col_3': ['256', '546', '985', '573', '265', '731', '968', '592', '364', '', '953', '476', '835',
                             '', '572', '903'],
                   'col_4': ['.', '.', '.', '-', '_', '_', '-', '.', '.', '/', '/', '.', '_', '_', '-', '.']})
# Join the three columns, using the per-row separator stored in col_4
df['concatenated'] = df['col_1'] + df['col_4'] + df['col_2'] + df['col_4'] + df['col_3']
print(df)
The output I am getting is:
   col_1 col_2 col_3 col_4 concatenated
0            A   256     .       .A.256
1      1         546     .       1..546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _       2__265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .       .A.364
9      4     A           /         4/A/
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _         5_G_
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903
But the expected output is:
   col_1 col_2 col_3 col_4 concatenated
0            A   256     .        A.256
1      1         546     .        1.546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _        2_265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .        A.364
9      4     A           /          4/A
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _          5_G
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903
The actual data contains many more columns, but I need to join only selected columns.
Can anyone help me find a solution or point me in the right direction?
CodePudding user response:
You can use str.strip() to remove the extra separators at both ends and str.replace() to collapse repeated consecutive separators, as follows:
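import re
# Escape each distinct separator so regex metacharacters such as '.' match literally
sep = list(map(re.escape, df['col_4'].unique()))
sep_regex = '|'.join(sep)
# str.strip() takes a set of characters, not a regex, so pass the raw separators
sep_chars = ''.join(df['col_4'].unique())
df['concatenated'] = (df['concatenated'].str.strip(sep_chars)
                                        .str.replace(fr'({sep_regex})\1', r'\1', regex=True)
                     )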
Result:
print(df)
   col_1 col_2 col_3 col_4 concatenated
0            A   256     .        A.256
1      1         546     .        1.546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _        2_265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .        A.364
9      4     A           /          4/A
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _          5_G
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903
Explanation:
Here, we create a list of the unique symbols in col_4 and use re.escape to escape any of them that are regex metacharacters.
print(sep)
['\\.', '\\-', '_', '/']
To match any of these characters in str.replace(), we then build a regex that lists the escaped separators above as alternatives, joined with | (that is, "or"). Note that str.strip() treats its argument as a set of characters to remove rather than as a regex.
print(sep_regex)
\.|\-|_|/
We use the regex back-reference \1 to detect a separator that is immediately followed by the same separator, and replace each such pair with a single occurrence.
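To see the two steps in isolation, here is a minimal sketch on a made-up three-value Series (the name s and its values are illustrative only, not taken from the question's data). It assumes a single '.' separator.

```python
import re
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the two steps
s = pd.Series(['.A.256', '1..546', '4.A.'])
sep = re.escape('.')  # escaped so the dot is matched literally in the regex

# Step 1: strip() removes leading/trailing '.' characters (a character set, not a regex)
stripped = s.str.strip('.')
# Step 2: (sep)\1 matches a separator immediately followed by the same separator
cleaned = stripped.str.replace(fr'({sep})\1', r'\1', regex=True)
print(cleaned.tolist())
```

Because the pattern collapses pairs, a run of three or more identical separators would need a quantifier instead, but with three joined columns at most two can be adjacent once the ends are stripped.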
CodePudding user response:
A solution, assuming df does not already have the concatenated column (so that col_4 is the last column):
df['concatenated'] = df.apply(lambda row: row.iloc[-1].join([x for x in row.iloc[:-1] if x != '']), axis=1)
This applies a function to each row: it takes the last element as the separator and calls str.join on it, joining all the other elements that are not equal to "".
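Since the question mentions that the real data has more columns and only some should be joined, the same idea can be sketched with explicit column names instead of positions. The helper name join_nonblank, the demo frame, and the extra column other are hypothetical, introduced only for illustration.

```python
import pandas as pd

def join_nonblank(df, cols, sep_col):
    """Join the non-blank values of `cols` in each row, using `sep_col` as the separator."""
    return df.apply(
        lambda row: row[sep_col].join(v for v in row[cols] if v != ''),
        axis=1,
    )

# Hypothetical frame with an extra column that should be ignored
demo = pd.DataFrame({
    'col_1': ['', '1', '4'],
    'col_2': ['A', '', 'A'],
    'col_3': ['256', '546', ''],
    'col_4': ['.', '.', '/'],
    'other': ['x', 'y', 'z'],
})
demo['concatenated'] = join_nonblank(demo, ['col_1', 'col_2', 'col_3'], 'col_4')
print(demo['concatenated'].tolist())
```

Selecting by name means the result does not depend on column order or on whether a concatenated column already exists.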
CodePudding user response:
This solution might work for you:
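df['concat'] = ''
for index, values in df.iterrows():
    # Drop blank cells (including the still-empty 'concat' cell).
    # This assumes col_4 is the last non-blank column, so vals[-1] is the separator.
    vals = [x for x in values if x != '']
    df.loc[index, 'concat'] = vals[-1].join(vals[:-1])
df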
Result:
   col_1 col_2 col_3 col_4   concat
0            A   256     .    A.256
1      1         546     .    1.546
2      1     C   985     .  1.C.985
3      2     A   573     -  2-A-573
4      2         265     _    2_265
5      3     C   731     _  3_C_731
6      3     D   968     -  3-D-968
7      4     D   592     .  4.D.592
8            A   364     .    A.364
9      4     A           /      4/A
10     5     B   953     /  5/B/953
11     5     E   476     .  5.E.476
12     5     F   835     _  5_F_835
13     5     G           _      5_G
14     5     H   572     -  5-H-572
15     5     I   903     .  5.I.903