Pandas: Concatenate multiple columns using another separator column and avoid extra separators for b

I am trying to join multiple columns using pandas, where the separator is defined in another column. The problem I am facing is how to avoid the separator for cells that are blank.

My attempt, and the output that shows the problem, are given below:

import pandas as pd
df = pd.DataFrame({'col_1': ['', '1', '1', '2', '2', '3', '3', '4', '', '4', '5', '5', '5', '5', '5', '5'],
                   'col_2': ['A', '', 'C', 'A', '', 'C', 'D', 'D', 'A', 'A', 'B', 'E', 'F', 'G', 'H', 'I'],
                   'col_3': ['256', '546', '985', '573', '265', '731', '968', '592', '364', '', '953', '476', '835',
                             '', '572', '903'],
                   'col_4': ['.', '.', '.', '-', '_', '_', '-', '.', '.', '/', '/', '.', '_', '_', '-', '.']})

df['concatenated'] = df['col_1'] + df['col_4'] + df['col_2'] + df['col_4'] + df['col_3']
print(df)

The output I am getting is:

     col_1 col_2 col_3 col_4    concatenated
0            A   256     .       .A.256
1      1         546     .       1..546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _       2__265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .       .A.364
9      4     A           /         4/A/
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _         5_G_
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

But the expected output is:

     col_1 col_2 col_3 col_4   concatenated
0            A   256     .      A.256
1      1         546     .      1.546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _      2_265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .      A.364
9      4     A           /      4/A
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _      5_G
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

The actual data contains many more columns, but I need to join only selected columns.

Can anyone help me find a solution or point me in the right direction?

CodePudding user response：

You can use str.strip() to remove the extra separators at both ends and also str.replace() to remove repeated consecutive separators, as follows:

import re
sep = list(map(re.escape, df['col_4'].unique()))
sep_regex = '|'.join(sep)

df['concatenated'] = (df['concatenated'].str.strip(sep_regex)
                                        .str.replace(fr'({sep_regex})\1', r'\1', regex=True)
                     )

Result:

print(df)

   col_1 col_2 col_3 col_4 concatenated
0            A   256     .        A.256
1      1         546     .        1.546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _        2_265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .        A.364
9      4     A           /          4/A
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _          5_G
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

Explanation:

Here, we created a list of the unique symbols in col_4 and used re.escape to escape any that are regex meta-characters.

print(sep)

['\\.', '\\-', '_', '/']

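str.replace() needs a regular expression, so we join the escaped separators with | (regex "or") to list them as alternatives:

print(sep_regex)

\.|\-|_|/

Note that str.strip() treats its argument as a set of characters rather than as a regex. Passing sep_regex works here because the string contains every separator character, but it would also strip any leading or trailing backslash or | from the data.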

The back-reference \1 matches a separator that is immediately followed by the same separator, and the doubled separator is replaced with a single occurrence.
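As a self-contained check of the steps above, here is a minimal sketch on four rows taken from the question's data. It departs from the answer in two small ways, both my own assumptions rather than part of the original: strip_chars (a hypothetical name) passes the raw separator characters to str.strip(), and the pattern uses \1+ so that runs of three or more separators also collapse.

```python
import re
import pandas as pd

# Rows 0, 1, 9 and 2 of the question's sample data.
df = pd.DataFrame({'col_1': ['', '1', '4', '1'],
                   'col_2': ['A', '', 'A', 'C'],
                   'col_3': ['256', '546', '', '985'],
                   'col_4': ['.', '.', '/', '.']})

raw = df['col_1'] + df['col_4'] + df['col_2'] + df['col_4'] + df['col_3']

seps = df['col_4'].unique()
# str.strip() expects a set of characters, so join the raw separators.
strip_chars = ''.join(seps)
# str.replace(..., regex=True) expects a pattern, so escape and join with "|".
sep_regex = '|'.join(map(re.escape, seps))

cleaned = (raw.str.strip(strip_chars)
              .str.replace(fr'({sep_regex})\1+', r'\1', regex=True))
print(cleaned.tolist())
```

Keep in mind that this approach edits the joined string, so a value that itself starts, ends, or contains a doubled separator character would be altered too.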

CodePudding user response：

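Solution (assuming df does not yet have a concatenated column, so that col_4 is the last column):

df['concatenated'] = df.apply(
    lambda row: row.iloc[-1].join([x for x in row.iloc[:-1] if x != '']), axis=1)

This applies a function to each row. The function takes the last element of the row as the separator and joins all the other elements that are not equal to "". Using .iloc makes the positional lookup explicit, since row[-1] on a label-indexed Series is deprecated in recent pandas versions.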
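The question mentions that the real data has more columns and only some of them should be joined. A minimal sketch of the same row-wise idea for that case follows. The names join_cols, sep_col and join_non_blank, and the extra id column, are hypothetical and only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [10, 11, 12],            # extra column that must be ignored
                   'col_1': ['', '1', '4'],
                   'col_2': ['A', '', 'A'],
                   'col_3': ['256', '546', ''],
                   'col_4': ['.', '.', '/']})

join_cols = ['col_1', 'col_2', 'col_3']  # hypothetical: columns to join, in order
sep_col = 'col_4'                        # hypothetical: column holding the separator

def join_non_blank(row):
    # Keep only the non-blank cells among the selected columns.
    parts = [x for x in row[join_cols] if x != '']
    # Join them with this row's own separator.
    return row[sep_col].join(parts)

df['concatenated'] = df.apply(join_non_blank, axis=1)
print(df['concatenated'].tolist())
```

Selecting columns by name avoids relying on the separator being the last column, so the result does not change if columns are added or reordered.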

CodePudding user response：

This solution might work for you:

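df['concat'] = ''
for index, values in df.iterrows():
    # Drop blank cells; this also drops the still-empty 'concat' cell.
    vals = [x for x in values if x != '']
    # The last remaining value is the separator from col_4.
    df.loc[index, 'concat'] = vals[-1].join(vals[:-1])
df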

Result:

   col_1 col_2 col_3 col_4   concat
0            A   256     .    A.256
1      1         546     .    1.546
2      1     C   985     .  1.C.985
3      2     A   573     -  2-A-573
4      2         265     _    2_265
5      3     C   731     _  3_C_731
6      3     D   968     -  3-D-968
7      4     D   592     .  4.D.592
8            A   364     .    A.364
9      4     A           /      4/A
10     5     B   953     /  5/B/953
11     5     E   476     .  5.E.476
12     5     F   835     _  5_F_835
13     5     G           _      5_G
14     5     H   572     -  5-H-572
15     5     I   903     .  5.I.903
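Row-by-row loops with iterrows() and df.loc assignments tend to be slow on large frames. As a possible alternative, here is a sketch that builds the same column with a list comprehension over zip. It assumes the column names from the question and that the separator cell is never blank.

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['', '1', '4', '5'],
                   'col_2': ['A', '', 'A', 'B'],
                   'col_3': ['256', '546', '', '953'],
                   'col_4': ['.', '_', '/', '/']})

# The first zipped column is the separator; the rest are the values to join.
df['concat'] = [
    sep.join(v for v in vals if v != '')
    for sep, *vals in zip(df['col_4'], df['col_1'], df['col_2'], df['col_3'])
]
print(df['concat'].tolist())
```

Because the columns are named explicitly, other columns in the frame are left out of the join.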