Pandas splitting values by a single character and concatenating part of first split with second wher-CodePudding

Been trying to manipulate some data with pandas, but not having any luck.

I have a column of data, which is quite messy. It contains codes. Each cell only has a maximum of two codes (I think).

Some of these codes are split with a / but the second part only contains that last part of the code after the last dash (-) (e.g. k500-899-200 / 300 represents two codes: k500-899-200 and k500-899-300)

Some only have one code. Some have codes separated with a &, others have codes separates with a #. Some cells are just blank.

I would likes a new column where codes are fully written out, separated with a ; if there are two codes instead of one.

What I have:

colA                                 

JA01-03-AP-00075 / 00014
JA01-03-AP-00065 / 00013
JA01-03-AP-00089
JA01-05-AP-00089 & JA01-03-AP-08894
JA09-08-BC-88873#JA09-08-BC-88845
JA09-08-BC-88123#
dummy
[blank]
unset

What I want

ColA                                        ColB

JA01-03-AP-00075 / 00014                   JA01-03-AP-00075 ; JA01-03-AP-00014
JA01-03-AP-00065 / 00013                   JA01-03-AP-00065 ; JA01-03-AP-00013
JA01-03-AP-00089                           JA01-03-AP-00089
JA01-05-AP-00089 & JA01-03-AP-08894        JA01-05-AP-00089 ; JA01-03-AP-08894 
JA09-08-BC-88873 # JA09-08-BC-88845        JA09-08-BC-88873 ; JA09-08-BC-88845  
JA09-08-BC-88123#                          JA09-08-BC-88123
dummy                                      dummy
[blank]                                    [blank]
unset                                      unset

So far, I've focused on trying to split by /. My codes is a bit like this, although it doesn't actually work, as I tend to get an "operands cannot be broadcast together" error. I haven't worked out why.

But I think I'm overcomplicating it a bit.

def split_code(code): 
    split_code = code.split('/') 
    return split_code if len(split_code) == 2 else ['', split_code[0]]

df[['tempCol1', 'tempCol2']] = pd.DataFrame(df['ColA'].apply(split_code).tolist())

df['ColB'] = np.where(df['ColA'].str.contains('/'),
                      df['tempCol1'].str.rsplit('-', 1)[0]   df['tempCol2'],
                      df['ColA'])

CodePudding user response：

This works, at least per your expected output:

regex = re.compile(r'\s*[/&#]\s*')

def p(col):
    parts = regex.split(col.strip('&#;/'))
    if len(parts) > 1:
        if parts[1].count('-') <= 1: # short code
            parts[1] = f'{"-".join(parts[0].split("-")[0:-1])}-{parts[1]}'
    return ' ; '.join(parts)

df['ColB'] = df['ColA'].transform(p)

CodePudding user response：

# match prefix-suffix1 / suffix2 
pat = r'(?P<prefix>. )-(?P<suffix1>\d )\s*\/\s*(?P<suffix2>\d )'

# m is a re.Match object
# prefix-suffix1 / suffix2  => prefix-suffix1 ; prefix-suffix2 
def repl_suffix(m):
    return f"{m['prefix']}-{m['suffix1']} ; {m['prefix']}-{m['suffix2']}"


df['colB'] = (
    df['colA'].str.replace(pat, repl_suffix)   
              .str.rstrip(r';#&')   # strip ';','#','&' from single codes 
              .str.replace(r'\s*(#|&)\s*',' ; ') # '#' and '&' => ';'
)

>>> df

                                  colA                                 colB
0             JA01-03-AP-00075 / 00014  JA01-03-AP-00075 ; JA01-03-AP-00014
1             JA01-03-AP-00065 / 00013  JA01-03-AP-00065 ; JA01-03-AP-00013
2                     JA01-03-AP-00089                     JA01-03-AP-00089
3  JA01-05-AP-00089 & JA01-03-AP-08894  JA01-05-AP-00089 ; JA01-03-AP-08894
4    JA09-08-BC-88873#JA09-08-BC-88845  JA09-08-BC-88873 ; JA09-08-BC-88845
5                    JA09-08-BC-88123#                     JA09-08-BC-88123

EDIT: I'm assuming that the suffixes are always numbers, but you can easily generalize for any suffix.