Been trying to manipulate some data with pandas, but not having any luck.
I have a column of data, which is quite messy. It contains codes. Each cell only has a maximum of two codes (I think).
Some of these codes are split with a /
but the second part only contains that last part of the code after the last dash (-
) (e.g. k500-899-200 / 300 represents two codes: k500-899-200 and k500-899-300)
Some only have one code. Some have codes separated with a &
, others have codes separates with a #
. Some cells are just blank.
I would likes a new column where codes are fully written out, separated with a ;
if there are two codes instead of one.
What I have:
colA
JA01-03-AP-00075 / 00014
JA01-03-AP-00065 / 00013
JA01-03-AP-00089
JA01-05-AP-00089 & JA01-03-AP-08894
JA09-08-BC-88873#JA09-08-BC-88845
JA09-08-BC-88123#
dummy
[blank]
unset
What I want
ColA ColB
JA01-03-AP-00075 / 00014 JA01-03-AP-00075 ; JA01-03-AP-00014
JA01-03-AP-00065 / 00013 JA01-03-AP-00065 ; JA01-03-AP-00013
JA01-03-AP-00089 JA01-03-AP-00089
JA01-05-AP-00089 & JA01-03-AP-08894 JA01-05-AP-00089 ; JA01-03-AP-08894
JA09-08-BC-88873 # JA09-08-BC-88845 JA09-08-BC-88873 ; JA09-08-BC-88845
JA09-08-BC-88123# JA09-08-BC-88123
dummy dummy
[blank] [blank]
unset unset
So far, I've focused on trying to split by /
. My codes is a bit like this, although it doesn't actually work, as I tend to get an "operands cannot be broadcast together" error. I haven't worked out why.
But I think I'm overcomplicating it a bit.
def split_code(code):
split_code = code.split('/')
return split_code if len(split_code) == 2 else ['', split_code[0]]
df[['tempCol1', 'tempCol2']] = pd.DataFrame(df['ColA'].apply(split_code).tolist())
df['ColB'] = np.where(df['ColA'].str.contains('/'),
df['tempCol1'].str.rsplit('-', 1)[0] df['tempCol2'],
df['ColA'])
CodePudding user response:
This works, at least per your expected output:
regex = re.compile(r'\s*[/&#]\s*')
def p(col):
parts = regex.split(col.strip('&#;/'))
if len(parts) > 1:
if parts[1].count('-') <= 1: # short code
parts[1] = f'{"-".join(parts[0].split("-")[0:-1])}-{parts[1]}'
return ' ; '.join(parts)
df['ColB'] = df['ColA'].transform(p)
CodePudding user response:
# match prefix-suffix1 / suffix2
pat = r'(?P<prefix>. )-(?P<suffix1>\d )\s*\/\s*(?P<suffix2>\d )'
# m is a re.Match object
# prefix-suffix1 / suffix2 => prefix-suffix1 ; prefix-suffix2
def repl_suffix(m):
return f"{m['prefix']}-{m['suffix1']} ; {m['prefix']}-{m['suffix2']}"
df['colB'] = (
df['colA'].str.replace(pat, repl_suffix)
.str.rstrip(r';#&') # strip ';','#','&' from single codes
.str.replace(r'\s*(#|&)\s*',' ; ') # '#' and '&' => ';'
)
>>> df
colA colB
0 JA01-03-AP-00075 / 00014 JA01-03-AP-00075 ; JA01-03-AP-00014
1 JA01-03-AP-00065 / 00013 JA01-03-AP-00065 ; JA01-03-AP-00013
2 JA01-03-AP-00089 JA01-03-AP-00089
3 JA01-05-AP-00089 & JA01-03-AP-08894 JA01-05-AP-00089 ; JA01-03-AP-08894
4 JA09-08-BC-88873#JA09-08-BC-88845 JA09-08-BC-88873 ; JA09-08-BC-88845
5 JA09-08-BC-88123# JA09-08-BC-88123
EDIT: I'm assuming that the suffixes are always numbers, but you can easily generalize for any suffix.