Home > Software design >  Conver string from `ABCD0011` to `ABCD_11`
Conver string from `ABCD0011` to `ABCD_11`

Time:01-04

I have the following CSV file:

ABCD0011
ABCD1404
ABCD1255

There are many such rows in the CSV file which I want to convert as follows:

Input Expected Output Actual Output
ABCD0011 ABCD_11_0 ABCD_0011_0
ABCD1404 ABCD_1404_0 ABCD_144_0
ABCD1255 ABCD_1255_0 ABCD_1255_0

Basically, it takes the zeros after the letters and replace it with an underscore ("_").

Code

import numpy as np    
import pandas as pd    
df = pd.read_csv('Book1.csv')    
df.A = df.A.str.replace('[0-9] ', '') '_' df.A.str.replace('([A-Z]) ', '') '_0'

Actual Output and Issues

I got the values that are without leading zeros correctly converted like

  • from ABCD1255 to ABCD_1255_0.

But for values with leading zeros it failed, example:

  • from ABCD0011 to ABCD_0011_0. Did not change the format.

Even for values with zeros inside it failed, like

  • from ABCD1404 to ABCD_144_0. It deleted the zero in the middle.

Question

How can I fix this issue?

CodePudding user response:

If we know the input strings will always be eight characters, with the first four being letter and the second set of four being a number, we could:

>>> s = "ABCD0011"
>>> f"{s[:4]}_{int(s[4:])}_0"
'ABCD_11_0'

If we don't know the lengths for sure, we can use re.sub with a lambda to transform two different matching groups.

>>> import re
>>> re.sub(r'([a-zA-Z] )(\d )', lambda m: f"{m.group(1)}_{int(m.group(2))}_0", s)
'ABCD_11_0'
>>> re.sub(r'([a-zA-Z] )(\d )', lambda m: f"{m.group(1)}_{int(m.group(2))}_0", 'A709')
'A_709_0'

CodePudding user response:

Ignoring the apparent requirement for a dataframe, this is how you could parse the file and generate the strings you need. Uses re. Does not use numpy and/or pandas

import re

FILENAME = 'well.csv'
PATTERN = re.compile(r'^([a-zA-Z] )(\d )$')

with open(FILENAME) as csv_data:
    next(csv_data) # skip header(s)
    for line in csv_data:
        if m := PATTERN.search(line):
            print(f'{m.group(1)}_{int(m.group(2))}_0')

This will work for the data shown in the question. Other data structures may cause this to fail.

Output:

ABCD_11_0
ABCD_1404_0
ABCD_1255_0
  • Related