I have the following CSV file:
ABCD0011
ABCD1404
ABCD1255
There are many such rows in the CSV file which I want to convert as follows:
Input | Expected Output | Actual Output |
---|---|---|
ABCD0011 | ABCD_11_0 | ABCD_0011_0 |
ABCD1404 | ABCD_1404_0 | ABCD_144_0 |
ABCD1255 | ABCD_1255_0 | ABCD_1255_0 |
Basically, it takes the zeros after the letters and replace it with an underscore ("_").
Code
import numpy as np
import pandas as pd
df = pd.read_csv('Book1.csv')
df.A = df.A.str.replace('[0-9] ', '') '_' df.A.str.replace('([A-Z]) ', '') '_0'
Actual Output and Issues
I got the values that are without leading zeros correctly converted like
- from
ABCD1255
toABCD_1255_0
.
But for values with leading zeros it failed, example:
- from
ABCD0011
toABCD_0011_0
. Did not change the format.
Even for values with zeros inside it failed, like
- from
ABCD1404
toABCD_144_0
. It deleted the zero in the middle.
Question
How can I fix this issue?
CodePudding user response:
If we know the input strings will always be eight characters, with the first four being letter and the second set of four being a number, we could:
>>> s = "ABCD0011"
>>> f"{s[:4]}_{int(s[4:])}_0"
'ABCD_11_0'
If we don't know the lengths for sure, we can use re.sub
with a lambda to transform two different matching groups.
>>> import re
>>> re.sub(r'([a-zA-Z] )(\d )', lambda m: f"{m.group(1)}_{int(m.group(2))}_0", s)
'ABCD_11_0'
>>> re.sub(r'([a-zA-Z] )(\d )', lambda m: f"{m.group(1)}_{int(m.group(2))}_0", 'A709')
'A_709_0'
CodePudding user response:
Ignoring the apparent requirement for a dataframe, this is how you could parse the file and generate the strings you need. Uses re. Does not use numpy and/or pandas
import re
FILENAME = 'well.csv'
PATTERN = re.compile(r'^([a-zA-Z] )(\d )$')
with open(FILENAME) as csv_data:
next(csv_data) # skip header(s)
for line in csv_data:
if m := PATTERN.search(line):
print(f'{m.group(1)}_{int(m.group(2))}_0')
This will work for the data shown in the question. Other data structures may cause this to fail.
Output:
ABCD_11_0
ABCD_1404_0
ABCD_1255_0