I have a dataframe with a column that has some already formatted strings and other values without formatting, how to put everything in the same format?
Data example:
| Data |
| -------- |
| 3101-2/00 |
| 7319-0/02 |
| 8800-6/00 |
| 4772500 |
| 7410299 |
This is the correct format: 3101-2/00
How to place the ones that are not formatted correctly and keep the ones that are formatted?
To format I would use this code below. But how to apply it to the entire column and ignore the ones that are already formatted? Maybe by the len of the string?
data= 7410299
cnae = data[:4] "-" data[4:5] "/" data[5:]
print(data)
>> 7410-2/99
CodePudding user response:
What about unformating all the column, then re-format it.
lis = ["45/9-98", "458-9", "7410299"]
r = [re.sub('[^0-9]',"", x) for x in lis]
then apply your function to r.
CodePudding user response:
Try:
- Extract all the numbers (7 digits) from each row
- Apply your string formatting
df["Data"] = df["Data"].str.extractall("(\d )").groupby(level=0).agg("".join)
df["Data"] = df["Data"].apply(lambda x: f"{x[:4]}-{x[4:5]}/{x[5:]}")
>>> df
Data
0 3101-2/00
1 7319-0/02
2 8800-6/00
3 4772-5/00
4 7410-2/99
In one line:
df["Data"] = (df["Data"].str
.extractall("(\d )")
.groupby(level=0)
.agg("".join)
.squeeze()
.apply(lambda x: f"{x[:4]}-{x[4:5]}/{x[5:]}")
)
CodePudding user response:
With regular expressions?
import re
for data in (
"3101-2/00",
"7319-0/02",
"8800-6/00",
"4772500",
"7410299"):
unformatted = re.search("(\d\d\d\d)(\d)(\d\d)", data)
formatted = re.search("(\d\d\d\d)-(\d)/(\d\d)", data)
if unformatted :
print(f"{unformatted.group(1)}-{unformatted.group(2)}/{unformatted.group(3)}")
if formatted:
print(data)
The output is
3101-2/00
7319-0/02
8800-6/00
4772-5/00
7410-2/99