Home > Blockchain >  How to put a column of values in a specific format?
How to put a column of values in a specific format?

Time:03-10

I have a dataframe with a column that has some already formatted strings and other values without formatting, how to put everything in the same format?

Data example:

| Data |
| -------- |
| 3101-2/00    |
| 7319-0/02   |
| 8800-6/00   |
| 4772500  |
| 7410299  |

This is the correct format: 3101-2/00

How to place the ones that are not formatted correctly and keep the ones that are formatted?

To format I would use this code below. But how to apply it to the entire column and ignore the ones that are already formatted? Maybe by the len of the string?

data= 7410299 
cnae = data[:4]   "-"   data[4:5]   "/"   data[5:]
print(data)
>> 7410-2/99

CodePudding user response:

What about unformating all the column, then re-format it.

lis = ["45/9-98", "458-9", "7410299"]
r = [re.sub('[^0-9]',"", x) for x in lis]

then apply your function to r.

CodePudding user response:

Try:

  1. Extract all the numbers (7 digits) from each row
  2. Apply your string formatting
df["Data"] = df["Data"].str.extractall("(\d )").groupby(level=0).agg("".join)
df["Data"] = df["Data"].apply(lambda x: f"{x[:4]}-{x[4:5]}/{x[5:]}")

>>> df
        Data
0  3101-2/00
1  7319-0/02
2  8800-6/00
3  4772-5/00
4  7410-2/99

In one line:

df["Data"] = (df["Data"].str
                        .extractall("(\d )")
                        .groupby(level=0)
                        .agg("".join)
                        .squeeze()
                        .apply(lambda x: f"{x[:4]}-{x[4:5]}/{x[5:]}")
              )

CodePudding user response:

With regular expressions?

import re

for data in (
    "3101-2/00",
    "7319-0/02",
    "8800-6/00",
    "4772500",
    "7410299"):
    unformatted = re.search("(\d\d\d\d)(\d)(\d\d)", data)
    formatted = re.search("(\d\d\d\d)-(\d)/(\d\d)", data)
    if unformatted :
        print(f"{unformatted.group(1)}-{unformatted.group(2)}/{unformatted.group(3)}")
    if formatted:
        print(data)

The output is

3101-2/00
7319-0/02
8800-6/00
4772-5/00
7410-2/99
  • Related