How to decode a string using regex?-CodePudding

I have a file which contains data of users in rows which is stored in some cryptic format. I want to decode that and create a dataframe

sample row -- AN04N010105SANDY0205SMITH030802031989

Note- AN04N01 is standard 7 letter string at the start to denote that this row is valid.

Here 0105SANDY refers to 1st column(name) having length 5

01 -> 1st column ( which is name column )
05 -> length of name ( Sandy )

Similarly,0205SMITH refers to

02 -> 2nd column ( which is surname column )
05 -> length of surname ( Smith )

Similarly,030802031989 refers to

03 -> 3rd column ( DOB )
08 -> length of DOB

I want a data frame like --

| name | surname | DOB |
|Sandy  | SMITH | 02031989 |

I was trying to use regex, but i don't know how to put this into a data frame after identifying names, also how will you find the number of characters to read?

CodePudding user response：

Rather than using regex for groups that might be out of order and varying length, it might be simpler to consume the string in a serial manner.

With the following, you track an index i through the string and consume two characters for code, then length and finally the variable amount of characters given by length. Then, you store the values in a dict, append the dicts to a list and turn that list of dicts into a dataframe. Bonus, it works with the elements in any order.

import pandas as pd

test_strings = [
    "AN04N010105ALICE0205ADAMS030802031989",
    "AN04N010103BOB0205SMITH0306210876",
    "AN04N0103060101010104FRED0204OWEN",
    "XXXXXXX0105SANDY0205SMITH030802031989",
    ]

code_map = {"01": "name", "02": "surname", "03": "DOB"}

def parse(s):
    i = 7
    d = {}
    while i < len(s):
        code, i = s[i:i 2], i 2  # read code
        length, i = int(s[i:i 2]), i 2  # read length
        val, i = s[i:i length], i   length  # read value
        d[code_map[code]] = val  # store value
    return d

ds = []

for s in test_strings:
    if not s.startswith("AN04N01"):
        continue
    ds.append(parse(s))

df = pd.DataFrame(ds)

df contains:

    name surname       DOB
0  ALICE   ADAMS  02031989
1    BOB   SMITH    210876
2   FRED    OWEN    010101

CodePudding user response：

Try:

def fn(x):
    rv, x = [], x[7:]
    while x:
        _, n, x = x[:2], x[2:4], x[4:]
        value, x = x[: int(n)], x[int(n) :]
        rv.append(value)
    return rv


m = df["row"].str.startswith("AN04N01")
df[["NAME", "SURNAME", "DOB"]] = df.loc[m, "row"].apply(fn).apply(pd.Series)
print(df)

Prints:

                                     row   NAME SURNAME       DOB
0  AN04N010105SANDY0205SMITH030802031989  SANDY   SMITH  02031989
1  AN04N010105BANDY0205BMITH030802031989  BANDY   BMITH  02031989
2  AN04N010105CANDY0205CMITH030802031989  CANDY   CMITH  02031989
3  XXXXXXX0105DANDY0205DMITH030802031989    NaN     NaN       NaN

Dataframe used:

                                     row
0  AN04N010105SANDY0205SMITH030802031989
1  AN04N010105BANDY0205BMITH030802031989
2  AN04N010105CANDY0205CMITH030802031989
3  XXXXXXX0105DANDY0205DMITH030802031989

CodePudding user response：

here it is the code for this pattern : (\w{2}\d{2}\w{1}\d{2})(\d{4}\w{5}\d \w{5})(\d )

or use this pattern : (\D{5})\d (\D )\d (02\d )