Home > Mobile >  How to decode a string using regex?
How to decode a string using regex?

Time:08-30

I have a file which contains data of users in rows which is stored in some cryptic format. I want to decode that and create a dataframe

sample row -- AN04N010105SANDY0205SMITH030802031989

Note- AN04N01 is standard 7 letter string at the start to denote that this row is valid.

  1. Here 0105SANDY refers to 1st column(name) having length 5
  • 01 -> 1st column ( which is name column )
  • 05 -> length of name ( Sandy )
  1. Similarly,0205SMITH refers to
  • 02 -> 2nd column ( which is surname column )
  • 05 -> length of surname ( Smith )
  1. Similarly,030802031989 refers to
  • 03 -> 3rd column ( DOB )
  • 08 -> length of DOB

I want a data frame like --

| name | surname | DOB |
|Sandy  | SMITH | 02031989 |

I was trying to use regex, but i don't know how to put this into a data frame after identifying names, also how will you find the number of characters to read?

CodePudding user response:

Rather than using regex for groups that might be out of order and varying length, it might be simpler to consume the string in a serial manner.

With the following, you track an index i through the string and consume two characters for code, then length and finally the variable amount of characters given by length. Then, you store the values in a dict, append the dicts to a list and turn that list of dicts into a dataframe. Bonus, it works with the elements in any order.

import pandas as pd

test_strings = [
    "AN04N010105ALICE0205ADAMS030802031989",
    "AN04N010103BOB0205SMITH0306210876",
    "AN04N0103060101010104FRED0204OWEN",
    "XXXXXXX0105SANDY0205SMITH030802031989",
    ]

code_map = {"01": "name", "02": "surname", "03": "DOB"}

def parse(s):
    i = 7
    d = {}
    while i < len(s):
        code, i = s[i:i 2], i 2  # read code
        length, i = int(s[i:i 2]), i 2  # read length
        val, i = s[i:i length], i   length  # read value
        d[code_map[code]] = val  # store value
    return d

ds = []

for s in test_strings:
    if not s.startswith("AN04N01"):
        continue
    ds.append(parse(s))

df = pd.DataFrame(ds)

df contains:

    name surname       DOB
0  ALICE   ADAMS  02031989
1    BOB   SMITH    210876
2   FRED    OWEN    010101

CodePudding user response:

Try:

def fn(x):
    rv, x = [], x[7:]
    while x:
        _, n, x = x[:2], x[2:4], x[4:]
        value, x = x[: int(n)], x[int(n) :]
        rv.append(value)
    return rv


m = df["row"].str.startswith("AN04N01")
df[["NAME", "SURNAME", "DOB"]] = df.loc[m, "row"].apply(fn).apply(pd.Series)
print(df)

Prints:

                                     row   NAME SURNAME       DOB
0  AN04N010105SANDY0205SMITH030802031989  SANDY   SMITH  02031989
1  AN04N010105BANDY0205BMITH030802031989  BANDY   BMITH  02031989
2  AN04N010105CANDY0205CMITH030802031989  CANDY   CMITH  02031989
3  XXXXXXX0105DANDY0205DMITH030802031989    NaN     NaN       NaN

Dataframe used:

                                     row
0  AN04N010105SANDY0205SMITH030802031989
1  AN04N010105BANDY0205BMITH030802031989
2  AN04N010105CANDY0205CMITH030802031989
3  XXXXXXX0105DANDY0205DMITH030802031989

CodePudding user response:

here it is the code for this pattern : (\w{2}\d{2}\w{1}\d{2})(\d{4}\w{5}\d \w{5})(\d )

or use this pattern : (\D{5})\d (\D )\d (02\d )

  • Related