I have a file which contains data of users in rows which is stored in some cryptic format. I want to decode that and create a dataframe
sample row -- AN04N010105SANDY0205SMITH030802031989
Note-
AN04N01
is standard 7 letter string at the start to denote that this row is valid.
- Here
0105SANDY
refers to 1st column(name) having length 5
- 01 -> 1st column ( which is name column )
- 05 -> length of name ( Sandy )
- Similarly,
0205SMITH
refers to
- 02 -> 2nd column ( which is surname column )
- 05 -> length of surname ( Smith )
- Similarly,
030802031989
refers to
- 03 -> 3rd column ( DOB )
- 08 -> length of DOB
I want a data frame like --
| name | surname | DOB |
|Sandy | SMITH | 02031989 |
I was trying to use regex, but i don't know how to put this into a data frame after identifying names, also how will you find the number of characters to read?
CodePudding user response:
Rather than using regex for groups that might be out of order and varying length, it might be simpler to consume the string in a serial manner.
With the following, you track an index i
through the string and consume two characters for code
, then length
and finally the variable amount of characters given by length
. Then, you store the values in a dict
, append the dict
s to a list and turn that list
of dict
s into a dataframe. Bonus, it works with the elements in any order.
import pandas as pd
test_strings = [
"AN04N010105ALICE0205ADAMS030802031989",
"AN04N010103BOB0205SMITH0306210876",
"AN04N0103060101010104FRED0204OWEN",
"XXXXXXX0105SANDY0205SMITH030802031989",
]
code_map = {"01": "name", "02": "surname", "03": "DOB"}
def parse(s):
i = 7
d = {}
while i < len(s):
code, i = s[i:i 2], i 2 # read code
length, i = int(s[i:i 2]), i 2 # read length
val, i = s[i:i length], i length # read value
d[code_map[code]] = val # store value
return d
ds = []
for s in test_strings:
if not s.startswith("AN04N01"):
continue
ds.append(parse(s))
df = pd.DataFrame(ds)
df
contains:
name surname DOB
0 ALICE ADAMS 02031989
1 BOB SMITH 210876
2 FRED OWEN 010101
CodePudding user response:
Try:
def fn(x):
rv, x = [], x[7:]
while x:
_, n, x = x[:2], x[2:4], x[4:]
value, x = x[: int(n)], x[int(n) :]
rv.append(value)
return rv
m = df["row"].str.startswith("AN04N01")
df[["NAME", "SURNAME", "DOB"]] = df.loc[m, "row"].apply(fn).apply(pd.Series)
print(df)
Prints:
row NAME SURNAME DOB
0 AN04N010105SANDY0205SMITH030802031989 SANDY SMITH 02031989
1 AN04N010105BANDY0205BMITH030802031989 BANDY BMITH 02031989
2 AN04N010105CANDY0205CMITH030802031989 CANDY CMITH 02031989
3 XXXXXXX0105DANDY0205DMITH030802031989 NaN NaN NaN
Dataframe used:
row
0 AN04N010105SANDY0205SMITH030802031989
1 AN04N010105BANDY0205BMITH030802031989
2 AN04N010105CANDY0205CMITH030802031989
3 XXXXXXX0105DANDY0205DMITH030802031989
CodePudding user response:
here it is the code for this pattern : (\w{2}\d{2}\w{1}\d{2})(\d{4}\w{5}\d \w{5})(\d )
or use this pattern : (\D{5})\d (\D )\d (02\d )