I have a dataframe where each row of a certain column is text that comes from a badly formatted form, in which each field value follows its field title. For example:
| col |
| --- |
| Name: Bob Surname: Ross Title: painter age:34 |
| Surname: Isaac Name: Newton Title: coin checker age: 42 |
| age:20 Title: pilot Name: jack |
| this is some trash text Name: John Surname: Doe |
As the example shows, the fields can appear in any order and some of them may be missing.
What I need to do is to parse the fields so that the second line becomes something like:
{'Name': 'Isaac','Surname': 'Newton',...}
While I can deal with the 'pythonic part', I believe the parsing should be done with a regex (also because there are thousands of rows), but I have no idea how to design it.
CodePudding user response:
Try:
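# Capture each "title: value" pair; a value ends where the next "title:" starts or at the end of the string
x = df["col"].str.extractall(r"([^\s:]+):\s*(.+?)\s*(?=[^\s:]+:|\Z)")
# Back to one row per original row, with one column per field title
x = x.droplevel(level="match").pivot(columns=0, values=1)
# Drop the missing fields and build one dict per row
print(x.apply(lambda row: row[row.notna()].to_dict(), axis=1).to_list())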
Prints:
[
{"Name": "Bob", "Surname": "Ross", "Title": "painter", "age": "34"},
{
"Name": "Newton",
"Surname": "Isaac",
"Title": "coin checker",
"age": "42",
},
{"Name": "jack", "Title": "pilot", "age": "20"},
]
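If it helps to see how the pattern is put together, here is a minimal sketch of the same idea using only the standard `re` module, with the regex written in verbose mode so each piece can be commented. The names `FIELD_RE` and `parse_fields` are made up for this illustration, and it assumes that field titles never contain whitespace or a colon and that values never contain a word immediately followed by a colon.

```python
import re

# Hypothetical helper names; the pattern mirrors the "title: value" idea above.
FIELD_RE = re.compile(
    r"""
    ([^\s:]+):          # field title: a run of non-space, non-colon characters, then ':'
    \s*                 # optional spaces after the colon
    (.+?)               # field value, as short as possible
    \s*                 # optional spaces before the next field
    (?=[^\s:]+:|\Z)     # stop right before the next "title:" or at the end of the string
    """,
    re.VERBOSE,
)


def parse_fields(text):
    """Return a dict mapping each field title found in text to its value."""
    return dict(FIELD_RE.findall(text))


rows = [
    "Name: Bob Surname: Ross Title: painter age:34",
    "Surname: Isaac Name: Newton Title: coin checker age: 42",
    "age:20 Title: pilot Name: jack",
    "this is some trash text Name: John Surname: Doe",
]

for row in rows:
    print(parse_fields(row))
# The last row prints: {'Name': 'John', 'Surname': 'Doe'}
```

Text before the first title (like the trash text in the last row) is skipped because the pattern only starts matching at a word that is directly followed by a colon. For thousands of rows, something like `df["col"].map(parse_fields)` should also work, at the cost of a Python-level loop.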