I have a dataframe which looks like this:
val
"ID: abc\nName: John\nLast name: Johnson\nAge: 27"
"ID: igb1\nName: Mike\nLast name: Jackson\nPosition: CEO\nAge: 42"
...
I would like to extract Name, Position and Age from those values and turn them into separate columns to get dataframe which looks like this:
Name Position Age
John NaN 27
Mike CEO 42
How could I do that?
CodePudding user response:
Option 1
- Use
Series.str.extract
inside a list comprehension to get capturing groups for each string ("key") in the listcols
. Each series thus produced will get the appropiate column name this way. - Next, we pass the result to
pd.concat
withaxis=1
to combine the individual series. - (You could potentially do this in one go with
Series.str.extractall
, but whether that works, depends on the order of the different capturing groups. E.g. if the order is always the same, this can work.)
cols = ['Name', 'Position', 'Age']
res = pd.concat([df.val.str.extract(fr'(?P<{col}>(?<={col}: ).*)\n?')
for col in cols], axis=1)
# e.g. for "Name", we'll look for all chars preceded by "Name :" and followed
# by "\n", and same for "Position". Question mark at the end is necessary for "Age"
# (or any other such category that comes at the end)
print(res)
Name Position Age
0 John NaN 27
1 Mike CEO 42
Option 2
- Instead of relying on regex patterns, we could also resort to
Series.str.split
. First, we split on\n
and useSeries.explode
to get rows with "key: value" pairs (but as strings, of course). - Next, we apply
split
again, but on:
, and now addexpand=True
. - We will now have one column (
0
) that contains all the keys, and another (1
) that contains the associated values. We usedf.pivot
to turn the values in the first column into the column names and get the values from the second column as data. - Now, we just keep only the columns that we are interested in (i.e.
['Name', 'Position', 'Age']
). - N.B. you could reverse these last two steps, of course. E.g. do
res[res[0].isin(cols)]
and only then applypivot
.
cols = ['Name', 'Position', 'Age']
res = df.val.str.split('\n').explode().str.split(': ', expand=True)\
.pivot(columns=0, values=1)[cols]
# cosmetic (because of `pivot` columns.name will be `0`)
res.columns.name = None
print(res)
Name Position Age
0 John NaN 27
1 Mike CEO 42