Home > Enterprise >  How to extract certain parts of string values in column and turn them into columns?
How to extract certain parts of string values in column and turn them into columns?

Time:10-16

I have a dataframe which looks like this:

val
"ID: abc\nName: John\nLast name: Johnson\nAge: 27"
"ID: igb1\nName: Mike\nLast name: Jackson\nPosition: CEO\nAge: 42"
...

I would like to extract Name, Position and Age from those values and turn them into separate columns to get dataframe which looks like this:

Name  Position     Age
John     NaN        27
Mike     CEO        42

How could I do that?

CodePudding user response:

Option 1

  • Use Series.str.extract inside a list comprehension to get capturing groups for each string ("key") in the list cols. Each series thus produced will get the appropiate column name this way.
  • Next, we pass the result to pd.concat with axis=1 to combine the individual series.
  • (You could potentially do this in one go with Series.str.extractall, but whether that works, depends on the order of the different capturing groups. E.g. if the order is always the same, this can work.)
cols = ['Name', 'Position', 'Age']
res = pd.concat([df.val.str.extract(fr'(?P<{col}>(?<={col}: ).*)\n?') 
                 for col in cols], axis=1)

# e.g. for "Name", we'll look for all chars preceded by "Name :" and followed 
# by "\n", and same for "Position". Question mark at the end is necessary for "Age" 
# (or any other such category that comes at the end)

print(res)

   Name Position Age
0  John      NaN  27
1  Mike      CEO  42

Option 2

  • Instead of relying on regex patterns, we could also resort to Series.str.split. First, we split on \n and use Series.explode to get rows with "key: value" pairs (but as strings, of course).
  • Next, we apply split again, but on : , and now add expand=True.
  • We will now have one column (0) that contains all the keys, and another (1) that contains the associated values. We use df.pivot to turn the values in the first column into the column names and get the values from the second column as data.
  • Now, we just keep only the columns that we are interested in (i.e. ['Name', 'Position', 'Age']).
  • N.B. you could reverse these last two steps, of course. E.g. do res[res[0].isin(cols)] and only then apply pivot.
cols = ['Name', 'Position', 'Age']

res = df.val.str.split('\n').explode().str.split(': ', expand=True)\
    .pivot(columns=0, values=1)[cols]

# cosmetic (because of `pivot` columns.name will be `0`)
res.columns.name = None

print(res)

   Name Position Age
0  John      NaN  27
1  Mike      CEO  42
  • Related