Home > Net >  Split a Pandas DataFrame column based on a pattern
Split a Pandas DataFrame column based on a pattern

Time:03-01

I have a pandas DataFrame which looks something like this

d1 = pd.DataFrame({"Title":['t1','t2','t3'],
               "Description":['[rank:102] [status: Confirmed] [City: xyz] [Country:C1]',
                              '[rank:93] [status: Rejected] [City: abc] [Country:C2]',
                              '[rank: ] [status: Confirmed] [City: ] [Country: ]']})

As you can see, the important data which is needed for analysis is in the second column in a pattern which I want to separate. What is the best way to separate 'rank','status','city' and 'country into different columns themselves. Also, sometimes the data will will be absent in the description column for a few rows (record 3 for example)

As my final output, I would want my dataset to contain five columns

  1. Title
  2. Rank
  3. Status
  4. City
  5. Country

CodePudding user response:

You could use a regex with str.extractall to extract the key/value pairs as two columns, then pivot to set the first column as column names, and finally join to the original:

(d1
 .drop('Description', axis=1)
 .join(
  d1['Description']
  .str.extractall('([^][] ):([^][] )')
  .droplevel(1) # get rid of extra "match" level to aggregate the rows into 1
  .pivot(columns=0, values=1)
 )
)

output:

  Title  City Country rank      status
0    t1   xyz      C1  102   Confirmed
1    t2   abc      C2   93    Rejected
2    t3                      Confirmed
  • Related