I have a pandas DataFrame which looks something like this
d1 = pd.DataFrame({"Title":['t1','t2','t3'],
"Description":['[rank:102] [status: Confirmed] [City: xyz] [Country:C1]',
'[rank:93] [status: Rejected] [City: abc] [Country:C2]',
'[rank: ] [status: Confirmed] [City: ] [Country: ]']})
As you can see, the important data which is needed for analysis is in the second column in a pattern which I want to separate. What is the best way to separate 'rank','status','city' and 'country into different columns themselves. Also, sometimes the data will will be absent in the description column for a few rows (record 3 for example)
As my final output, I would want my dataset to contain five columns
- Title
- Rank
- Status
- City
- Country
CodePudding user response:
You could use a regex with str.extractall
to extract the key/value pairs as two columns, then pivot
to set the first column as column names, and finally join
to the original:
(d1
.drop('Description', axis=1)
.join(
d1['Description']
.str.extractall('([^][] ):([^][] )')
.droplevel(1) # get rid of extra "match" level to aggregate the rows into 1
.pivot(columns=0, values=1)
)
)
output:
Title City Country rank status
0 t1 xyz C1 102 Confirmed
1 t2 abc C2 93 Rejected
2 t3 Confirmed