Extract organization from a column with list of dictionaries using pd dataframe-CodePudding

My data frame has multiple columns like ID, Organizations, Date, Location, etc. I am trying to extract the "organization" values that are within the "Organizations" column. My desired output should be multiple organization's names in a new column, separated by a comma. For example:

ID	Organizations
1	[{organization=Glaxosmithkline, character_offset=10512}, {organization=Vulpes Fund, character_offset=13845}]
2	[{organization=Amazon, character_offset=14589}, {organization=Sinovac, character_offset=18923}]

I want the output to be something like:

ID	Organizations
1	Glaxosmithkline, Vulpes Fund
2	Amazon, Sinovac

I tried the following code (getting output as NaN):

latin_combined['newOrg'] = latin_combined['organizations'].str[0].str['organization']

Edited: df.head(5)['organizations'].to_dict() gives me the following output:

{0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
 1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
 2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
 3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
 4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}

Any suggestions will be helpful.

CodePudding user response：

It seems you have a string. You can use regex to extract the key, value pair separated by =, pivot as shown below:

(df['organizations'].str.extractall('([^{=,] )= *([^=,}] )') 
  .rename({0:'key', 1:'value'}, axis = 1).reset_index()
  .groupby(['level_0', 'key'])['value'].agg(', '.join).unstack())

key      character_offset       organization
level_0                                     
0             14199, 1494        Vac, Health
1               700, 1711      Store, Museum
2              8232, 5517          Mart, Rep
3              3881, 5947       Lodge, Hotel
4              3881, 5947  Airport, Landmark

The data

d = {0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
 1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
 2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
 3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
 4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}

df = pd.Series(d).to_frame('organizations')

CodePudding user response：

Is this what you are trying to do?

latin_combined['newOrg'] = latin_combined['organizations'].apply(lambda x : x.split(',')[0])

CodePudding user response：

You could use list comprehension with apply:

import pandas as pd

df = pd.DataFrame([[[{'organization':'Glaxosmithkline', 'character_offset':10512}, {'organization':'Vulpes Life Sciences Fund', 'character_offset':13845}]]], columns=['newOrg'])
df['Organizations'] = df['newOrg'].apply(lambda x: [i['organization'] for i in x])

output:

	newOrg	Organizations
0	[{'organization': 'Glaxosmithkline', 'character_offset': 10512}, {'organization': 'Vulpes Life Sciences Fund', 'character_offset': 13845}]	['Glaxosmithkline', 'Vulpes Life Sciences Fund']

CodePudding user response：

1. Updated following your recent dataframe update:

data = {'Organizations': ['[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
        '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
        '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
        '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
        '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]']}
df = pd.DataFrame(data) 
df

index	Organizations
0	[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]
1	[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]
2	[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]
3	[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]
4	[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]

2. Use ''.join() regex with .apply() on the column you want:

import re
df.Organizations = df.Organizations.apply(lambda x: ', '.join(re.findall(r'{[^=] =\s*([^=,}] )', x)))
df

3. Result:

index	Organizations
0	Vac, Health
1	Store, Museum
2	Mart, Rep
3	Lodge, Hotel
4	Airport, Landmark

#In my own opinion, you should try to better scrap and/or clean your data before putting them into a dataframe ;)

I've spent some of my free time, so please accept ✅ this answer if it solved your problem, it motivates me :)

Otherwise mention me (using @) in comment while telling me what's wrong ;)

CodePudding user response：

You could do:

df['organizations'].str.extractall(r"organization= *(\w )") \
    .groupby(level=0).agg(', '.join).rename(columns={0:'Organizations'})

       Organizations
0        Vac, Health
1      Store, Museum
2          Mart, Rep
3       Lodge, Hotel
4  Airport, Landmark