My data frame has multiple columns like ID, Organizations, Date, Location, etc. I am trying to extract the "organization" values that are within the "Organizations" column. My desired output should be multiple organization's names in a new column, separated by a comma. For example:
ID | Organizations |
---|---|
1 | [{organization=Glaxosmithkline, character_offset=10512}, {organization=Vulpes Fund, character_offset=13845}] |
2 | [{organization=Amazon, character_offset=14589}, {organization=Sinovac, character_offset=18923}] |
I want the output to be something like:
ID | Organizations |
---|---|
1 | Glaxosmithkline, Vulpes Fund |
2 | Amazon, Sinovac |
I tried the following code (getting output as NaN):
latin_combined['newOrg'] = latin_combined['organizations'].str[0].str['organization']
Edited:
df.head(5)['organizations'].to_dict()
gives me the following output:
{0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}
Any suggestions will be helpful.
CodePudding user response:
It seems you have a string. You can use regex
to extract the key, value pair separated by =
, pivot as shown below:
(df['organizations'].str.extractall('([^{=,] )= *([^=,}] )')
.rename({0:'key', 1:'value'}, axis = 1).reset_index()
.groupby(['level_0', 'key'])['value'].agg(', '.join).unstack())
key character_offset organization
level_0
0 14199, 1494 Vac, Health
1 700, 1711 Store, Museum
2 8232, 5517 Mart, Rep
3 3881, 5947 Lodge, Hotel
4 3881, 5947 Airport, Landmark
The data
d = {0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}
df = pd.Series(d).to_frame('organizations')
CodePudding user response:
Is this what you are trying to do?
latin_combined['newOrg'] = latin_combined['organizations'].apply(lambda x : x.split(',')[0])
CodePudding user response:
You could use list comprehension with apply:
import pandas as pd
df = pd.DataFrame([[[{'organization':'Glaxosmithkline', 'character_offset':10512}, {'organization':'Vulpes Life Sciences Fund', 'character_offset':13845}]]], columns=['newOrg'])
df['Organizations'] = df['newOrg'].apply(lambda x: [i['organization'] for i in x])
output:
newOrg | Organizations | |
---|---|---|
0 | [{'organization': 'Glaxosmithkline', 'character_offset': 10512}, {'organization': 'Vulpes Life Sciences Fund', 'character_offset': 13845}] | ['Glaxosmithkline', 'Vulpes Life Sciences Fund'] |
CodePudding user response:
1. Updated following your recent dataframe update:
data = {'Organizations': ['[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
'[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
'[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
'[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
'[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]']}
df = pd.DataFrame(data)
df
index | Organizations |
---|---|
0 | [{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}] |
1 | [{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}] |
2 | [{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}] |
3 | [{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}] |
4 | [{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}] |
2. Use ''.join()
regex
with .apply()
on the column you want:
import re
df.Organizations = df.Organizations.apply(lambda x: ', '.join(re.findall(r'{[^=] =\s*([^=,}] )', x)))
df
3. Result:
index | Organizations |
---|---|
0 | Vac, Health |
1 | Store, Museum |
2 | Mart, Rep |
3 | Lodge, Hotel |
4 | Airport, Landmark |
#In my own opinion, you should try to better scrap and/or clean your data before putting them into a dataframe ;)
I've spent some of my free time, so please accept ✅ this answer if it solved your problem, it motivates me :)
Otherwise mention me (using @) in comment while telling me what's wrong ;)
CodePudding user response:
You could do:
df['organizations'].str.extractall(r"organization= *(\w )") \
.groupby(level=0).agg(', '.join).rename(columns={0:'Organizations'})
Organizations
0 Vac, Health
1 Store, Museum
2 Mart, Rep
3 Lodge, Hotel
4 Airport, Landmark