I am trying to remove the United States listed in the organization columns within a list for all the rows of an extensive dataset.
My df looks something like this:
ID | Organizations |
---|---|
1 | ['education', 'health', 'United States', 'facebook'] |
2 | ['health', 'Airlines', 'WHO', 'United States'] |
...
I want my output to look like this:
ID | Organizations |
---|---|
1 | ['education', 'health','facebook'] |
2 | ['health', 'Airlines', 'WHO'] |
The code I tried:
df=df['organizations'].remove("United States")
gave me the following error:
AttributeError: 'Series' object has no attribute 'remove'
CodePudding user response:
You would need to loop here, using apply
:
df['Organizations'].apply(lambda l: l.remove('United States'))
Or a list comprehension:
df['Organizations'] = [[x for x in l if x != 'United States'] for l in df['Organizations']]
Output:
ID Organizations
0 1 [education, health, facebook]
1 2 [health, Airlines, WHO]
Note that the first one will fail if you don't have 'United States' in all the lists
handling NaNs
df['Organizations'] = [[x for x in l if x != 'United States']
if isinstance(l, list) else l
for l in df['Organizations']]
Used input:
df = pd.DataFrame({'ID': [1, 2],
'Organizations': [['education', 'health', 'United States', 'facebook'],
['health', 'Airlines', 'WHO', 'United States']]})
CodePudding user response:
You could also consider:
(df.explode('Organizations')
.query('Organizations != "United States"').groupby('ID')
.agg(list).reset_index())
ID Organizations
0 1 [education, health, facebook]
1 2 [health, Airlines, WHO]
CodePudding user response:
Another way is to explode, drop the values, then aggregate back with groupby
:
df['Organizations'] = df['Organizations'].explode()\
.loc[lambda x: x!='United States']]\
.groupby(level=0).agg(list)