Humbly asking again for the community's help.
I have a data analysis task: to research the connections between different columns of a given dataset. To do that, I have to clean up the columns I want to work with. The column I need contains data that looks like a list of dictionaries but is actually a string, so I have to extract the 'name' values from those would-be "dictionaries".
The code below shows my magical rituals for pulling the "name" values out of that string and saving them in another column as a string that holds only those values, collected in a list. After that I would apply the function to the whole column and group by the unique combinations of those strings. (The most ambitious version of the task was to split the "name" values into several additional columns and later sort by all of them; but a single string in the source column (df['specializations']) can contain any number of "dictionaries", so I can't know in advance how many additional columns to create, and I gave up on that idea.)
A typical string with a pseudo-list of dictionaries looks like this (the number of "dictionaries" varies):
[{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}]
import re
def get_names_values(df):
    for a in df['specializations']:
        for r in (("\'", "\""), ("[", ""), ("]", ""), ("}", "")):
            a = a.replace(*r)
        a = re.split("{", a)
        m = 0
        while m < len(a):
            if a[m] in ('', ': ', ', '):
                del a[m]
            m += 1
        a = "".join(a)
        a = re.split("\"", a)
        n = 0
        while n < len(a):
            if a[n] in ('', ': ', ', '):
                del a[n]
            n += 1
        nameslist = []
        for num in range(len(a)):
            if a[num] == 'name':
                nameslist.append(a[num + 1])
    return str(nameslist)
df['specializations_names'] = df['specializations'].fillna('{}').apply(get_names_values)
df['specializations_names']
The problem arises at the line for a in df['specializations']: which raises TypeError: string indices must be integers. I checked that loop separately (with print(a)), and it gave me the proper result; I also tried it via:
for k in range(len(df)):
    a = df['specializations'][k]
and again, it worked as I needed when run separately, but inside my function it raises the TypeError. I feel like giving up on the ['specializations'] column and researching some others instead, but I'm still curious what's wrong here and how to solve it.
Huge thanks in advance to everyone who tries to help.
CodePudding user response:
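What you describe as a "string with pseudo-list of dictionaries" looks like JSON-style data; strictly speaking, with its single quotes it is the Python repr of a list of dicts rather than valid JSON. You may use eval() to convert it to an actual list of dicts and then work with it normally. Use eval() with caution, though, because it executes whatever code the string contains. I tried to recreate that string and make it work: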
str_dicts = str([{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'},
                 {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'},
                 {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}])
dicts = list(eval(str_dicts))
names = [d['name'] for d in dicts]
print(names)
[0]: ['Beginner', 'Testing', 'IT']
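Since the caution about eval() matters most when the data comes from a file you did not write yourself, here is a minimal sketch of the same conversion using ast.literal_eval from the standard library. It only accepts Python literals (strings, numbers, lists, dicts and so on) and raises an error for anything else. The sample string is copied from the question; the variable names are just for illustration.

```python
import ast

# Sample string copied from the question: the repr of a list of dicts.
str_dicts = (
    "[{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'}, "
    "{'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'}, "
    "{'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}]"
)

# literal_eval parses literals only, so it will not execute code hidden in the string.
dicts = ast.literal_eval(str_dicts)
names = [d['name'] for d in dicts]
print(names)  # ['Beginner', 'Testing', 'IT']
```

If a string is not a plain literal (for example, it contains a function call), literal_eval raises ValueError instead of running it.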
If your column is a Series of strings that each hold a list of dicts, a list comprehension over the column does the same conversion for every row:
df['specializations_names'] = [[d['name'] for d in list(eval(row))]
                               for row in df['specializations']]
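This per-row view also seems to explain the TypeError from the question. Series.apply calls the function once per cell, so inside get_names_values the parameter named df is actually a single string, and df['specializations'] then tries to index a string with another string. A minimal sketch of an apply-based version follows; names_from_cell is a hypothetical helper name, the shortened sample row is made up for brevity, and treating missing values as an empty list is an assumption about what you want.

```python
import ast
import pandas as pd

def names_from_cell(cell):
    """Return the 'name' values from ONE cell of the column (hypothetical helper)."""
    # Missing values (NaN/None) are not strings; treat them as "no specializations".
    if not isinstance(cell, str):
        return []
    return [d['name'] for d in ast.literal_eval(cell)]

# Shortened, made-up sample row in the same shape as the question's data.
row = "[{'id': '1.172', 'name': 'Beginner'}, {'id': '1.117', 'name': 'Testing'}]"
df = pd.DataFrame({'specializations': [row, None]})

# apply() hands each cell (a string or a missing value) to the function, not the DataFrame.
df['specializations_names'] = df['specializations'].apply(names_from_cell)
print(df['specializations_names'].tolist())  # [['Beginner', 'Testing'], []]
```

With this shape there is no need for fillna('{}') before apply, because the helper handles non-string cells itself.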
Here is a partial reproduction of your setup, based on what you provided:
import pandas as pd
str_dicts = str([{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'},
                 {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'},
                 {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}])
df = pd.DataFrame({'specializations': [str_dicts, str_dicts, str_dicts]})
df['specializations_names'] = [[d['name'] for d in list(eval(row))]
                               for row in df['specializations']]
print(df)
Which resulted in:
 | specializations | specializations_names |
---|---|---|
0 | [{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}] | ['Beginner', 'Testing', 'IT'] |
1 | [{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}] | ['Beginner', 'Testing', 'IT'] |
2 | [{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}] | ['Beginner', 'Testing', 'IT'] |
The same approach works when, instead of the dummies I used, each row holds a string with a list of any number of dicts, for as many rows as df has.
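For the grouping step mentioned in the question, a rough sketch under some assumptions: the names are already stored as lists in specializations_names, the sample data below is made up, and the combo column name is hypothetical. Joining each list into one string gives a hashable key to group on, while explode() produces one row per name, which sidesteps the problem of not knowing how many extra columns to create.

```python
import pandas as pd

# Made-up data: each cell already holds a list of names of varying length.
df = pd.DataFrame({'specializations_names': [['Beginner', 'Testing'],
                                             ['IT'],
                                             ['Beginner', 'Testing']]})

# Lists cannot be used as group keys, so join each one into a single string first.
df['combo'] = df['specializations_names'].str.join(', ')
combo_counts = df.groupby('combo').size()
print(combo_counts)

# One row per individual name, regardless of how many names a cell had.
per_name = df.explode('specializations_names')
print(per_name['specializations_names'].value_counts())
```

Whether to group on whole combinations or on individual names depends on what connections you want to study; the two results answer different questions.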