I have this dataframe:
df = pd.DataFrame({"c1":["[\"text\",\"text2\"]","[\"bla\",\"bla\",\"bla\"]"]})
and I'm removing the [] and "" characters:
df["c2"] = df["c1"].apply(lambda x: re.sub(r'[\["\]]', "", x))
Then I want to put df['c2'] into a list:
list = df['c2'].to_list()
Then I get this: ['text,text2', 'bla,bla,bla']
So far so good. But then I want a list with only unique values, which I could get using set(list).
The problem is that instead of ['text,text2', 'bla,bla,bla'] I need ['text','text2', 'bla','bla','bla'], so that when I apply set(list) I get what I am expecting:
['text','text2','bla']
CodePudding user response:
First, don't use list as a variable name, since it shadows the built-in. Second, once you get ['text,text2',...] you can use str.split. So your set would be
{y for x in df['c2'].str.split(',') for y in x}
Output:
{'bla', 'text', 'text2'}
Note: You can use a regex directly to extract everything between the double quotes:
set(df['c1'].str.extractall(r'"([^"]+)"')[0])
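For reference, here is a self-contained sketch of both approaches run against the dataframe from the question. The names unique_from_split and unique_from_extract are only illustrative, and the cleaning step uses str.replace with regex=True as a stand-in for the re.sub call in the question.

```python
import pandas as pd

# Same data as in the question: each cell is a string that looks like a list.
df = pd.DataFrame({"c1": ['["text","text2"]', '["bla","bla","bla"]']})

# Strip brackets and double quotes (assumed equivalent to the question's re.sub step).
df["c2"] = df["c1"].str.replace(r'[\[\]"]', "", regex=True)

# Approach 1: split each cleaned string on commas and flatten into a set.
unique_from_split = {item for parts in df["c2"].str.split(",") for item in parts}

# Approach 2: extract every double-quoted token straight from c1.
unique_from_extract = set(df["c1"].str.extractall(r'"([^"]+)"')[0])

print(sorted(unique_from_split))    # ['bla', 'text', 'text2']
print(sorted(unique_from_extract))  # ['bla', 'text', 'text2']
```

Both give the same set here. The extractall version skips the intermediate c2 column, which may be convenient if c2 is not needed for anything else.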
CodePudding user response:
Try this:
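new = []
# iterate over the cleaned strings; a variable named `list` would shadow the built-in used below
for l in df['c2'].to_list():
    new.extend(l.split(','))
# deduplicate via set, then convert back to a list
new = list(set(new))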
which results in new being (the order may vary, since sets are unordered)
['text2', 'text', 'bla']
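If the order of first appearance matters, one possible variation is to deduplicate with dict.fromkeys instead of set. This is only a sketch: cleaned is a hypothetical name holding the values that df['c2'].to_list() produces in the question.

```python
# Hypothetical stand-in for df['c2'].to_list() from the question.
cleaned = ['text,text2', 'bla,bla,bla']

# Flatten: split every comma-separated string and collect the pieces.
flat = [item for row in cleaned for item in row.split(',')]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops duplicates while preserving first-seen order.
unique_in_order = list(dict.fromkeys(flat))

print(unique_in_order)  # ['text', 'text2', 'bla']
```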