I have a data set that looks like this:
sentiment | text |
---|---|
positive | ['chewy', 'what', 'dhepburn', 'said'] |
neutral | ['chewy', 'plus', 'you', 've', 'added'] |
and I want to convert it to this:
sentiment | text |
---|---|
positive | chewy what dhepburn said |
neutral | chewy plus you ve added |
I basically want to convert the 'text' column, which is made up of lists, into a column of text.
I've done multiple versions of this code:
def joinr(words):
    return ','.join(words)
#df['text'] = df.apply(lambda row: joinr(row['text']), axis=1)
#df['text'] = df['text'].apply(lambda x: ' '.join([x]))
df['text'] = df['text'].apply(joinr)
and I keep getting something that resembles this:
sentiment | text |
---|---|
positive | ['c h e w y', 'w h a t', 'd h e p b u r n', 's a i d'] |
neutral | ['c h e w y', 'p l u s', 'y o u', 'v e', 'a d d e d'] |
This is a part of data pre-processing for an ML model. I'm working in Google Colab (similar to Jupyter Notebook).
CodePudding user response:
I believe your problem is the axis=1; you don't need it. The example below assumes the 'text' column holds strings that look like lists, not actual lists:
import pandas as pd

data = {
    'sentiment': ['positive', 'neutral'],
    'text': ["['chewy', 'what', 'dhepburn', 'said']", "['chewy', 'plus', 'you', 've', 'added']"]
}
df = pd.DataFrame(data)
# Strip the brackets and quotes from each string
df['text'] = df['text'].apply(lambda x: x.replace('[', '').replace(']', '').replace("'", ''))
# Split on ', ' (comma plus space) so the words carry no leading space
df['text'] = df['text'].apply(lambda x: x.split(', '))
# Join each list of words with single spaces
df['text'] = df['text'].apply(' '.join)
df
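To see why joining can produce space-separated characters, here is a minimal sketch in plain Python. It assumes a cell holds the text of a list rather than a real list; the variable names are only illustrative.

```python
# str.join iterates over its argument; a string is iterated one character at a time.
as_list = ['chewy', 'what']        # a real list of words
as_string = "['chewy', 'what']"    # the same list stored as text (assumed case)

joined_list = ' '.join(as_list)       # joins the two words
joined_string = ' '.join(as_string)   # puts a space between every character
joined_word = ' '.join('chewy')       # 'c h e w y'

print(joined_list)
print(joined_string)
print(joined_word)
```

If type(df['text'].iloc[0]) reports str rather than list, the column needs to be parsed before joining.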
CodePudding user response:
Use str.join:
df['test'].str.join(' ')
Demonstration:
import pandas as pd
df = pd.DataFrame({'test': [['chewy', 'what', 'dhepburn', 'said']]})
df['test'].str.join(' ')
Output:
0 chewy what dhepburn said
Name: test, dtype: object
Based on the comment:
# Preparing data (columns are tab-separated)
string = """sentiment\ttext
positive\t['chewy', 'what', 'dhepburn', 'said']
neutral\t['chewy', 'plus', 'you', 've', 'added']"""
data = [x.split('\t') for x in string.split('\n')]
df = pd.DataFrame(data[1:], columns=data[0])
# Solution: parse each string into a list, then join.
# Note: eval executes arbitrary code, so only use it on trusted data.
df['text'].apply(lambda x: eval(x)).str.join(' ')
Alternatively, and more simply, strip the brackets, quotes, and commas with a regex:
df['text'].str.replace(r"\[|\]|'|,", '', regex=True)
Output:
0 chewy what dhepburn said
1 chewy plus you ve added
Name: text, dtype: object
CodePudding user response:
If you have a string representation of a list you can use:
from ast import literal_eval
df['text'] = df['text'].apply(lambda x: ' '.join(literal_eval(x)))
If you really just want to remove the brackets, quotes, and commas, use a regex:
df['text'] = df['text'].str.replace(r"[\[\]',]", '', regex=True)
Output:
sentiment text
0 positive chewy what dhepburn said
1 neutral chewy plus you ve added
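If it is unclear whether the column holds real lists or their string representations, the two cases can be handled together. The following is only a sketch: to_text is a hypothetical helper name, and it assumes every string in the column is a valid Python list literal (literal_eval raises an error otherwise).

```python
from ast import literal_eval


def to_text(value):
    """Join the words in `value` with single spaces.

    `value` may be a real list of words or a string such as
    "['chewy', 'what']". Hypothetical helper, not part of pandas.
    """
    if isinstance(value, str):
        # Parse the text into a Python list without executing arbitrary code.
        value = literal_eval(value)
    return ' '.join(value)


print(to_text(['chewy', 'what', 'dhepburn', 'said']))      # chewy what dhepburn said
print(to_text("['chewy', 'plus', 'you', 've', 'added']"))  # chewy plus you ve added
```

On the DataFrame from the question this could be applied with df['text'] = df['text'].apply(to_text).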