Pandas: converting a column of lists to a column of text (data pre-processing)

I have a data set that looks like this:

sentiment	text
positive	['chewy', 'what', 'dhepburn', 'said']
neutral	['chewy', 'plus', 'you', 've', 'added']

and I want to convert it to this:

sentiment	text
positive	chewy what dhepburn said
neutral	chewy plus you ve added

I basically want to convert the 'text' column, which is made up of lists, into a column of text.

I've done multiple versions of this code:

def joinr(words):
   return ','.join(words)

#df['text'] = df.apply(lambda row: joinr(row['text']), axis=1)
#df['text'] = df['text'].apply(lambda x: ' '.join([x]))
df['text'] = df['text'].apply(joinr)

and I keep getting something that resembles this:

sentiment	text
positive	['c h e w y', 'w h a t', 'd h e p b u r n', 's a i d']
neutral	['c h e w y', 'p l u s', 'y o u', 'v e', 'a d d e d']

This is part of data pre-processing for an ML model. I'm working in Google Colab (similar to Jupyter Notebook).

CodePudding user response：

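I believe your problem is the axis=1; you don't need it. If each cell is a string that looks like a list, you can strip the brackets and quotes, split, and re-join: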

import pandas as pd

data = {
    'sentiment': ['positive', 'neutral'],
    'text': ["['chewy', 'what', 'dhepburn', 'said']", "['chewy', 'plus', 'you', 've', 'added']"]
}
df = pd.DataFrame(data)
# Strip the brackets and quotes from each string
df['text'] = df['text'].apply(lambda x: x.replace('[', '').replace(']', '').replace("'", ''))
# Split on ', ' (comma plus space) so the words carry no leading spaces
df['text'] = df['text'].apply(lambda x: x.split(', '))
# Join each list of words back into one space-separated string
df['text'] = df['text'].apply(' '.join)
df

CodePudding user response：

Use join:

df['text'].str.join(' ')

Demonstration:

import pandas as pd

df = pd.DataFrame({'test': [['chewy', 'what', 'dhepburn', 'said']]})
df['test'].str.join(' ')

Output:

0    chewy what dhepburn said
Name: test, dtype: object
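Note that str.join only gives this result when each cell is an actual list. A minimal sketch of how to check, assuming the column might instead hold strings that merely look like lists (the names as_lists and as_strings are made up for illustration):

```python
import pandas as pd

# One Series holds real lists, the other holds strings that look like lists
as_lists = pd.Series([['chewy', 'what', 'dhepburn', 'said']])
as_strings = pd.Series(["['chewy', 'what', 'dhepburn', 'said']"])

# Inspect the Python type of each cell
list_types = as_lists.map(type).unique()
string_types = as_strings.map(type).unique()

# Joining a list joins its items; joining a string joins its characters
joined_list = ' '.join(as_lists[0])
joined_string = ' '.join(as_strings[0])

print(list_types, string_types)
print(joined_list)
print(joined_string)
```

If the type check reports str, the cells need to be parsed into lists (or cleaned as text) before joining.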

Based on the comment, if the column holds strings rather than lists:

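import pandas as pd

# Preparing data (columns are tab-separated)
string = """sentiment\ttext
positive\t['chewy', 'what', 'dhepburn', 'said']
neutral\t['chewy', 'plus', 'you', 've', 'added']"""
data = [x.split('\t') for x in string.split('\n')]
df = pd.DataFrame(data[1:], columns=data[0])

# Solution: parse each string into a list, then join the words.
# eval runs arbitrary code, so only use it on data you trust.
df['text'].apply(eval).str.join(' ')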

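A simpler alternative is to strip the brackets, quotes, and commas with a regex (pass regex=True explicitly, since newer pandas versions treat the pattern as a literal string by default):

df['text'].str.replace(r"\[|\]|'|,", '', regex=True)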

Output:

0    chewy what dhepburn said
1     chewy plus you ve added
Name: text, dtype: object

CodePudding user response：

If you have a string representation of a list you can use:

from ast import literal_eval

df['text'] = df['text'].apply(lambda x: ' '.join(literal_eval(x)))

If you really just want to remove the brackets, quotes, and commas, use a regex:

df['text'] = df['text'].str.replace(r"[\[\]',]", '', regex=True)

Output:

  sentiment                      text
0  positive  chewy what dhepburn said
1   neutral   chewy plus you ve added
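If you are not sure whether a cell is a list or a string representation of one, the two cases can be handled together. This is only a sketch: to_text is a hypothetical helper name, and the mixed column below is made up to exercise both branches.

```python
from ast import literal_eval

import pandas as pd


def to_text(cell):
    """Return a space-separated string for a list, or for a string that looks like a list."""
    # literal_eval only parses Python literals, so it does not run arbitrary code
    words = literal_eval(cell) if isinstance(cell, str) else cell
    return ' '.join(words)


# Hypothetical data: the first cell is a string, the second is a real list
df = pd.DataFrame({
    'sentiment': ['positive', 'neutral'],
    'text': ["['chewy', 'what', 'dhepburn', 'said']",
             ['chewy', 'plus', 'you', 've', 'added']],
})
df['text'] = df['text'].apply(to_text)
print(df)
```

Unlike the regex approach, this keeps any commas or apostrophes that appear inside the words themselves.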