Pandas parse text column

Time:04-02

I have a CSV table with a column that contains the text of a chat log. Every row follows the same format: the name of the person and the time of the message (padded with extra spaces before and after), followed by the message content. An example of a single row of the text column:

'  Siri (3:15pm)  Hello how can I help you?  John Wayne (3:17pm)  what day of the week is today  Siri (3:18pm)  it is Monday.'

I would like to transform this single string column into multiple columns (the number of columns would depend on the number of messages), with one column for each individual message, like below:

  • Siri (3:15pm) Hello how can I help you
  • John Wayne (3:17pm) what day of the week is today
  • Siri (3:18pm) it is Monday

How can I parse this text in a pandas dataframe column to separate the chat logs into individual message columns?

CodePudding user response:

If you have this dataframe:

                                                                                                                     Messages
0  Siri (3:15pm)  Hello how can I help you?  John Wayne (3:17pm)  what day of the week is today  Siri (3:18pm)  it is Monday.

then you can do:

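# Split each row on runs of two or more whitespace characters, one piece per line
x = df["Messages"].str.split(r"\s{2,}").explode()

# Pieces alternate between "Name (time)" and text; join each pair with a space
out = (x[::2] + " " + x[1::2]).to_frame()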
print(out)

Prints:

                                            Messages
0            Siri (3:15pm) Hello how can I help you?
0  John Wayne (3:17pm) what day of the week is today
0                        Siri (3:18pm) it is Monday.

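Note: This only works if there are at least two spaces between the name and the text, and if the string has no leading padding (otherwise the first piece is an empty string and the pairing shifts by one).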
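If the goal is the wide layout from the question (one column per message), a rough sketch along the same lines might look like the following. It assumes the pieces are still separated by two or more spaces, strips the leading and trailing padding first, and uses a made-up second row plus hypothetical names (`split_messages`, `message_1`, ...) purely for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({"Messages": [
    "  Siri (3:15pm)  Hello how can I help you?  John Wayne (3:17pm)  what day of the week is today  Siri (3:18pm)  it is Monday.",
    "  Siri (9:00am)  Good morning.",  # made-up second row to show uneven lengths
]})

def split_messages(text):
    # Strip the outer padding, then split on runs of 2+ whitespace characters
    parts = re.split(r"\s{2,}", text.strip())
    # Pieces alternate "Name (time)", text; glue each pair back together
    return [f"{head} {body}" for head, body in zip(parts[::2], parts[1::2])]

rows = [split_messages(text) for text in df["Messages"]]

# Shorter rows are padded with missing values on the right
wide = pd.DataFrame(rows, index=df.index)
wide.columns = [f"message_{i + 1}" for i in range(wide.shape[1])]
print(wide)
```

Rows with fewer messages simply end up with missing values in the trailing columns.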

CodePudding user response:

This is how I did it, took me a while but we got to it!

import pandas as pd

s = pd.Series(['  Siri (3:15pm)  Hello how can I help you?  John Wayne (3:17pm)  what day of the week is today  Siri (3:18pm)  it is Monday.'])

# Split on the two-space delimiter; the leading padding produces an empty first piece
parts = s.str.split(r"  ", expand=True)
parts = parts.drop(labels=[0], axis=1)

# Pieces of the single row, in order: name, message, name, message, ...
list_1 = parts.iloc[0].tolist()

# Even positions hold "Name (time)", odd positions hold the message text
names = list_1[::2]
messages = list_1[1::2]

d = {'Name': names, 'Message': messages}
df = pd.DataFrame(data=d)
df

Output:
                   Name                               Message
0         Siri (3:15pm)             Hello how can I help you?
1   John Wayne (3:17pm)         what day of the week is today
2         Siri (3:18pm)                         it is Monday.
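
In case the speaker and the timestamp are wanted as separate columns, here is a small sketch that starts from a frame shaped like the output above. The regular expression and the column names `Speaker` and `Time` are my own assumptions, not part of the answer; it expects every `Name` value to end with the time in parentheses.

```python
import pandas as pd

# Frame shaped like the output above
df = pd.DataFrame({
    "Name": ["Siri (3:15pm)", "John Wayne (3:17pm)", "Siri (3:18pm)"],
    "Message": [
        "Hello how can I help you?",
        "what day of the week is today",
        "it is Monday.",
    ],
})

# Named groups become column names: text before "(" is the speaker,
# text inside the trailing parentheses is the time
header = df["Name"].str.extract(r"^(?P<Speaker>.+?)\s*\((?P<Time>[^)]+)\)$")

result = pd.concat([header, df["Message"]], axis=1)
print(result)
```

Values that do not match the pattern come back as missing, so it is worth checking for them before relying on the result.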