I extracted the data from WhatsApp into a txt file. I need to create 4 columns (Date, Time, Name and Message) in my output file.
import pandas as pd

# read file by lines
with open(r'D:\Analysis\example_chat_whatsapp.txt', encoding="utf-8") as f:
    data = f.readlines()

# sanity stats
print('num lines: %s' % (len(data)))

# parse text and create list of lists structure
# remove first whatsapp info message
dataset = data[1:]
cleaned_data = []
for line in dataset:
    # grab the info and cut it out
    date = line.split(" ")[0]
    line2 = line[len(date):]
    time = line2.split(" ")[0][:2]
    line3 = line2[len(time):]
    name = line3.split(":")[0][:4]
    line4 = line3[len(name):]
    message = line4[6:-1]  # strip newline character
    #print(date, time, name, message)
cleaned_data.append([date, time, name, message])

# Create the DataFrame
df = pd.DataFrame(cleaned_data, columns=['Date', 'Time', 'Name', 'Message'])
df
The issue is that the Time column comes out empty and the Name column has the wrong output. Date and Message are OK and match the expected output.
CodePudding user response:
If uncommenting print(date, time, name, message) prints valid data, then just add 4 spaces before cleaned_data.append([date, time, name, message]) so that it runs inside the loop.
for line in dataset:
    # grab the info and cut it out
    date = line.split(" ")[0]
    line2 = line[len(date) + 1:]  # skip the date and the space after it
    time = line2.split(" ")[0]
    line3 = line2[len(time):]
    name = line3.split(":")[0]
    line4 = line3[len(name):]
    message = line4
    # trim the leftover delimiters from each field
    row = (date[1:], time[:-1], name[1:], message[2:-1])
    # print("'%s', '%s', '%s', '%s'" % row)
    cleaned_data.append(row)
s[1:] returns s with the first character removed, s[:-1] returns s with the last character removed, and so on.
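To see what each slice trims, here is a small trace of the same steps on a single line. The sample line and its "[date time] Name: message" layout are assumptions for illustration; a real WhatsApp export may be formatted differently, in which case the offsets would need adjusting.

```python
# Hypothetical sample line; the "[date time] Name: message" layout is an assumption.
line = "[31/12/21 10:15:30] John: Hello there\n"

date = line.split(" ")[0]        # "[31/12/21"  (still has the leading "[")
line2 = line[len(date) + 1:]     # drop the date and the space after it
time = line2.split(" ")[0]       # "10:15:30]"  (still has the trailing "]")
line3 = line2[len(time):]        # " John: Hello there\n"
name = line3.split(":")[0]       # " John"      (still has the leading space)
message = line3[len(name):]      # ": Hello there\n"

# date[1:] drops "[", time[:-1] drops "]", name[1:] drops the space,
# message[2:-1] drops the leading ": " and the trailing newline.
row = (date[1:], time[:-1], name[1:], message[2:-1])
print(row)
```

Each slice only removes the delimiter left over from the previous split, which is why the offsets are 1 or 2 characters rather than fixed widths like [:2] or [:4].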
CodePudding user response:
dataset = data[1:]
cleaned_data = []
for line in dataset:
    # grab the info and cut it out
    date = line.split(" ")[0]
    line2 = line[len(date) + 1:]
    time = line2.split(" ")[0]
    line3 = line2[len(time):]
    name = line3.split(":")[0]
    line4 = line3[len(name):]
    message = line4
    row = (date[1:], time[:-1], name[1:], message[2:-1])
    # print("'%s', '%s', '%s', '%s'" % row)
cleaned_data.append(row)
df = pd.DataFrame(cleaned_data, columns=['date', 'time', 'name', 'message'])
df.count()

date       1
time       1
name       1
message    1
dtype: int64
Do you know why it is taking only one row and not appending the remaining ones?
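One likely cause, assuming cleaned_data.append(row) is not indented under the for loop in your script: the append then runs only once, after the loop has finished, so only the last parsed row is stored. The sketch below compares both placements; the sample lines and the parse helper are hypothetical and only mirror the slicing shown earlier.

```python
# Hypothetical sample lines; the "[date time] Name: message" layout is an assumption.
dataset = [
    "[31/12/21 10:15:30] John: Hello\n",
    "[31/12/21 10:16:02] Mary: Hi\n",
    "[31/12/21 10:17:45] John: Bye\n",
]

def parse(line):
    """Hypothetical helper that applies the same slicing steps to one line."""
    date = line.split(" ")[0]
    line2 = line[len(date) + 1:]
    time = line2.split(" ")[0]
    line3 = line2[len(time):]
    name = line3.split(":")[0]
    message = line3[len(name):]
    return (date[1:], time[:-1], name[1:], message[2:-1])

# append OUTSIDE the loop: executed once, after the loop, with the last row only
outside = []
for line in dataset:
    row = parse(line)
outside.append(row)

# append INSIDE the loop: executed once per line
inside = []
for line in dataset:
    row = parse(line)
    inside.append(row)

print(len(outside), len(inside))
```

If this matches your case, indenting the append line by 4 spaces so it lines up with the other statements in the loop body should give one DataFrame row per chat line.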