Good afternoon!
I have a .csv file like this (when opened with Notepad):
"2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"""
"2,"" Proin a tortor leo. Morbi dictum laoreet nulla sit amet luctus. Donec euismod egestas velit, eget consequat ex porttitor vitae. Sed venenatis ornare enim sed rutrum. Aenean congue purus vitae congue rutrum. Ut ex felis, viverra imperdiet est vel, hendrerit luctus ligula.
"""
"2,"" estibulum consequat lorem enim, ut semper erat fringilla id.
"""
"2,"" Praesent a lobortis justo. Cras in sapien enim.
"""
...
I use this to get data from a file:
train = pd.read_csv('yelp_review_polarity_csv/train.csv',
                    header=None,
                    names=['Class', 'Review'],
                    encoding="cp1251",
                    sep=",")
Here is what I get: the second column is filled with "Null" values. I need it to look something like this:
Class Review
2 Lorem ipsum dolor sit amet...
I mean that the data should be split into two columns on the "," delimiter. How can I fix this?
Note: I am using encoding cp1251 so that there are no problems with some characters from another language.
CodePudding user response:
You can iterate over the lines and use parts = s.split(',""', 1) to split each line into 2 values, which also strips the bogus "" from the start of the Review column value.
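To see what that split does, here is a minimal sketch on one sample line copied from the question. The variable names label and review are only illustrative, and the sketch assumes every data line starts with a single stray quote followed by the class and then ,"".

```python
# Sample line copied from the question; the variable names are illustrative.
s = '"2,"" Praesent a lobortis justo. Cras in sapien enim.'

# Drop the stray leading quote, then split once on the ,"" separator.
parts = s.lstrip('"').split(',""', 1)
label, review = parts[0], parts[1].strip()

print(label)
print(review)
```

The maxsplit argument of 1 matters here: it keeps any later ,"" sequence inside the review text from producing extra parts.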
Assuming every line in your "CSV" file has the same format, you can parse the file like this:
import pandas as pd

val1 = []
val2 = []
# same encoding as in the question
with open("yelp_review_polarity_csv/train.csv", encoding="cp1251") as fin:
    for s in fin:
        s = s.strip()
        if not s or s == '"""':
            # skip blank lines and lines with """
            continue
        if s[0] == '"':
            # change "2 to just 2
            s = s[1:]
        parts = s.split(',""', 1)
        val1.append(parts[0])
        val2.append(parts[1])

# construct a data frame from the 2 lists
df = pd.DataFrame({'Class': val1, 'Review': val2})
print(df)
Output:
Class Review
0 2 Lorem ipsum dolor sit amet, consectetur adipi...
1 2 Proin a tortor leo. Morbi dictum laoreet null...
2 2 estibulum consequat lorem enim, ut semper era...
3 2 Praesent a lobortis justo. Cras in sapien enim.
If the format varies, you will need to tweak the code accordingly.
Alternatively, you could change the format of the text file from
old: "2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit. """
to
new: 2,"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
Then pd.read_csv() will parse the input file correctly.
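One possible way to do that conversion is sketched below using only the standard csv module. The helper name to_standard_csv is hypothetical, and the sketch assumes that the Class value is always an integer, that each review sits on one line followed by a line holding only """, and that reviews contain no embedded double quotes.

```python
import csv
import io


def to_standard_csv(lines):
    """Return CSV text with rows like: 2,"Lorem ipsum ..." (hypothetical helper)."""
    out = io.StringIO()
    # QUOTE_NONNUMERIC quotes the review text and leaves the integer class bare.
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for s in lines:
        s = s.strip()
        if not s or s == '"""':
            # skip blank lines and the lines that hold only """
            continue
        label, sep, review = s.lstrip('"').partition(',""')
        if not sep:
            # line does not follow the assumed layout; skip it
            continue
        writer.writerow([int(label), review.strip()])
    return out.getvalue()


# Two records in the layout shown in the question.
sample = [
    '"2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n',
    '"""\n',
    '"2,"" Praesent a lobortis justo. Cras in sapien enim.\n',
    '"""\n',
]
cleaned = to_standard_csv(sample)
print(cleaned)
```

If the cleaned text is written to a new file with the same cp1251 encoding, the pd.read_csv() call from the question (header=None, names=['Class', 'Review']) should then split it into the two expected columns.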