The separator in pandas.read_csv() does not separate

Time:07-09

Good afternoon!
I have a .csv file like this (when opened with Notepad):

"2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"""
"2,"" Proin a tortor leo. Morbi dictum laoreet nulla sit amet luctus. Donec euismod egestas velit, eget consequat ex porttitor vitae. Sed venenatis ornare enim sed rutrum. Aenean congue purus vitae congue rutrum. Ut ex felis, viverra imperdiet est vel, hendrerit luctus ligula.
"""
"2,"" estibulum consequat lorem enim, ut semper erat fringilla id.
"""
"2,"" Praesent a lobortis justo. Cras in sapien enim.
"""
...

I use this to read the data from the file:

train = pd.read_csv('yelp_review_polarity_csv/train.csv', 
                    header=None, 
                    names=['Class', 'Review'],
                    encoding="cp1251",
                    sep=",")

Here is what I get: the second column is filled with "Null" values. I need it to look something like this:

Class     Review
2         Lorem ipsum dolor sit amet...

I mean that the data should be split into two columns at the "," delimiter. How can I fix this?
Note: I am using the cp1251 encoding so that there are no problems with characters from another language.

CodePudding user response:

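Each record in your file appears to be wrapped in an extra pair of quotes, so read_csv() treats the whole record as a single quoted field and never sees the "," as a separator. You can iterate over the lines instead and use parts = s.split(',""', 1) to split each line into 2 values, which also strips the bogus "" from the Review column value.

Assuming every line in your "CSV" file has the same format, you can parse the file like this: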

import pandas as pd

val1 = []
val2 = []
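# use the same encoding as in the question's read_csv() call
with open("yelp_review_polarity_csv/train.csv", encoding="cp1251") as fin:
    for s in fin:
        s = s.strip()
        if not s or s == '"""':
            # skip blank lines and lines with """
            continue
        if s[0] == '"':
            # change "2 to just 2
            s = s[1:]
        # split once on ,"" which separates the class from the review text
        parts = s.split(',""', 1)
        val1.append(parts[0])
        val2.append(parts[1])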

# construct a data frame from the 2 lists
df = pd.DataFrame({'Class': val1, 'Review': val2})
print(df)

Output:

  Class                                             Review
0     2   Lorem ipsum dolor sit amet, consectetur adipi...
1     2   Proin a tortor leo. Morbi dictum laoreet null...
2     2   estibulum consequat lorem enim, ut semper era...
3     2    Praesent a lobortis justo. Cras in sapien enim.

If the format varies, you will need to tweak the code accordingly.
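
As a sketch of a different route, and assuming the file really is ordinary CSV that was quoted twice (the sample suggests this but does not prove it), Python's csv module can undo both layers instead of splitting by hand. The function name parse_double_quoted and the inline sample are made up for illustration.

```python
import csv
import io


def parse_double_quoted(text):
    """Return (class, review) pairs from text whose records were CSV-quoted twice."""
    rows = []
    # First pass: every record decodes to ONE field, e.g.  2," Lorem ...\n"
    for outer in csv.reader(io.StringIO(text, newline="")):
        if not outer:
            continue  # skip blank lines
        # Second pass: that single field is itself a two-field CSV record.
        inner = next(csv.reader(io.StringIO(outer[0], newline="")))
        label, review = inner[0], inner[1]
        # Unlike the loop above, surrounding whitespace is stripped here.
        rows.append((label, review.strip()))
    return rows


# Two records copied from the question (assumption: the real file looks the same).
sample = (
    '"2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"""\n'
    '"2,"" Praesent a lobortis justo. Cras in sapien enim.\n"""\n'
)
rows = parse_double_quoted(sample)
for label, review in rows:
    print(label, review)
```

The resulting list can be handed to pd.DataFrame(rows, columns=['Class', 'Review']). For the real file you would pass the text read with encoding="cp1251" rather than the inline sample.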

Alternatively, you could change the format of each record in the text file:
old: "2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit. """
new: 2,"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
Then pd.read_csv() will parse the input file correctly.
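
One way to do that conversion, sketched under the same assumptions as the loop above: each record occupies exactly two lines, the class is an integer, and the review contains no other escaped quotes. The helper name convert is hypothetical, and in-memory buffers stand in for the real files.

```python
import csv
import io


def convert(src, dst):
    """Rewrite records of the old form into lines of the form: 2,"Review text"."""
    # QUOTE_NONNUMERIC quotes the review (a str) and leaves the int class bare.
    writer = csv.writer(dst, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for line in src:
        line = line.strip()
        if not line or line == '"""':
            continue  # blank line, or the closing-quote line of a record
        if line.startswith('"'):
            line = line[1:]  # drop the outer opening quote
        # Assumes every remaining line contains ,"" exactly as in the sample.
        label, review = line.split(',""', 1)
        writer.writerow([int(label), review.strip()])


# Stand-in for the original file, using two records from the question.
src = io.StringIO(
    '"2,"" Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"""\n'
    '"2,"" Praesent a lobortis justo. Cras in sapien enim.\n"""\n'
)
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue())
```

With real files, open the source with encoding="cp1251" and the destination with encoding="cp1251" and newline="", then load the converted file with pd.read_csv(path, header=None, names=['Class', 'Review'], encoding="cp1251").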
