Getting "ParserError" when I try to read a .txt file using pd.read_csv

I am trying to convert this dataset, COCOMO81, to .arff.

Before converting to .arff, I am trying to convert it to .csv.

I am following this LINK to do this.

I got the dataset from the PROMISE repository. I copied the entire page into Notepad, saved it as cocomo81.txt, and am now trying to convert that cocomo81.txt file to .csv using Python. (I intend to convert the .csv file to .arff later using Weka.)

However, when I run

import pandas as pd
read_file = pd.read_csv(r"cocomo81.txt")

I get THIS ParserError.

To fix this, I followed this solution and modified my command to

read_file = pd.read_csv(r"cocomo81.txt",on_bad_lines='warn')

I got a bunch of warnings - you can see what it looks like here

and then I ran read_file.to_csv(r'.\cocomo81csv.csv',index=None)

But it seems that the fix for ParserError didn't work in my case because my cocomo81csv.csv file looks like THIS in Excel.

Can someone please help me understand where I am going wrong, and how I can use datasets from the PROMISE repository in .arff format?

CodePudding user response：

You first need to parse the .txt file. The column names are the words that follow each @attribute keyword:

@attribute rely numeric
@attribute data numeric
@attribute cplx numeric
@attribute time numeric
..............................

Put only the rows that come after the @data line, at the end of the file, into the CSV file. You can just copy and paste them.

0.88,1.16,0.7,1,1.06,1.15,1.07,1.19,1.13,1.17,1.1,1,1.24,1.1,1.04,113,2040
0.88,1.16,0.85,1,1.06,1,1.07,1,0.91,1,0.9,0.95,1.1,1,1,293,1600
1,1.16,0.85,1,1,0.87,0.94,0.86,0.82,0.86,0.9,0.95,0.91,0.91,1,132,243
0.75,1.16,0.7,1,1,0.87,1,1.19,0.91,1.42,1,0.95,1.24,1,1.04,60,240
...................................................................

Then read the resulting CSV file, passing the column names explicitly:

pd.read_csv(file, names=["rely", "data", "cplx", ...])
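If copying by hand gets tedious, the same split can be done in code. The following is only a minimal sketch using the standard library: the function name split_arff and the shortened two-column sample are made up for illustration, and it assumes simple attribute names without quotes or spaces, which is what the excerpt above shows.

```python
import csv
import io


def split_arff(text):
    """Return (column_names, data_rows) from ARFF-style text (hypothetical helper)."""
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comment lines
        if in_data:
            rows.append(line.split(","))  # everything after @data is a CSV row
        elif line.lower().startswith("@attribute"):
            names.append(line.split()[1])  # second token is the column name
        elif line.lower().startswith("@data"):
            in_data = True
    return names, rows


# Shortened, made-up sample in the same layout as the downloaded page
sample = """% comment line
@relation cocomo81
@attribute rely numeric
@attribute data numeric
@data
0.88,1.16
1,1.16
"""

names, rows = split_arff(sample)

# Write header plus rows as CSV (to a string here; a real file works the same way)
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(names)
writer.writerows(rows)
csv_text = out.getvalue()
```

For the real file, read its contents with open("cocomo81.txt").read() and write to a file opened with newline="" instead of the StringIO buffer.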

CodePudding user response：

It looks like a CSV file with a block of header lines at the top. Lines starting with % are comments, and the lines starting with @ (@attribute, @data) are declarations rather than data, which matches the ARFF layout. The actual CSV data starts at line 230.

You should skip those first rows and set the column names manually. Try something like this:

import pandas as pd

# set column names manually
col_names = ["rely", "data", "cplx", "time", "stor", "virt", "turn", "acap", "aexp", "pcap", "vexp", "lexp", "modp", "tool", "sced", "loc", "actual"]
filename = "cocomo81.arff.txt"

# read csv data, skipping the 229 header lines before the first data row
df = pd.read_csv(filename, skiprows=229, sep=',', decimal='.', header=None, names=col_names)

print(df)
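The hard-coded skiprows=229 only works if your copy of the file has exactly that many header lines. As a hedged alternative, the number of lines to skip can be computed by looking for the @data marker. This is a sketch, not a tested recipe for the real file: rows_before_data is a hypothetical helper name and the sample lines are made up.

```python
def rows_before_data(lines):
    """Count the lines up to and including the @data marker (hypothetical helper)."""
    for i, line in enumerate(lines):
        if line.strip().lower().startswith("@data"):
            return i + 1  # skip the @data line itself as well
    raise ValueError("no @data line found")


# Made-up sample standing in for the lines of the downloaded file
sample_lines = [
    "% header comment",
    "@relation cocomo81",
    "@attribute rely numeric",
    "@data",
    "0.88,1.16",
]
skip = rows_before_data(sample_lines)
```

With the real file you could pass an open file object, for example skip = rows_before_data(open(filename)), and then call pd.read_csv with skiprows=skip in place of 229. Adding comment='%' to the read_csv call should also drop any % comment lines that appear after @data.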