I am trying to convert this dataset: COCOMO81 to arff.
Before converting to .arff, I am trying to convert it to .csv
I am following this LINK to do this.
I got the dataset from the PROMISE repository. I copied the entire page into Notepad as cocomo81.txt, and now I am trying to convert that cocomo81.txt file to .csv using Python. (I intend to convert the .csv file to .arff later using Weka.)
However, when I run
import pandas as pd
read_file = pd.read_csv(r"cocomo81.txt")
I get THIS ParserError.
To fix this, I followed this solution and modified my command to
read_file = pd.read_csv(r"cocomo81.txt",on_bad_lines='warn')
I got a bunch of warnings - you can see what it looks like here
and then I ran
read_file.to_csv(r'.\cocomo81csv.csv',index=None)
But it seems that the fix for ParserError didn't work in my case because my cocomo81csv.csv file looks like THIS in Excel.
Can someone please help me understand where I am going wrong, and how I can use datasets from the PROMISE repository in .arff format?
CodePudding user response:
You first need to parse the txt file. The column names can be taken from the lines that start with @attribute:
@attribute rely numeric
@attribute data numeric
@attribute cplx numeric
@attribute time numeric
..............................
In the csv file, keep only the rows that come after @data, which is near the end of the file. You can just copy/paste them:
0.88,1.16,0.7,1,1.06,1.15,1.07,1.19,1.13,1.17,1.1,1,1.24,1.1,1.04,113,2040
0.88,1.16,0.85,1,1.06,1,1.07,1,0.91,1,0.9,0.95,1.1,1,1,293,1600
1,1.16,0.85,1,1,0.87,0.94,0.86,0.82,0.86,0.9,0.95,0.91,0.91,1,132,243
0.75,1.16,0.7,1,1,0.87,1,1.19,0.91,1.42,1,0.95,1.24,1,1.04,60,240
...................................................................
And then read the resulting csv file
pd.read_csv(file, names=["rely", "data", "cplx", ...])
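The two manual steps above (taking the names from the @attribute lines, keeping only the rows after @data) could also be scripted. Here is a minimal sketch using only the standard library, assuming the file follows the layout shown above. The function name arff_text_to_csv and the shortened three-column sample are made up for illustration and are not the real dataset:

```python
import csv
import io


def arff_text_to_csv(text):
    """Return CSV text built from the @attribute names and the rows after @data."""
    names = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and % comment lines.
        if not line or line.startswith("%"):
            continue
        lowered = line.lower()
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
        elif lowered.startswith("@attribute"):
            # "@attribute rely numeric" -> "rely"
            names.append(line.split()[1])
        elif lowered.startswith("@data"):
            in_data = True
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()


# Shortened, illustrative sample (assumption: same layout as cocomo81.txt).
sample = """% comment line
@relation cocomo81
@attribute rely numeric
@attribute data numeric
@attribute cplx numeric
@data
0.88,1.16,0.7
1,1.16,0.85
"""

csv_text = arff_text_to_csv(sample)
print(csv_text)
```

For the real file, the text could be read with open("cocomo81.txt").read() and the returned string written out to a .csv file, which pd.read_csv should then load without the names argument because the header row is already there.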
CodePudding user response:
Looks like it's a csv file with comments as the first lines. The comment lines are indicated by % characters, but also @ (?), and the actual csv data starts at line 230.
You should skip the first rows and set the column names manually. Try something like this:
import pandas as pd

# set column names manually
col_names = ["rely", "data", "cplx", "time", "stor", "virt", "turn", "acap", "aexp", "pcap", "vexp", "lexp", "modp", "tool", "sced", "loc", "actual"]
filename = "cocomo81.arff.txt"
# read csv data, skipping the 229 header/comment lines before the data
df = pd.read_csv(filename, skiprows=229, sep=',', decimal='.', header=None, names=col_names)
print(df)
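The hard-coded skiprows=229 only holds for one exact copy of the file. A possible alternative, sketched here under the assumption that the data rows start right after a line beginning with @data, is to locate that marker and skip everything up to and including it. The helper name count_rows_to_skip and the sample lines are hypothetical:

```python
def count_rows_to_skip(lines):
    """Return how many leading lines to skip so reading starts right after @data."""
    for index, line in enumerate(lines):
        if line.strip().lower().startswith("@data"):
            # Skip the @data line itself as well as everything before it.
            return index + 1
    raise ValueError("no @data line found")


# Illustrative sample only (assumption: same layout as the real file).
sample_lines = [
    "% comment line",
    "@relation cocomo81",
    "@attribute rely numeric",
    "@data",
    "0.88,1.16,0.7",
]

skip = count_rows_to_skip(sample_lines)
print(skip)
```

With the real file, the lines could come from open(filename).readlines(), and the result could be passed as skiprows to pd.read_csv in place of 229.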