How to append multiple columns of a huge dataset in a CSV file to a data frame


I have a dataset in a CSV file and I need to append some of its columns to a list. The dataset is very large: 2,222,678 rows. Here is my code for appending its columns to lists. However, it gets stuck and my computer slows down. I would appreciate it if anyone could tell me whether there is a faster way to get this huge dataset's columns into lists.

import pandas as pd

dataset = pd.read_csv(r'/Users/ha/GTruth/sumo-output-w/file_name.csv')

input_data = []
output_data = []
for line in range(len(dataset)):
    input_data.append(dataset.loc[line, ["pos_x", "pos_y"]])
    output_data.append(dataset.loc[line, ["labels"]])

CodePudding user response:

You can make a list from a pandas column (or Series) pretty easily.

labels = list(dataset.labels)
pos_x = list(dataset.pos_x)
pos_y = list(dataset.pos_y)

input_data.extend(pos_x)
input_data.extend(pos_y)
output_data.extend(labels)
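
As a side note, pandas Series also have a tolist() method that does the same conversion and is arguably the more idiomatic spelling:

labels = dataset["labels"].tolist()
pos_x = dataset["pos_x"].tolist()
pos_y = dataset["pos_y"].tolist()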

CodePudding user response:

It seems you are appending your column data row by row, hence the long runtime. I would suggest converting those columns into lists instead, like so:

input_data_x = list(dataset["pos_x"])
input_data_y = list(dataset["pos_y"])

If you want (x, y) tuples, you can zip the two lists. This still builds 2.2 million tuples, so it takes a moment, but it remains far faster than the row-by-row loop:

input_data = list(zip(input_data_x, input_data_y))

For the output data:

output_data = list(dataset["labels"])

Additionally, I do not know what this data will be used for, but numpy array operations are generally much faster than equivalent Python list operations on numeric data, so I would suggest using numpy arrays.
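
For example, a minimal sketch of that approach, assuming the same column names as above (to_numpy() is the standard pandas way to get the underlying array, and usecols avoids loading columns you do not need):

# Load only the three columns that are actually needed.
dataset = pd.read_csv(r'/Users/ha/GTruth/sumo-output-w/file_name.csv',
                      usecols=["pos_x", "pos_y", "labels"])

input_data = dataset[["pos_x", "pos_y"]].to_numpy()  # array of shape (n_rows, 2)
output_data = dataset["labels"].to_numpy()           # array of shape (n_rows,)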
