Improving time efficiency of code, working with a big Data Set - Python


I’m trying to figure out a way to optimize my code and reduce, ideally by a lot, the time it takes to run through the whole data set.

I’m working with a simple .csv file that has 3 columns: time_UTC, vmag2D and vdir. The data set has around 1,420,000 lines. I wrote this simple for loop and it took around 15–20 minutes to run. I’m running this on my Mac with the M1 processor, so I suspect I’ve complicated something somewhere, because I don’t believe the processor is so "bad" that it can't run this little piece of code faster. If anyone has a suggestion on how I can improve it, please let me know!

import pandas as pd

path_data = '" *insert a path here* "'
file = path_data   ' *name of the .csv file* '

data = pd.read_csv(file)

time_UTC = []
vmag2D = []
vdir = []

for i in range(len(data)):
    x = data.iloc[i][0]      # whole row as one string, e.g. '1994-01-01 00:05:00;0.52;193'
    x1 = x.split(' ')        # ['1994-01-01', '00:05:00;0.52;193']
    x2 = x1[1].split(';')    # ['00:05:00', '0.52', '193']
    date = x1[0]             # '1994-01-01'
    time_UTC.append(x2[0])
    vmag2D.append(x2[1])
    vdir.append(x2[2])

The code parses each line of the .csv file, and every line follows the same template: '1994-01-01 00:05:00;0.52;193'
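For reference, this is what the two splits produce for one sample line (variable names here are just for illustration):

line = '1994-01-01 00:05:00;0.52;193'
date, rest = line.split(' ')            # rest is '00:05:00;0.52;193'
time_utc, vmag, vdir = rest.split(';')
print(date, time_utc, vmag, vdir)       # 1994-01-01 00:05:00 0.52 193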

Thank you for any help!

Cheers!

CodePudding user response:

You shouldn't need any kind of for loop for this. You are already reading the CSV with pandas, but you don't seem to be passing the right parameters.

import pandas as pd

path_data = '" *insert a path here* "'
file = path_data   ' *name of the .csv file* '

# split on ';' and parse the first column as datetimes
df = pd.read_csv(file, sep=';', parse_dates=[0], engine='c', header=None)

time_UTC = df.iloc[:, 0]   # datetime column
vmag2D = df.iloc[:, 1]
vdir = df.iloc[:, 2]

If you run this, the resulting variables (time_UTC, ...) will be of type pandas.Series. You can convert them to lists with .to_list() or access the underlying numpy array using .values.
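For example, a minimal sketch reusing the Series from the snippet above:

vmag2D_list = vmag2D.to_list()   # plain Python list
vmag2D_arr = vmag2D.values       # underlying numpy array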

Note that I am specifying engine='c' in the pandas CSV parser; this uses the native C parser, which is faster than its Python equivalent, and that matters because you are processing a large file.
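If you want to check the difference yourself, here is a quick timing sketch (it reuses the file variable from above and assumes the same read parameters):

import time
import pandas as pd

start = time.perf_counter()
df = pd.read_csv(file, sep=';', parse_dates=[0], engine='c', header=None)
print(f"engine='c':      {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
df = pd.read_csv(file, sep=';', parse_dates=[0], engine='python', header=None)
print(f"engine='python': {time.perf_counter() - start:.2f} s")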

CodePudding user response:

You can split the entire column at once:

import pandas as pd

# dummy DataFrame with each row stored as one string, mimicking the file
df = pd.DataFrame({"all": ["1994-01-01 00:05:00;0.52;193"] * 1000})

# split at space " "
df[["date", "time vmag vdir"]] = df["all"].str.split(" ", expand=True)

# split at ";"
df[["time", "vmag2D", "vdir"]] = df['time vmag vdir'].str.split(';', expand=True)

date = pd.to_datetime(df["date"]).to_list()
time_UTC = pd.to_datetime(df["time"]).to_list()
vmag2D = pd.to_numeric(df["vmag2D"]).to_list()
vdir = pd.to_numeric(df["vdir"]).to_list()
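To apply this to the real file instead of the dummy DataFrame, you could read every row in as one string first; a minimal sketch, assuming (as in the question's code) that the default comma separator leaves each row in a single column, and that the file has no header row (adjust header= if it does):

import pandas as pd

file = ' *name of the .csv file* '                  # same placeholder as above
df = pd.read_csv(file, header=None, names=["all"])  # one column holding the raw lines

# then reuse the same two .str.split(..., expand=True) calls from above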