Does GPU accelerate data preprocessing in ML tasks?


I am doing a machine learning (value prediction) task, and preprocessing the data takes a very long time. I have a CSV file with around 640,000 rows, and for each row I want to subtract the date of the previous row to get the time duration. The CSV file looks like the attached image. For example, 2011-08-17 to 2011-08-19 is 2 days, and I would like to write 2 to the "time duration" column. I've used Python's datetime functions to do this, and it costs a lot of time.

import pandas as pd
from datetime import datetime

data = pd.read_csv(f'{proj_dir}/raw data/measures.csv', encoding="cp1252")

file = data[['ID', 'date', 'value1', 'value2', 'duration']]

def time_subtraction(date, prev_date):
  diff = datetime.strptime(date, '%Y-%m-%d') - datetime.strptime(prev_date, '%Y-%m-%d')
  diff_days = diff.days
  return diff_days

def calculate_time_duration(dataframe, set_0_indices):
  for i in range(dataframe.shape[0]):
    # For each patient, sets "Time Duration" at the first measurement to be 0
    if i in set_0_indices.values:
      dataframe.iloc[i, 4] = 0 # set time duration to 0 (beginning of this patient)
    else: # time subtraction
      dataframe.iloc[i, 4] = time_subtraction(date=dataframe.iloc[i, 1], prev_date=dataframe.iloc[i-1, 1])
  return dataframe

# I am running on Google Colab. This line takes very long.
result = calculate_time_duration(dataframe = file, set_0_indices = set_time_0_indices)

I wonder if there are any ways to accelerate this process. Does using a GPU help? I have access to a remote GPU, but I don't know whether a GPU helps with data preprocessing. By the way, in what scenarios can GPUs really make things faster? Thanks in advance!

(Attached image: what my data looks like)

CodePudding user response:

Regarding updating your data in a faster fashion, please see this post. Regarding speed improvements from the GPU: a GPU only helps if there are optimized operations that can actually run on the GPU, and row-by-row preprocessing like yours is normally not in that scope. You would also have to transfer the data to the GPU before computing anything and then transfer the results back; for an operation as simple as yours, that overhead would cost more time than the computation saves. Using vectorized pandas operations instead of a Python loop should give you the speedup you are after in preprocessing - see the sketch below.
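For reference, here is a minimal sketch of what a vectorized version could look like. It assumes the column names from your question ('ID', 'date', 'duration'), that `file` was built as in your code, and that rows are already ordered by date within each patient; adjust the names if your data differs. On a few hundred thousand rows this kind of column-wise code typically finishes in well under a second on a CPU, so no GPU is needed.

import pandas as pd

# Assumes 'file' was built as in the question:
# file = data[['ID', 'date', 'value1', 'value2', 'duration']]

# Parse the whole date column once instead of calling strptime row by row
file['date'] = pd.to_datetime(file['date'], format='%Y-%m-%d')

# Difference to the previous row in whole days, computed per patient ID;
# the first measurement of each patient has no predecessor and becomes 0
file['duration'] = file.groupby('ID')['date'].diff().dt.days.fillna(0).astype(int)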
