I'm trying to change the date format of a column in a CSV. I know there are easier ways to do this, but the goal here is to get the threads working properly. I work with Spyder and Python 3.8. My code works as follows:
- I create a thread class with a function to change the date format
- I split my dataframe in several dataframes according to the number of threads
- I assign to each thread a part of the dataframe
- each thread changes the date formats in its dataframe
- at the end, I concatenate all the dataframes into one
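The split-and-rejoin steps above can be sketched on their own, using a hypothetical toy frame in place of the real CSV:

```python
import pandas as pd

# Hypothetical 8-row frame standing in for "serie"
serie = pd.DataFrame({"Date": ["01/01/2020"] * 8})
nb_thread = 3
chunk = serie.shape[0] // nb_thread

# Contiguous slices; the last one absorbs the remainder rows
dataframes = [serie.iloc[j * chunk:(j + 1) * chunk] if j != nb_thread - 1
              else serie.iloc[j * chunk:]
              for j in range(nb_thread)]

# Once every worker has edited its slice, rebuild the full frame
dataframe_finale = pd.concat(dataframes)
```

Because each slice covers a contiguous index range, `pd.concat` restores the original row order.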
"serie" is my original dataframe. Here is my code:
import pandas as pd
import numpy as np
import threading
import time
from datetime import datetime
from threading import Thread
from time import process_time
serie=pd.read_csv('XXX.csv')
in_format = "%d/%m/%Y"
out_format = "%Y-%m-%d"
class MonThread(threading.Thread):
    def __init__(self, num_thread):
        threading.Thread.__init__(self)
        self.num_thread = num_thread

    # Thread function: rewrite every date in this thread's slice
    def run(self):
        df = dataframes[self.num_thread]
        for i in range(df.index[0], df.index[0] + df.shape[0]):
            date_formatee = datetime.strptime(df.loc[i, 'Date'], in_format).strftime(out_format)
            df.loc[i, 'Date'] = date_formatee
nb_thread = 80
dataframes = []
# Split the dataframe into nb_thread contiguous slices
for j in range(nb_thread):
    a = j * (serie.shape[0] // nb_thread)
    if j != nb_thread - 1:
        b = (j + 1) * (serie.shape[0] // nb_thread)
        df = serie.iloc[a:b, :]
    else:
        df = serie.iloc[a:, :]
        b = serie.shape[0]
    dataframes.append(df)
    print("Interval", j, ": [", a, ",", b, "]")
tps1 = process_time()
print(tps1)
threads = []
for n in range(nb_thread):
    t = MonThread(n)
    t.start()
    threads.append(t)
for t in threads:
    t.join()
dataframe_finale = pd.concat(dataframes)
print("\n\n\n")
tps2 = process_time()
print(tps2)
print("execution time: ")
print(tps2 - tps1)
It works, but the execution time seems quite long: for a total of 100,000 values it takes about 1 min 30 s with no threads, about 30 seconds with 80 threads, and with 200 or 400 threads it stagnates at 30 seconds. Is my code bad, or am I limited by something?
CodePudding user response:
Have you tried just letting Pandas do the work over the series?
import pandas as pd
df = pd.read_csv('XXX.csv')
in_format = "%d/%m/%Y"
out_format = "%Y-%m-%d"
df['Date'] = pd.to_datetime(df['Date'], format=in_format).dt.strftime(out_format)
On my Macbook, this processes a million entries in 5 seconds.
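As a quick sanity check, here is the same call on a couple of hypothetical rows (not the asker's data):

```python
import pandas as pd

in_format = "%d/%m/%Y"
out_format = "%Y-%m-%d"

# Hypothetical sample standing in for XXX.csv
df = pd.DataFrame({"Date": ["31/01/2020", "05/06/2021"]})
df["Date"] = pd.to_datetime(df["Date"], format=in_format).dt.strftime(out_format)
# df["Date"] now holds ["2020-01-31", "2021-06-05"]
```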
Another way to do the same (without date validation, though), is
df['Date'] = df['Date'].str.replace(r"(\d+)/(\d+)/(\d+)", r"\3-\2-\1", regex=True)
which finishes the job in about 3.3 seconds.
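On a couple of hypothetical rows, the regex variant behaves like this:

```python
import pandas as pd

# Hypothetical sample rows; the regex just reorders the digit groups
df = pd.DataFrame({"Date": ["31/01/2020", "05/06/2021"]})
df["Date"] = df["Date"].str.replace(r"(\d+)/(\d+)/(\d+)", r"\3-\2-\1", regex=True)
# df["Date"] now holds ["2020-01-31", "2021-06-05"] -- but note that
# a row like "5/6/2021" would become "2021-6-5": there is no
# zero-padding and no check that the input is a real date
```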