I have around 150 CSV files in the following format:
Product Name | Cost | Manufacturer | Country |
---|---|---|---|
P_0 | 5 | Pfizer | Finland |
P_1 | 10 | BioNTech | Sweden |
P_2 | 12 | Pfizer | Denmark |
P_3 | 11 | J&J | Finland |
Each CSV represents daily data. So the file for the previous date would look like:
Product Name | Cost | Manufacturer | Country |
---|---|---|---|
P_0 | 7 | Pfizer | Finland |
P_1 | 15 | BioNTech | Sweden |
P_2 | 17 | Pfizer | Denmark |
P_3 | 10 | J&J | Finland |
I would like to create a time series dataset where I can track the price of a product given a manufacturer in a given country over time.
So, for example, I want to be able to show the price development of product P_1, made by BioNTech in Sweden, as:
Date | Price |
---|---|
17/10/2022 | 15 |
18/10/2022 | 10 |
My attempt:
Each CSV has the date as part of its name (e.g., 'data_17-10-2022'). So I have created a list containing the paths to all of the CSV files, and then I iterate through this list, convert each CSV to a pandas DataFrame, add it to a list, concatenate everything, and finally perform a groupby operation.
```python
import re
from datetime import datetime

import pandas as pd

def create_ts(data):
    df_list = []
    for file in data:
        match = re.search(r'\d{2}-\d{2}-\d{4}', file)  # get date from file name
        date = datetime.strptime(match.group(), '%d-%m-%Y').date()
        df = pd.read_csv(file, sep=";")
        df["date"] = date  # create a new column in each df that contains the date
        df_list.append(df)
    return df_list

df_concat = pd.concat(create_ts(my_files))
df_group = df_concat.groupby(["Manufacturer", "Country", "Product Name"])
```
This returns what I am after. However, it is very slow: when I tried it for a random country, manufacturer and product name, it took nearly 10 minutes to run.
The problem (I think) is that each CSV is approximately 40 MB (180,000 rows and 20 columns, of which I drop around 10 irrelevant columns).
Is there anything I can do to speed this up? I tried installing modin, but I got an error saying I need Visual C++ v.14, and my work computer does not allow me to install programs without going through a very long process with the IT department.
CodePudding user response:
Fundamentally your reading approach is fine: as far as I know, reading and then concatenating the dataframes is the best approach. There are some marginal improvements you can get if you use the `usecols` and `dtype` parameters in `read_csv`, but this is very dependent on what your data looks like:
Method | Time (s) | Relative |
---|---|---|
Original | 0.1512 | 1.59 |
Only load columns you need | 0.0968 | 1.02 |
Use dtype parameter | 0.0950 | 1.00 |
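For reference, a minimal sketch of what that would look like for your files. The column names and dtypes here are assumptions based on the sample data in your question, and the file name is just a placeholder, so adjust both to the real files:

```python
import numpy as np
import pandas as pd

# Assumed: only these four columns are needed; names are taken from the sample data.
wanted_cols = ["Product Name", "Cost", "Manufacturer", "Country"]
col_dtypes = {"Product Name": str, "Cost": np.float64, "Manufacturer": str, "Country": str}

# Hypothetical file name matching the naming pattern described in the question.
df = pd.read_csv("data_17-10-2022.csv", sep=";", usecols=wanted_cols, dtype=col_dtypes)
```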
I think to get a significant performance improvement you probably want to look at caching at some point in the process, as dankal444 mentions.
What you cache depends on how the data is changing, but assuming the files do not change once you have received them, I would probably cache the loaded dataframe together with the set of included files, something like:
```python
import pickle

dst = './fastreading.pkl'
contained_files = set()  # paths of the CSV files already folded into df

# Save the set of loaded files together with the concatenated dataframe...
with open(dst, 'wb') as f:
    pickle.dump((contained_files, df), f)

# ...and read both back on the next run.
with open(dst, 'rb') as f:
    contained_files2, df2 = pickle.load(f)
```
You could then check whether a file is already in the set of contained files during your loading process. I am using pickle here, but there are other, faster ways of loading/saving dataframes; there is some benchmark data here.
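As an illustration only, here is a sketch of how that check could be wired into the loading step. The `load_one_csv` helper is hypothetical and stands in for the per-file `read_csv` logic inside your `create_ts`:

```python
import os
import pickle

import pandas as pd

dst = './fastreading.pkl'

def load_cached(dst):
    # Return the previously cached (files, dataframe) pair, or an empty cache.
    if os.path.exists(dst):
        with open(dst, 'rb') as f:
            return pickle.load(f)
    return set(), None

def build_dataframe(my_files, load_one_csv):
    contained_files, df = load_cached(dst)
    new_files = [f for f in my_files if f not in contained_files]
    if new_files:
        # Only read CSVs that are not yet in the cache, then append them.
        new_frames = [load_one_csv(f) for f in new_files]
        df = pd.concat(new_frames) if df is None else pd.concat([df] + new_frames)
        contained_files.update(new_files)
        with open(dst, 'wb') as f:
            pickle.dump((contained_files, df), f)
    return df
```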
If you are worried that the files will change, you could include a timestamp or a checksum in your contained files list.
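For example (a sketch, not tested against your data), you could store a fingerprint per file instead of just the path, so a modified file is treated as new:

```python
import hashlib
import os

def file_fingerprint(path):
    # Combine the path with an MD5 checksum of the contents; use
    # os.path.getmtime(path) instead if a timestamp is good enough.
    with open(path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return (path, digest)
```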
The other thing I would recommend is running a profiler. That should give you a good idea of where the time is spent.
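A quick way to do that with the standard library (a sketch that assumes `create_ts` and `my_files` from your question are in scope):

```python
import cProfile
import pstats

# Profile the loading step and show the 20 most expensive calls by cumulative time.
cProfile.run('pd.concat(create_ts(my_files))', 'loading.prof')
pstats.Stats('loading.prof').sort_stats('cumulative').print_stats(20)
```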
`read_csv` test code:
```python
import timeit

import numpy as np
import pandas as pd

iterations = 10
item_count = 5000
path = './fasterreading.csv'

# Build a CSV with 26 float columns named 'a'..'z'.
data = {c: [i / 2 for i in range(item_count)] for c in [chr(c) for c in range(ord('a'), ord('z') + 1)]}
dtypes = {c: np.float64 for c in data.keys()}
df = pd.DataFrame(data)
df.to_csv(path)

# attempt to negate file system caching effect
timeit.timeit(lambda: pd.read_csv(path), number=5)

t0 = timeit.timeit(lambda: pd.read_csv(path), number=iterations)
t1 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c']), number=iterations)
t2 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c'], dtype=dtypes), number=iterations)
tmin = min(t0, t1, t2)

print('| Method | Time (s) | Relative |')
print('|---|---|---|')
print(f'| Original | {t0} | {t0 / tmin} |')
print(f'| Only load columns you need | {t1} | {t1 / tmin} |')
print(f'| Use dtype parameter | {t2} | {t2 / tmin} |')
```