I have around 150 CSV files in the following format:
Product Name | Cost | Manufacturer | Country |
---|---|---|---|
P_0 | 5 | Pfizer | Finland |
P_1 | 10 | BioNTech | Sweden |
P_2 | 12 | Pfizer | Denmark |
P_3 | 11 | J&J | Finland |
Each CSV represents daily data. So the file for the previous date would look like:
Product Name | Cost | Manufacturer | Country |
---|---|---|---|
P_0 | 7 | Pfizer | Finland |
P_1 | 15 | BioNTech | Sweden |
P_2 | 17 | Pfizer | Denmark |
P_3 | 10 | J&J | Finland |
I would like to create a time series dataset where I can track the price of a product given a manufacturer in a given country over time.
So, for example, I want to be able to show the price development of product P_1, made by BioNTech in Sweden, as:
Date | Price |
---|---|
17/10/2022 | 15 |
18/10/2022 | 10 |
My attempt:
Each CSV has the date as part of its name (e.g., 'data_17-10-2022'). So I have created a list containing the paths to all of the CSV files, and then I iterate through this list, convert each CSV to a pandas DataFrame, add it to a list, concatenate everything, and finally perform a groupby operation.
```python
import re
from datetime import datetime

import pandas as pd

def create_ts(data):
    df_list = []
    for file in data:
        match = re.search(r'\d{2}-\d{2}-\d{4}', file)  # get date from file name
        date = datetime.strptime(match.group(), '%d-%m-%Y').date()
        df = pd.read_csv(file, sep=";")
        df["date"] = date  # create a new column in each df that contains the date
        df_list.append(df)
    return df_list

df_concat = pd.concat(create_ts(my_files))
df_group = df_concat.groupby(["Manufacturer", "Country", "Product Name"])
```
This returns what I am after. However, it is very slow: when I tried it for a random country, manufacturer and product name, it took nearly 10 minutes to run.
The problem (I think) is that each CSV is approximately 40 MB (180,000 rows and 20 columns, of which I drop around 10 irrelevant columns).
Is there anything I can do to speed this up? I tried installing modin, but I got an error saying I need Visual C++ v.14, and my work computer does not allow me to install programs without going through a very long process with the IT department.
CodePudding user response:
Fundamentally your reading approach is fine: as far as I know, reading and then concatenating the dataframes is the best approach. There are some marginal improvements you can get if you use the `usecols` and `dtype` parameters in `read_csv`, but this is very dependent on what your data looks like:
Method | Time (s) | Relative |
---|---|---|
Original | 0.1512 | 1.59 |
Only load columns you need | 0.0968 | 1.02 |
Use dtype parameter | 0.0950 | 1.00 |
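For reference, a minimal sketch of what that would look like for your files. The column names and dtypes here are assumptions based on the sample data in your question, and the file name is just a placeholder, so adjust both to the real files:

```python
import numpy as np
import pandas as pd

# Assumed: only these four columns are needed; names are taken from the sample data.
wanted_cols = ["Product Name", "Cost", "Manufacturer", "Country"]
col_dtypes = {"Product Name": str, "Cost": np.float64, "Manufacturer": str, "Country": str}

# Hypothetical file name matching the naming pattern described in the question.
df = pd.read_csv("data_17-10-2022.csv", sep=";", usecols=wanted_cols, dtype=col_dtypes)
```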
I think to get a significant performance improvement you probably want to look at caching at some point in the process, as dankal444 mentions.
What you cache depends on how the data is changing, but assuming the files do not change once you have received them, I would probably cache the loaded dataframe together with the set of included files, something like:
```python
import pickle

dst = './fastreading.pkl'
contained_files = set()  # paths of the CSV files already folded into df

# Save the set of loaded files together with the concatenated dataframe...
with open(dst, 'wb') as f:
    pickle.dump((contained_files, df), f)

# ...and read both back on the next run.
with open(dst, 'rb') as f:
    contained_files2, df2 = pickle.load(f)
```
You could then check whether a file is already in the set of contained files during your loading process. I am using pickle here, but there are other, faster ways of loading/saving dataframes; there is some benchmark data here.
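As an illustration only, here is a sketch of how that check could be wired into the loading step. The `load_one_csv` helper is hypothetical and stands in for the per-file `read_csv` logic inside your `create_ts`:

```python
import os
import pickle

import pandas as pd

dst = './fastreading.pkl'

def load_cached(dst):
    # Return the previously cached (files, dataframe) pair, or an empty cache.
    if os.path.exists(dst):
        with open(dst, 'rb') as f:
            return pickle.load(f)
    return set(), None

def build_dataframe(my_files, load_one_csv):
    contained_files, df = load_cached(dst)
    new_files = [f for f in my_files if f not in contained_files]
    if new_files:
        # Only read CSVs that are not yet in the cache, then append them.
        new_frames = [load_one_csv(f) for f in new_files]
        df = pd.concat(new_frames) if df is None else pd.concat([df] + new_frames)
        contained_files.update(new_files)
        with open(dst, 'wb') as f:
            pickle.dump((contained_files, df), f)
    return df
```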
If you are worried that the files will change, you could include a timestamp or a checksum in your contained files list.
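For example (a sketch, not tested against your data), you could store a fingerprint per file instead of just the path, so a modified file is treated as new:

```python
import hashlib
import os

def file_fingerprint(path):
    # Combine the path with an MD5 checksum of the contents; use
    # os.path.getmtime(path) instead if a timestamp is good enough.
    with open(path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return (path, digest)
```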
The other thing I would recommend is running a profiler. That should give you a good idea of where the time is spent.
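A quick way to do that with the standard library (a sketch that assumes `create_ts` and `my_files` from your question are in scope):

```python
import cProfile
import pstats

# Profile the loading step and show the 20 most expensive calls by cumulative time.
cProfile.run('pd.concat(create_ts(my_files))', 'loading.prof')
pstats.Stats('loading.prof').sort_stats('cumulative').print_stats(20)
```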
`read_csv` test code:
```python
import timeit

import numpy as np
import pandas as pd

iterations = 10
item_count = 5000
path = './fasterreading.csv'

# Build a CSV with 26 float columns named 'a'..'z'.
data = {c: [i / 2 for i in range(item_count)] for c in [chr(c) for c in range(ord('a'), ord('z') + 1)]}
dtypes = {c: np.float64 for c in data.keys()}
df = pd.DataFrame(data)
df.to_csv(path)

# attempt to negate file system caching effect
timeit.timeit(lambda: pd.read_csv(path), number=5)

t0 = timeit.timeit(lambda: pd.read_csv(path), number=iterations)
t1 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c']), number=iterations)
t2 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c'], dtype=dtypes), number=iterations)
tmin = min(t0, t1, t2)

print('| Method | Time (s) | Relative |')
print('|---|---|---|')
print(f'| Original | {t0} | {t0 / tmin} |')
print(f'| Only load columns you need | {t1} | {t1 / tmin} |')
print(f'| Use dtype parameter | {t2} | {t2 / tmin} |')
```