Efficient way of concatenating and grouping pandas dataframes


I have around 150 CSV files on the following format:

Product Name    Cost    Manufacturer    Country
P_0             5       Pfizer          Finland
P_1             10      BioNTech        Sweden
P_2             12      Pfizer          Denmark
P_3             11      J&J             Finland

Each CSV represents daily data. So the file for the previous date would look like:

Product Name    Cost    Manufacturer    Country
P_0             7       Pfizer          Finland
P_1             15      BioNTech        Sweden
P_2             17      Pfizer          Denmark
P_3             10      J&J             Finland

I would like to create a time series dataset where I can track the price of a product given a manufacturer in a given country over time.

So for example I want to be able to show the price development of product P_1 made by BioNTech in Sweden as:

Date          Price
17/10/2022    15
18/10/2022    10

My attempt:

Each CSV has the date as part of its name (e.g., 'data_17-10-2022'). So I have created a list containing the paths to all of the CSV files; I iterate through this list, convert each CSV to a pandas dataframe, append each of them to a list, concatenate that list, and then perform a groupby operation.

import re
from datetime import datetime

import pandas as pd


def create_ts(data):
    df_list = []

    for file in data:
        match = re.search(r'\d{2}-\d{2}-\d{4}', file)  # get date from file name
        date = datetime.strptime(match.group(), '%d-%m-%Y').date()

        df = pd.read_csv(file, sep=";")
        df["date"] = date  # new column holding the file's date
        df_list.append(df)

    return df_list

df_concat = pd.concat(create_ts(my_files))
df_group = df_concat.groupby(["Manufacturer", "Country", "Product Name"])
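
For reference, this is roughly how I then pull out a single series, e.g. P_1 made by BioNTech in Sweden (group keys in the same order as the groupby above):

# Sketch: extract one group's time series from the grouped dataframe
p1_sweden = df_group.get_group(("BioNTech", "Sweden", "P_1"))
print(p1_sweden.sort_values("date")[["date", "Cost"]])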

This returns what I am after. However, it is very slow (when I tried it for a random country, manufacturer and product name it took nearly 10 minutes to run).

The problem (I think) is that each CSV is approximately 40 MB (180,000 rows and 20 columns, of which I drop around 10 irrelevant columns).

Is there anything I can do to speed this up? I tried installing modin, but I got an error saying I need Visual C++ 14, and my work computer does not allow me to install programs without going through a very long process with the IT department.

CodePudding user response:

Fundamentally your reading approach is fine: as far as I know, reading and then concatenating the dataframes is the best approach. There are some marginal improvements you can get if you use the usecols and dtype parameters in read_csv, but how much depends on what your data looks like (see the sketch after the table):

| Method                     | Time                | Relative           |
|----------------------------|---------------------|--------------------|
| Original                   | 0.1512130000628531  | 1.5909069397118787 |
| Only load columns you need | 0.09676750004291534 | 1.0180876465175188 |
| Use dtype parameter        | 0.09504829999059439 | 1.0                |
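
Applied to your loop, that would look something like this (a sketch; the column names are taken from your sample and the dtype choices are assumptions you would adapt to your real data):

# Sketch: parse only the needed columns and tell the parser their types up front
usecols = ["Product Name", "Cost", "Manufacturer", "Country"]
dtypes = {"Product Name": "string", "Cost": "float64",
          "Manufacturer": "category", "Country": "category"}

df = pd.read_csv(file, sep=";", usecols=usecols, dtype=dtypes)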

I think that to get a significant performance improvement you probably want to look at caching at some point in the process, as dankal444 mentions.

What you cache depends on how the data is changing, but assuming the files do not change once you have received them, I would probably cache the loaded dataframe together with the set of files it contains, something like:

import pickle

dst = './fastreading.pkl'
contained_files = set()  # paths of the files already loaded into df

# df is the concatenated dataframe built from those files
with open(dst, 'wb') as f:
    pickle.dump((contained_files, df), f)

with open(dst, 'rb') as f:
    contained_files2, df2 = pickle.load(f)

You could then check whether a file is already in the set of contained files in your loading process. I am using pickle here, but there are other faster ways of loading/saving dataframes; there is some benchmark data here.
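
A rough sketch of that check, reusing my_files and create_ts from your code:

# Sketch: load the cache if it exists, read only the CSVs that are new,
# then append them to the cached dataframe and rewrite the cache.
import os
import pickle

import pandas as pd

cache_path = './fastreading.pkl'

if os.path.exists(cache_path):
    with open(cache_path, 'rb') as f:
        contained_files, df_all = pickle.load(f)
else:
    contained_files, df_all = set(), None

new_files = [f for f in my_files if f not in contained_files]
if new_files:
    new_dfs = create_ts(new_files)  # list of per-day dataframes
    parts = ([df_all] if df_all is not None else []) + new_dfs
    df_all = pd.concat(parts)
    contained_files.update(new_files)
    with open(cache_path, 'wb') as f:
        pickle.dump((contained_files, df_all), f)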

If you are worried that the files will change, you could include a timestamp or a checksum in your contained files list.
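
For example, a checksum per file could be stored instead of just the path (hashlib.md5 here is just one option; a file modification timestamp would also work):

# Sketch: fingerprint each file so a changed file would be re-read;
# contained_files becomes a path -> checksum mapping instead of a set.
import hashlib

def file_checksum(path):
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

contained_files = {file: file_checksum(file) for file in my_files}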

The other thing I would recommend is running a profiler. That should give you a good idea of where the time is spent.
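
For example, with the standard-library cProfile (a sketch; the context-manager form needs Python 3.8+):

# Sketch: profile the whole pipeline and show the 20 most expensive calls
import cProfile
import pstats

with cProfile.Profile() as pr:
    df_concat = pd.concat(create_ts(my_files))

pstats.Stats(pr).sort_stats('cumulative').print_stats(20)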

read_csv test code:

import pandas as pd
import numpy as np
import timeit

iterations = 10
item_count = 5000

path = './fasterreading.csv'

data = {c: [i/2 for i in range(item_count)] for c in [chr(c) for c in range(ord('a'), ord('z') + 1)]}
dtypes = {c: np.float64 for c in data.keys()}

df = pd.DataFrame(data)
df.to_csv(path)

# attempt to negate file system caching effect
timeit.timeit(lambda: pd.read_csv(path), number=5)

t0 = timeit.timeit(lambda: pd.read_csv(path), number=iterations)
t1 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c']), number=iterations)
t2 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c'], dtype=dtypes), number=iterations)
tmin = min(t0, t1, t2)

print(f'| Method                                  | Time | Relative      |')
print(f'|------------------------------------------|------|---------------|')
print(f'| Original                                | {t0} | {t0 / tmin}   |')
print(f'| Only load columns you need              | {t1} | {t1 / tmin}   |')
print(f'| Use dtype parameter                     | {t2} | {t2 / tmin}   |')
