I would like to transform a numpy nd array with dimension descriptions into a pandas dataframe. The following solution works, but seems a bit slow for 360,000 rows (1.5 s on my machine; your results may differ).
import pandas as pd
import numpy as np
from itertools import product
import time
# preparation of data
nd_data = np.random.random((5, 3, 100, 10, 4, 6))
dimension_descriptions = {
    'floaty': [0.1, 0.2, 0.3, 0.4, 0.5],
    'animal': ['ducks', 'horses', 'elephants'],
    'ramp': range(100),
    'another_ramp': range(10),
    'interesting number': [12, 15, 29, 42],
    'because': ['why', 'is', 'six', 'afraid', 'of', 'seven']
}
t_start = time.time()
# create dataframe from a list of dictionaries containing data and permuted descriptions
df = pd.DataFrame([{**{'data': data}, **dict(zip(dimension_descriptions.keys(), permuted_description))}
                   for data, permuted_description in zip(nd_data.flatten(), product(*dimension_descriptions.values()))])
print(f'elapsed time: {time.time() - t_start:.1f}s')
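For reference, the result has one row per array element:

# shape check: 'data' plus the six description columns
print(df.shape)  # (360000, 7)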
Is there a faster way to achieve the same result?
CodePudding user response:
I put the original way of creating the df into a function and timed it on my machine.
def create_df1(nd_data, dimension_descriptions):
    return pd.DataFrame([{**{'data': data}, **dict(zip(dimension_descriptions.keys(), permuted_description))}
                         for data, permuted_description in zip(nd_data.flatten(), product(*dimension_descriptions.values()))])
%timeit create_df1(nd_data, dimension_descriptions)
991 ms ± 37.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can avoid creating a temporary dict and merging it into a new one for every row by assigning the nd_data to the DataFrame after the permuted descriptions have been created. This gives a slight boost in speed.
def create_df2(nd_data, dimension_descriptions):
    df = pd.DataFrame([dict(zip(dimension_descriptions.keys(), permuted_description))
                       for permuted_description in product(*dimension_descriptions.values())])
    df["data"] = nd_data.flatten()
    return df
%timeit create_df2(nd_data, dimension_descriptions)
822 ms ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you need the data column to be the first column in the dataframe, you can use df.insert(0, "data", nd_data.flatten()) instead, which gets similar speed results on my machine.
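For illustration, a minimal sketch of that variant (the name create_df2_insert is just for this example):

def create_df2_insert(nd_data, dimension_descriptions):
    # build the description columns as before, then put 'data' at position 0
    df = pd.DataFrame([dict(zip(dimension_descriptions.keys(), permuted_description))
                       for permuted_description in product(*dimension_descriptions.values())])
    df.insert(0, "data", nd_data.flatten())
    return df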
It also seems wasteful to create a dict with the same column names for every row. Pandas offers a way to avoid this: pass the column names in as a separate columns argument and the data as a list of row tuples (here, the output of product). This saves a lot of time.
def create_df3(nd_data, dimension_descriptions):
    df = pd.DataFrame(list(product(*dimension_descriptions.values())), columns=dimension_descriptions.keys())
    df["data"] = nd_data.flatten()
    return df
%timeit create_df3(nd_data, dimension_descriptions)
281 ms ± 9.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
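As a sanity check, the faster version should produce the same result as the original (up to column order, since 'data' ends up last here); pandas' testing helper with check_like=True ignores column ordering:

# should pass silently: same values, dtypes, and index, ignoring column order
pd.testing.assert_frame_equal(
    create_df1(nd_data, dimension_descriptions),
    create_df3(nd_data, dimension_descriptions),
    check_like=True,
)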