Looking for a faster way to create a new column in a data frame containing a dictionary values from-CodePudding

Currently I have achieved the title by using .apply() with a lambda function:

calFilteredDf['startTime'] = calFilteredDf['start'].apply(lambda x: x['dateTime'])

This is very slow and I was wondering how I could achieve the same results in less time. calFilteredDf['start'] is a Pandas Series and the data from the 'start' column looks like this:

1       {'date': None, 'dateTime': '2021-08-11T15:00:0...
2       {'date': None, 'dateTime': '2021-08-12T09:30:0...
3       {'date': None, 'dateTime': '2021-08-12T10:00:0...
4       {'date': None, 'dateTime': '2021-08-18T11:00:0...
                              ...                        
1692    {'date': None, 'dateTime': '2023-08-09T14:00:0...
1693    {'date': None, 'dateTime': '2023-08-09T15:00:0...
1694    {'date': None, 'dateTime': '2023-08-10T11:30:0...
1695    {'date': None, 'dateTime': '2023-08-10T16:00:0...
1696    {'date': None, 'dateTime': '2023-08-10T17:00:0...
Name: start, Length: 1697, dtype: object

and the data from the new 'startTime' column needs to look like this:

1       2021-08-11T15:00:00-04:00
2       2021-08-12T09:30:00-04:00
3       2021-08-12T10:00:00-04:00
4       2021-08-18T11:00:00-04:00
                  ...            
1692    2023-08-09T14:00:00-04:00
1693    2023-08-09T15:00:00-04:00
1694    2023-08-10T11:30:00-04:00
1695    2023-08-10T16:00:00-04:00
1696    2023-08-10T17:00:00-04:00
Name: startTime, Length: 1697, dtype: object

Is there a way to do this quickly? I have tried to set

calFilteredDf['startTime'] = calFilteredDf['startTime']['dateTime']

I've also tried using .loc which didn't work because the rows of 'start' aren't the right data type and I've tried using the swifter library to parallelize the process that .apply() is doing but since the dataset is not very large it actually made it slower because of the extra steps the library preforms to determine what the best way to process the data is.

CodePudding user response：

pd.json_normalize is more convenient to use, but it turns out to be the slowest. The list generator has become the fastest of all. Below is the pd.json_normalize code. And tests using different approaches.

import pandas as pd

aaa = [[{'date': None, 'dateTime': '2021-08-12T09:30'}],
[{'date': None, 'dateTime': '2021-08-12T10:00'}],
[{'date': None, 'dateTime': '2021-08-18T11:00'}],
[{'date': None, 'dateTime': '2023-08-09T14:00'}],
[{'date': None, 'dateTime': '2023-08-09T15:00'}],
[{'date': None, 'dateTime': '2023-08-10T11:30'}],
[{'date': None, 'dateTime': '2023-08-10T16:00'}],
[{'date': None, 'dateTime': '2023-08-10T17:00'}]]


calFilteredDf = pd.DataFrame(aaa)

print(calFilteredDf)

calFilteredDf = pd.json_normalize(calFilteredDf[0])
calFilteredDf['startTime'] = calFilteredDf['dateTime']

print(calFilteredDf)

Input

                                                0
0  {'date': None, 'dateTime': '2021-08-12T09:30'}
1  {'date': None, 'dateTime': '2021-08-12T10:00'}
2  {'date': None, 'dateTime': '2021-08-18T11:00'}
3  {'date': None, 'dateTime': '2023-08-09T14:00'}
4  {'date': None, 'dateTime': '2023-08-09T15:00'}
5  {'date': None, 'dateTime': '2023-08-10T11:30'}
6  {'date': None, 'dateTime': '2023-08-10T16:00'}
7  {'date': None, 'dateTime': '2023-08-10T17:00'}

Output

   date          dateTime         startTime
0  None  2021-08-12T09:30  2021-08-12T09:30
1  None  2021-08-12T10:00  2021-08-12T10:00
2  None  2021-08-18T11:00  2021-08-18T11:00
3  None  2023-08-09T14:00  2023-08-09T14:00
4  None  2023-08-09T15:00  2023-08-09T15:00
5  None  2023-08-10T11:30  2023-08-10T11:30
6  None  2023-08-10T16:00  2023-08-10T16:00
7  None  2023-08-10T17:00  2023-08-10T17:00

Yes, indeed json_normalize is twice as slow. Below is the code where apply, json_normalize, transform, list generator are used.

now = datetime.datetime.now()
for i in range(10000):
    calFilteredDf[0].apply(lambda x: x['dateTime'])

time_ = datetime.datetime.now() - now
print('apply', time_)

now = datetime.datetime.now()
for i in range(10000):
    pd.json_normalize(calFilteredDf[0])

time_ = datetime.datetime.now() - now
print('json_normalize', time_)


now = datetime.datetime.now()
for i in range(10000):
    calFilteredDf[0].transform(lambda x: x['dateTime'])

time_ = datetime.datetime.now() - now
print('transform', time_)

now = datetime.datetime.now()
for i in range(10000):
    a = [i['dateTime'] for i in calFilteredDf[0]]

time_ = datetime.datetime.now() - now
print('list generator', time_)

Output

apply 0:00:01.707580
json_normalize 0:00:03.666553
transform 0:00:01.896933
list generator 0:00:00.056657

The output is to use a list generator.

CodePudding user response：

apply, transform and list comprehension methods produce similar speeds on large datasets (with more than 2000 rows), and they all are quite fast! On smaller datasets (especially < 1000 rows) list comprehension beats other methods.

Timings using the perfplot package:

def gen(n):
    return pd.Series([{'date': None, 'dateTime': '2021-08-12T09:30'}] * n)

def using_apply(s):
    return s.apply(lambda x: x['dateTime'])

def using_transform(s):
    return s.transform(lambda x: x['dateTime'])

def using_list_comprehension(s):
    return pd.Series([i['dateTime'] for i in s])

import perfplot

perfplot.plot(
    setup=gen,
    kernels=[using_apply, using_transform, using_list_comprehension],
    n_range=[2**k for k in range(4, 22)],
    equality_check=None
)

Comparison with ps.json_normalize() on smaller datasets:

def using_json_normalize(s):
    return pd.json_normalize(s)['dateTime']

perfplot.plot(
    setup=gen,
    kernels=[using_apply, using_transform, using_list_comprehension, using_json_normalize],
    n_range=[2**k for k in range(4, 12)],
    equality_check=None
)