This loop is currently taking almost 3 hours on my desktop running at 5 GHz (overclocked). How would I go about speeding it up?
import pandas as pd

df = pd.DataFrame(columns=['clientId', 'url', 'count'])
idx = 0
for row in rows:
    df.loc[idx] = pd.Series({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count})
    idx += 1
rows is the query result from BigQuery, stored in a RowIterator:
<google.cloud.bigquery.table.RowIterator object at 0x000001ADD93E7B50>
<class 'google.cloud.bigquery.table.RowIterator'>
The rows look like this:
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/index.html', 45), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact.html', 65), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/index.html', 64), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/products.html', 56), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/employees.html', 54), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact/cookies.html', 44), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/careers.html', 91), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-ca/careers.html', 42), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact.html', 44), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/', 115), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/suppliers', 51), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/search.html', 60), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/careers.html', 50), {'clientId': 0, 'pagePath': 1, 'count': 2})
CodePudding user response:
I ran across the to_dataframe() method in the BigQuery client library. It is extremely fast and took the runtime from 3 hours down to 3 seconds.
df = query_job.result().to_dataframe()
See the google.cloud.bigquery.table.RowIterator reference and the guide "Downloading BigQuery data to pandas using the BigQuery Storage API" in the BigQuery docs.
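For context, a minimal end-to-end sketch of how that can look; the project, dataset, and query below are placeholders, not from the original post:

from google.cloud import bigquery

# Assumes application default credentials are configured.
client = bigquery.Client()

# Placeholder query; substitute your own project/dataset/table.
query_job = client.query("""
    SELECT clientId, pagePath, count
    FROM `my-project.my_dataset.pageviews`
""")

# Materialize the entire result set in one call instead of looping
# over the RowIterator and growing a DataFrame row by row.
df = query_job.result().to_dataframe()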
CodePudding user response:
This is not how a pandas DataFrame is meant to be used. A DataFrame stores data column-wise: each column is a Series backed by a fixed-size numpy array (although columns of the same dtype may be stored contiguously in a shared block). Every time you append a row, every column's array has to be resized, i.e. reallocated, and that by itself is expensive. Doing this for every row means n reallocations of the arrays for each column (per unique dtype), which is extremely inefficient. On top of that, you are creating a pd.Series for each row, which incurs even more allocations and buys you nothing, since the DataFrame stores the data column-wise anyway.
You can verify this by looking at the id() of the column Series:
>>> import pandas as pd
>>> df = pd.DataFrame(columns=['clientId', 'url', 'count'])
# Look at the ID of the DataFrame and the columns
>>> id(df)
1494628715776
# These are the IDs of the empty Series for each column
>>> id(df['clientId']), id(df['url']), id(df['count'])
(1494628789264, 1494630670400, 1494630670640)
# Assigning a series at an index that didn't exist before
>>> df.loc[0] = pd.Series({'clientId': 123, 'url': 123, 'count': 100})
# ID of the dataframe remains the same
>>> id(df)
1494628715776
# However, the underlying Series objects are different (newly allocated)
>>> id(df['clientId']), id(df['url']), id(df['count'])
(1494630712656, 1494630712176, 1494630712272)
By iteratively adding rows, you are re-creating new Series objects on every iteration, which is why it is slow. The pandas documentation warns about this under the (now deprecated) DataFrame.append() method; the argument still holds: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html#pandas.DataFrame.append
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
You'd be better off doing the iteration and appending into a data structure suited for dynamic-sized growth, like a native Python list, before calling pd.DataFrame on it. For your simple case, however, you can just pass a generator into the pd.DataFrame call:
# No need to specify columns since you provided the dictionary with the keys
df = pd.DataFrame({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count} for row in rows)
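If you prefer the explicit list approach the documentation describes, a sketch of the same idea (build a list of dicts first, construct the DataFrame once at the end) could look like:

records = []
for row in rows:
    # Appending to a Python list is cheap (amortized O(1)); no DataFrame
    # reallocation happens until the single pd.DataFrame call below.
    records.append({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count})

df = pd.DataFrame(records)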
To demonstrate the difference in a Jupyter notebook:
def reallocating_way(rows):
    df = pd.DataFrame(columns=['clientId', 'url', 'count'])
    for idx, row in enumerate(rows):
        df.loc[idx] = pd.Series({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count})
    return df

def better_way(rows):
    return pd.DataFrame({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count} for row in rows)
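(Row in the benchmark below is a stand-in with the same attributes as a BigQuery row, used only for timing; a minimal assumed version:)

class Row:
    # Minimal stand-in for a BigQuery Row, used only for the benchmark.
    clientId = 'xxxxxxxxxx.xxxxxxxxxx'
    pagePath = '/en-us/index.html'
    count = 45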
# Making an arbitrary list of 1000 rows
rows = [Row() for _ in range(1000)]
%timeit reallocating_way(rows)
2.45 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit better_way(rows)
1.8 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Making an arbitrary list of 10000 rows
rows = [Row() for _ in range(10000)]
%timeit reallocating_way(rows)
27.3 s ± 1.88 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit better_way(rows)
12.4 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That is more than 1000x faster for 1,000 rows and more than 2000x faster for 10,000 rows.