I'm a Jr. Data Scientist and I'm trying to solve a problem that may be simple for experienced programmers. I'm dealing with Big Data on GCP and I need to optimize my code.
[...]
def send_to_bq(self, df):
    result = []
    for i, row in df[["id", "vectors", "processing_timestamp"]].iterrows():
        data_dict = {
            "processing_timestamp": str(row["processing_timestamp"]),
            "id": row["id"],
            "embeddings_vector": [str(x) for x in row["vectors"]],
        }
        result.append(data_dict)
[...]
Our DataFrame has the following pattern:
           id                                               name  \
0  3498001704  roupa natal flanela animais estimacao traje ma...

                                             vectors  \
0  [0.4021441, 0.45425776, 0.3963987, 0.23765437,...

        processing_timestamp
0 2021-10-26 23:48:57.315275
Using iterrows on a DataFrame is too slow. I've been studying alternatives and I know that:
- I can use apply
- I can vectorize it through Pandas Series (better than apply)
- I can vectorize it through NumPy (better than pandas vectorization)
- I can use Swifter, which wraps the apply method and then picks the best option for you among Dask, Ray and vectorization
But I don't know how I can transform my code for those solutions.
Can anyone help me by demonstrating a solution for my code? One is enough, but if someone could show more than one solution, it would be really educational.
I would be more than grateful for any help!
CodePudding user response:
So you basically convert everything to strings and then transform your DataFrame into a list of dicts. For the second part, there is a pandas method, to_dict. For the first part, I would use astype and apply, only to convert the types.
df["processing_timestamp"] = df["processing_timestamp"].astype(str)
df["embeddings_vector"] = df["vectors"].apply(lambda row: [str(x) for x in row])
# Select the converted column, not the original "vectors" column.
result = df[["id", "embeddings_vector", "processing_timestamp"]].to_dict('records')
A bit hard to test without sample data, but hopefully this helps ;) Also, as I did with the lambda function, you could basically wrap your entire loop body inside an apply, but that would create far too many temporary dictionaries to be fast.
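For anyone who wants to check this against the original loop, here is a minimal sketch using made-up sample data shaped like the DataFrame in the question. The sample values and the helper names with_iterrows, with_to_dict and with_zip are hypothetical, not from the question. It builds the records three ways (the original iterrows loop, the column-wise conversion followed by to_dict, and a plain zip over the columns) and confirms they agree on this sample.

```python
import pandas as pd

# Made-up sample data shaped like the DataFrame in the question.
df = pd.DataFrame(
    {
        "id": [3498001704, 3498001705],
        "name": ["roupa natal", "outro produto"],
        "vectors": [[0.4021441, 0.45425776], [0.3963987, 0.23765437]],
        "processing_timestamp": pd.to_datetime(
            ["2021-10-26 23:48:57.315275", "2021-10-26 23:48:58.000001"]
        ),
    }
)


def with_iterrows(df):
    """Original approach: build one dict per row inside an iterrows loop."""
    result = []
    for _, row in df[["id", "vectors", "processing_timestamp"]].iterrows():
        result.append(
            {
                "processing_timestamp": str(row["processing_timestamp"]),
                "id": row["id"],
                "embeddings_vector": [str(x) for x in row["vectors"]],
            }
        )
    return result


def with_to_dict(df):
    """Convert whole columns first, then call to_dict('records') once."""
    out = df[["id", "processing_timestamp"]].copy()
    out["processing_timestamp"] = out["processing_timestamp"].astype(str)
    out["embeddings_vector"] = df["vectors"].apply(lambda v: [str(x) for x in v])
    return out.to_dict("records")


def with_zip(df):
    """Plain-Python loop over column values, with no per-row Series objects."""
    return [
        {
            "processing_timestamp": str(ts),
            "id": id_,
            "embeddings_vector": [str(x) for x in vec],
        }
        for id_, vec, ts in zip(df["id"], df["vectors"], df["processing_timestamp"])
    ]


# All three approaches produce the same records on this sample.
assert with_iterrows(df) == with_to_dict(df) == with_zip(df)
```

One thing worth checking on real data: astype(str) formats a datetime column as a whole, so its strings may not match a per-row str() call in every case (for instance, when only some rows have fractional seconds). The zip version calls str() per value, exactly like the original loop.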
CodePudding user response:
You can use pandas.DataFrame methods to convert it to other types, such as DataFrame.to_dict() and more.
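As a rough illustration of what DataFrame.to_dict() returns, the sketch below uses a tiny made-up frame (not the asker's data) and compares the default orientation with the "records" and "list" orientations. The "records" one is the shape that matches the list of dicts built in the question.

```python
import pandas as pd

# Tiny made-up frame, only to show the shapes to_dict can return.
df = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.25]})

by_column = df.to_dict()           # {column: {index: value}}
by_record = df.to_dict("records")  # one dict per row
by_list = df.to_dict("list")       # {column: [values]}

print(by_column)  # {'id': {0: 1, 1: 2}, 'score': {0: 0.5, 1: 0.25}}
print(by_record)  # [{'id': 1, 'score': 0.5}, {'id': 2, 'score': 0.25}]
print(by_list)    # {'id': [1, 2], 'score': [0.5, 0.25]}
```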