Pandas iterrows too slow, how can I vectorize this code?

I'm a Jr. Data Scientist and I'm trying to solve a problem that may be simple for experienced programmers. I'm dealing with Big Data on GCP and I need to optimize my code.

                                      [...]
    def send_to_bq(self, df):
        result = []
        for i, row in df[["id", "vectors", "processing_timestamp"]].iterrows():
            data_dict = {
                "processing_timestamp": str(row["processing_timestamp"]),
                "id": row["id"],
                "embeddings_vector": [str(x) for x in row["vectors"]],
            }
            result.append(data_dict)
                                      [...]

Our DataFrame has the following pattern:

           id                                               name  \
0  3498001704  roupa natal flanela animais estimacao traje ma...   

                                             vectors  \
0  [0.4021441, 0.45425776, 0.3963987, 0.23765437,...   

        processing_timestamp  
0 2021-10-26 23:48:57.315275

Using iterrows on a DataFrame is too slow. I've been studying alternatives and I know that:

I can use apply
I can vectorize it through Pandas Series (better than apply)
I can vectorize it through Numpy (better than Pandas vectorization)
I can use Swifter - which uses the apply method and then picks the best option for you among Dask, Ray and vectorization

But I don't know how to transform my code to use any of these solutions.

Can anyone help me by demonstrating a solution for my code? One is enough, but if someone could show more than one, it would be really educational.

I would be more than grateful for any help!

CodePudding user response：

So you basically convert everything to strings and then transform your DataFrame into a list of dicts.

For the second part, there is a pandas method, to_dict. For the first part, I would use astype and apply, but only to convert the types.

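# Convert the timestamp column to strings in one call.
df["processing_timestamp"] = df["processing_timestamp"].astype(str)
# Each cell holds a list, so convert its elements with apply.
df["embeddings_vector"] = df["vectors"].apply(lambda vec: [str(x) for x in vec])
# Select the converted column (not the original "vectors") before building the records.
result = df[["processing_timestamp", "id", "embeddings_vector"]].to_dict('records')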

A bit hard to test without sample data, but hopefully this helps ;) Also, as I did with the lambda function, you could basically wrap your entire loop body inside an apply, but that would create far too many temporary dictionaries to be fast.
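Since the question has no sample data, here is a minimal sketch for trying the idea on a tiny frame. The two rows and the intermediate name out are made up for illustration; only the column names come from the question. It also works on a copy, on the assumption that you would rather not overwrite columns in the original DataFrame.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
df = pd.DataFrame(
    {
        "id": [3498001704, 3498001705],
        "name": ["roupa natal flanela", "traje animais"],
        "vectors": [[0.4021441, 0.45425776], [0.3963987, 0.23765437]],
        "processing_timestamp": pd.to_datetime(
            ["2021-10-26 23:48:57.315275", "2021-10-26 23:49:01.000001"]
        ),
    }
)

# Work on a copy of the needed columns so the original DataFrame is untouched.
out = df[["processing_timestamp", "id"]].copy()
# Whole-column conversion of the timestamps to strings.
out["processing_timestamp"] = out["processing_timestamp"].astype(str)
# Each cell is a list, so its elements are converted one list at a time.
out["embeddings_vector"] = df["vectors"].apply(lambda vec: [str(x) for x in vec])

# One dict per row, keyed by column name.
result = out.to_dict("records")
print(result[0])
```

One thing worth checking on your real data: astype(str) formats the column as a whole, so the timestamp text may not match str(row["processing_timestamp"]) character for character in every case (for example, in how fractional seconds are shown). Comparing a few rows against the original loop before switching over should settle that.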

CodePudding user response：

You can use the conversion methods built into pandas.DataFrame, such as DataFrame.to_dict(), to turn it into other types without writing the loop yourself.
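To make that a bit more concrete, here is a small sketch of how the orient argument of to_dict changes the shape of the result. The two-column frame is invented for the example; "records" is the shape that matches the list of dicts built in the question.

```python
import pandas as pd

# Hypothetical toy frame, just to show the output shapes.
df = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.25]})

# Default orient ("dict"): column -> {index label -> value}.
by_index = df.to_dict()

# "records": one dict per row, i.e. a list of dicts.
records = df.to_dict("records")

# "list": column -> list of values.
by_column = df.to_dict("list")

print(by_index)
print(records)
print(by_column)
```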