Issue with pd.DataFrame.apply with arguments

I want to create augmented data in a new dataframe for every row of an original dataframe.

So I've defined an augment function that I want to use with apply, as follows:

def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int):
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")

    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)

        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)

        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row

    # print(target_df.shape)
    # display.display(target_df)

When I call it directly, as follows, everything works:

tmp_df = pd.DataFrame(None, columns=testDF.columns)
augment(testDF.iloc[0], column_name='binMap', target_df=tmp_df, num_samples=4)
augment(testDF.iloc[1], column_name='binMap', target_df=tmp_df, num_samples=4)

However, when I use the apply method instead, the print and display calls inside augment work fine, but the resulting dataframe shows errors:

tmp_df = pd.DataFrame(None, columns=testDF.columns)
testDF.apply(augment, args=('binMap', tmp_df, 4, ), axis=1)

This is what the output data looks like after the apply call:

,data
<Error>, <Error>
<Error>, <Error>

What am I doing wrong?

CodePudding user response：

Your test is very nice, thank you for the clear exposition. I am happy to be your rubber duck.

In test A, you (successfully) mess with testDF.iloc[0] and [1], using kind of a Fortran-style API for augment(), leaving a side effect result in tmp_df.

Test B is carefully constructed to be "the same" except for the .apply() call. So let's see, what's different? Hard to say. Let's go examine the docs.

Oh, right! We're using the .apply() API, so we'd better follow it. Down at the end it explains:

Returns: Series or DataFrame

Result of applying func along the given axis of the DataFrame.

But augment() has no return statement, so you're handing back None instead.

Now, I'm not here to pass judgement on whether it's best to have side effects on a target df -- that's up to you. But .apply() will be bent out of shape until you give it something nice to store as its own result. Happy hunting!
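
To make the return-value point concrete, here is a minimal sketch on a toy dataframe. The names df, collected, side_effect_only and side_effect_and_return are made up for illustration and are not from the question. It only shows how .apply() gathers whatever the function returns; it does not try to reproduce the <Error> cells.

```python
import pandas as pd

# Toy data, assumed purely for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
collected = []  # plays the role of the side-effect target (like tmp_df)


def side_effect_only(row: pd.Series) -> None:
    # Records something elsewhere, but has no return statement,
    # so Python returns None for every row.
    collected.append(row["a"] + row["b"])


def side_effect_and_return(row: pd.Series) -> pd.Series:
    # Same side effect, but hands the row back so .apply() has a result to store.
    collected.append(row["a"] + row["b"])
    return row


# .apply() keeps the return value of each call as its own result.
none_result = df.apply(side_effect_only, axis=1)       # one None per row
row_result = df.apply(side_effect_and_return, axis=1)  # one row per input row

print(type(none_result).__name__, none_result.isna().all())
print(type(row_result).__name__, row_result.shape)
```

The side effects happen in both cases; the difference is only in what .apply() itself gets to return.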

CodePudding user response：

This change worked for me:

def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int) -> pd.Series:
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")

    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)

        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)

        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row

    # print(target_df.shape)
    # display.display(target_df)
    return row

And I updated the call to apply as follows:

testDF = testDF.apply(augment, args=('binMap', tmp_df, 4, ), result_type='broadcast', axis=1)

Thank you @J_H. If there is a better way to achieve what I'm doing, please feel free to suggest improvements.
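
One possible alternative, offered only as a sketch: skip .apply() and the side effect altogether, collect the new rows in a plain list, and build the augmented dataframe once at the end. Everything here is an assumption for illustration. fake_augment is a hypothetical stand-in for data_augmentation0, augment_rows is a made-up name, and the binarization and resize steps from the question are left out so the example runs without TensorFlow.

```python
import numpy as np
import pandas as pd


def fake_augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical stand-in for data_augmentation0: flips the image randomly."""
    return np.fliplr(image) if rng.random() < 0.5 else np.flipud(image)


def augment_rows(df: pd.DataFrame, column_name: str, num_samples: int,
                 rng: np.random.Generator) -> pd.DataFrame:
    """Return a new dataframe holding each original row plus num_samples variants."""
    records = []
    for _, row in df.iterrows():
        base = row[column_name]
        for i in range(num_samples + 1):
            new_row = row.to_dict()
            # i == 0 keeps the original image; later iterations are augmented copies.
            new_row[column_name] = base if i == 0 else fake_augment(base, rng)
            records.append(new_row)
    # Build the result once instead of growing a dataframe row by row with .loc.
    return pd.DataFrame(records, columns=df.columns)


# Toy input, assumed for illustration: two tiny "images" with a label each.
source = pd.DataFrame({
    "label": ["x", "y"],
    "binMap": [np.arange(4).reshape(2, 2), np.eye(2)],
})
augmented = augment_rows(source, "binMap", num_samples=2,
                         rng=np.random.default_rng(0))
print(augmented.shape)
```

The function returns its result instead of mutating a dataframe passed in, so there is nothing for .apply() to misinterpret, and appending to a list avoids repeated .loc enlargement of the target dataframe.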