I'm using Databricks. For my data I created a Delta Lake table. Then I tried to modify a column using the pandas API, but for some reason the following error message pops up:
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
I use the following code to rewrite data in the table:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import *
import pyspark.pandas as ps
from pyspark.pandas.config import set_option
%matplotlib inline

df_new = spark.read.format('delta').load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/{delta_name}")

win_len = 5000
# pandas_api() is available on Databricks Runtime 11.x (Spark 3.3) and later;
# on earlier runtimes the same method was called to_pandas_on_spark()
df_new = df_new.pandas_api()
print('Creating Average active power for U1 and V1...')
df_new['p_avg1'] = df_new.Current1.mul(df_new['Voltage1']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U2 and V2...')
df_new['p_avg2'] = df_new.Current2.mul(df_new['Voltage2']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U3 and V3...')
df_new['p_avg3'] = df_new.Current3.mul(df_new['Voltage3']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U4 and V4...')
df_new['p_avg4'] = df_new.Current4.mul(df_new['Voltage4']).rolling(min_periods=1, window=win_len).mean()
print('Converting to Spark dataframe')
df_new = df_new.to_spark()
print('Complete')
Previously there were no problems with the pandas API; I'm using the latest Runtime, 11.2. Only one dataframe was loaded while I was using the cluster.
Thank you in advance.
CodePudding user response:
The error message is suggesting this: In order to allow this operation, enable 'compute.ops_on_diff_frames' option
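pandas-on-Spark raises this whenever an operation combines a series or dataframe with one that it considers anchored to a different internal frame. A minimal reproduction (with hypothetical toy data, not your table) looks like this:
import pyspark.pandas as ps

a = ps.DataFrame({'x': [1, 2, 3]})
b = ps.DataFrame({'x': [4, 5, 6]})

# Assigning a column from one frame into another triggers the error
# unless 'compute.ops_on_diff_frames' is enabled:
a['y'] = b['x']  # ValueError: Cannot combine the series or dataframe ...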
Here's how to enable this option per the docs:
import pyspark.pandas as ps
ps.set_option('compute.ops_on_diff_frames', True)
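Applied to your snippet, that means setting the option before the column assignments. A sketch reusing your own names (ps.reset_option restores the default protection afterwards):
import pyspark.pandas as ps

ps.set_option('compute.ops_on_diff_frames', True)

df_new = df_new.pandas_api()
df_new['p_avg1'] = df_new.Current1.mul(df_new['Voltage1']).rolling(min_periods=1, window=win_len).mean()
# ... p_avg2 to p_avg4 as in your code ...
df_new = df_new.to_spark()

ps.reset_option('compute.ops_on_diff_frames')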
The docs have this important warning:
Pandas API on Spark disallows the operations on different DataFrames (or Series) by default to prevent expensive operations. It internally performs a join operation which can be expensive in general.
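If that implicit join turns out to be too slow on your data, one way to sidestep it entirely is to compute the rolling averages with native Spark window functions on the Spark dataframe, before any pandas_api() call. This is only a sketch: it assumes your table has a column defining row order (called ts here), which the pandas rolling window relies on implicitly.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

win_len = 5000
# Average over the current row and the win_len - 1 preceding rows,
# ordered by the assumed 'ts' column (matches min_periods=1 behaviour,
# since the frame is simply truncated at the start of the data)
w = Window.orderBy('ts').rowsBetween(-(win_len - 1), 0)

for i in range(1, 5):
    df_new = df_new.withColumn(
        f'p_avg{i}',
        F.avg(F.col(f'Current{i}') * F.col(f'Voltage{i}')).over(w)
    )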