I'm using Databricks. For my data I created a Delta Lake table. Then I tried to modify a column using the pandas API, but for some reason the following error message pops up:
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
I use the following code to rewrite data in the table:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import *
import pyspark.pandas as ps
from pyspark.pandas.config import set_option
%matplotlib inline

df_new = spark.read.format('delta').load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/{delta_name}")

win_len = 5000
# pandas_api() is available on Databricks Runtime 11.x (Spark 3.3) and later;
# on earlier runtimes the same method was called to_pandas_on_spark()
df_new = df_new.pandas_api()
print('Creating Average active power for U1 and V1...')
df_new['p_avg1'] = df_new.Current1.mul(df_new['Voltage1']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U2 and V2...')
df_new['p_avg2'] = df_new.Current2.mul(df_new['Voltage2']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U3 and V3...')
df_new['p_avg3'] = df_new.Current3.mul(df_new['Voltage3']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U4 and V4...')
df_new['p_avg4'] = df_new.Current4.mul(df_new['Voltage4']).rolling(min_periods=1, window=win_len).mean()
print('Converting to Spark dataframe')
df_new = df_new.to_spark()
print('Complete')
Previously there were no problems with the pandas API; I'm using the latest Runtime, 11.2. Only one dataframe was loaded while I was using the cluster.
Thank you in advance.
CodePudding user response:
The error message is suggesting this: In order to allow this operation, enable 'compute.ops_on_diff_frames' option
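pandas-on-Spark raises this whenever an operation combines a series or dataframe with one that it considers anchored to a different internal frame. A minimal reproduction (with hypothetical toy data, not your table) looks like this:
import pyspark.pandas as ps

a = ps.DataFrame({'x': [1, 2, 3]})
b = ps.DataFrame({'x': [4, 5, 6]})

# Assigning a column from one frame into another triggers the error
# unless 'compute.ops_on_diff_frames' is enabled:
a['y'] = b['x']  # ValueError: Cannot combine the series or dataframe ...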
Here's how to enable this option per the docs:
import pyspark.pandas as ps
ps.set_option('compute.ops_on_diff_frames', True)
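Applied to your snippet, that means setting the option before the column assignments. A sketch reusing your own names (ps.reset_option restores the default protection afterwards):
import pyspark.pandas as ps

ps.set_option('compute.ops_on_diff_frames', True)

df_new = df_new.pandas_api()
df_new['p_avg1'] = df_new.Current1.mul(df_new['Voltage1']).rolling(min_periods=1, window=win_len).mean()
# ... p_avg2 to p_avg4 as in your code ...
df_new = df_new.to_spark()

ps.reset_option('compute.ops_on_diff_frames')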
The docs have this important warning:
Pandas API on Spark disallows the operations on different DataFrames (or Series) by default to prevent expensive operations. It internally performs a join operation which can be expensive in general.
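If that implicit join turns out to be too slow on your data, one way to sidestep it entirely is to compute the rolling averages with native Spark window functions on the Spark dataframe, before any pandas_api() call. This is only a sketch: it assumes your table has a column defining row order (called ts here), which the pandas rolling window relies on implicitly.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

win_len = 5000
# Average over the current row and the win_len - 1 preceding rows,
# ordered by the assumed 'ts' column (matches min_periods=1 behaviour,
# since the frame is simply truncated at the start of the data)
w = Window.orderBy('ts').rowsBetween(-(win_len - 1), 0)

for i in range(1, 5):
    df_new = df_new.withColumn(
        f'p_avg{i}',
        F.avg(F.col(f'Current{i}') * F.col(f'Voltage{i}')).over(w)
    )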