PyPy does a great job of accelerating my code. However, when it comes to using Pandas on PyPy, as expected, it does not speed up the code much. I am looking for a way to replace that part of the code with a few lines that do not rely on Pandas, so that I can benefit from the full power of PyPy.

The task is really simple with Pandas: I have four data frames, df_AB, df_CD, df_AC, and df_BD. I first build a merged data frame out of df_AB and df_CD, called df_tot. I remove the rows that contain repeated values and sort the resulting data frame. Then I compare df_tot with df_AC and df_BD, and keep a row of df_tot only if its values in columns A and C are present in df_AC and its values in columns B and D are present in df_BD:
import pandas as pd

# Cross join AB x CD, then drop rows in which any value repeats
df_tot = df_AB.merge(df_CD, how='cross')
df_tot = df_tot[~df_tot.apply(lambda x: x.duplicated().any(), axis=1)]
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.sort_values(["A", "B", "C", "D"], ascending=True, na_position='last')
# Keep only rows whose (B, D) pair appears in df_BD and whose (A, C) pair appears in df_AC
df = pd.merge(df_tot, df_BD, on=['B', 'D'], how="inner")
df = df.drop_duplicates()
df_ACBD = pd.merge(df, df_AC, on=['A', 'C'], how="inner")
df_ACBD = df_ACBD.drop_duplicates()
How can I accelerate these few lines of code on PyPy? Many thanks!
CodePudding user response:
You complain that your code runs more slowly than desired, and that you want to "benefit from the full power of PyPy."
Well, PyPy can loop through a million iterations faster than the CPython bytecode interpreter, and Cython can go faster still.
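For scale, here is a toy loop, not your workload, of the kind PyPy's JIT typically runs many times faster than CPython:

import time

def sum_squares(n):
    # A tight pure-Python loop; PyPy's JIT compiles this to machine code
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.perf_counter()
sum_squares(10_000_000)
print(f"elapsed: {time.perf_counter() - start:.3f}s")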
But your difficulty seems to be with your algorithm, starting with allocating storage for a giant Cartesian cross product in the very first merge. The A, B, C, D labels don't let me infer much about the business problem you're trying to solve, so it's hard to get an intuition for what you're really trying to compute. But repeatedly blowing up the number of rows with .merge() and then pruning duplicates just seems wasteful. If pandas gets to exploit dataframe indexes during these operations, it's not apparent from the posted code.
Consider putting your rows into four RDBMS tables, perhaps SQLite, and then JOINing them, as sketched below. Two good things would come of this. You'd be forced to explicitly declare UNIQUE constraints such as PRIMARY KEY. And you'd have an opportunity to examine EXPLAIN PLAN output to see whether the query plan makes sense or whether a tweak would produce a better one.
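Here is a minimal sketch of that approach with Python's built-in sqlite3 module. The schema (tables ab, cd, ac, bd with integer columns a, b, c, d) and the exact filter conditions are my guesses at your actual problem:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ab (a INTEGER, b INTEGER, PRIMARY KEY (a, b));
    CREATE TABLE cd (c INTEGER, d INTEGER, PRIMARY KEY (c, d));
    CREATE TABLE ac (a INTEGER, c INTEGER, PRIMARY KEY (a, c));
    CREATE TABLE bd (b INTEGER, d INTEGER, PRIMARY KEY (b, d));
""")
# Load rows with, e.g., con.executemany("INSERT INTO ab VALUES (?, ?)", ab_rows)

query = """
    SELECT DISTINCT ab.a, ab.b, cd.c, cd.d
    FROM ab
    JOIN ac ON ac.a = ab.a
    JOIN cd ON cd.c = ac.c
    JOIN bd ON bd.b = ab.b AND bd.d = cd.d
    WHERE ab.a NOT IN (ab.b, cd.c, cd.d)   -- no repeated values within a row
      AND ab.b NOT IN (cd.c, cd.d)
      AND cd.c <> cd.d
    ORDER BY ab.a, ab.b, cd.c, cd.d
"""
for row in con.execute("EXPLAIN QUERY PLAN " + query):
    print(row)  # sanity-check the query plan before running for real
results = con.execute(query).fetchall()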
tl;dr: big-O complexity matters; be sure to avoid quadratic behavior where O(n log n) would suffice.
CodePudding user response:
PyPy achieves its improvement by automatically translating some pure Python code into native machine code (via its JIT), notably loops.
But one of pandas' main advantages is that, when using it, we avoid writing any loops in Python. Well-written pandas code spends 99% of its time executing pandas functions rather than pure Python code. Those pandas functions are already written in C and are very well optimized (if PyPy were somehow able to optimize C code further, that work would already have been done in pandas itself!).
So the only chance PyPy has to optimize your code is the remaining 1% of CPU time spent in pure Python.
PyPy is for people who want to write for loops in Python and still have them be fast. Pandas is for people who don't want to write for loops in Python at all.
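For what it's worth, here is a minimal pure-Python sketch of the asker's pipeline, the style of loop-and-set code that PyPy's JIT does accelerate (assuming each data frame can be represented as a list of 2-tuples):

def join_filter(ab, cd, ac, bd):
    ac_set = set(ac)  # O(1) membership tests replace the inner merges
    bd_set = set(bd)
    out = set()       # a set of 4-tuples, so duplicates disappear for free
    for a, b in ab:
        for c, d in cd:
            if len({a, b, c, d}) < 4:  # skip rows with repeated values
                continue
            if (a, c) in ac_set and (b, d) in bd_set:
                out.add((a, b, c, d))
    return sorted(out)  # same ordering as sort_values(["A", "B", "C", "D"])

Note that this is still O(len(ab) * len(cd)), exactly like the cross merge, so the complexity warning in the other answer applies here too.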