I would like to know if the merge operation using how="cross"
will keep my lines order on the left and right side, to be more clear, I except something like that :
df1 = pd.DataFrame(["a", "b", "c"])
df2 = pd.DataFrame(["1", "2", "3"])
df1.merge(df2, how="cross")
# I except the result to be ALWAYS like this (with 1, 2, 3 repeating) :
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 c 3
I tested with few data, but I will have to use billions of rows, thus it's hard to check if the order stays the same.
In pandas doc, they say :
cross: creates the cartesian product from both frames, preserves the order of the left keys.
Left key is preserved, so should I assume right keys order is not ?
Thanks for your help
CodePudding user response:
The order of the right keys is preserved within each group of left keys. What the documentation describes is that the left keys will be the primary order.
Example:
df1 = pd.DataFrame(["c", "b", "a"])
df2 = pd.DataFrame(["1", "3", "2"])
df1.merge(df2, how="cross")
Output:
0_x 0_y
0 c 1 # within a group
1 c 3 # the order is always
2 c 2 # 1 -> 3 -> 2
3 b 1
4 b 3
5 b 2
6 a 1
7 a 3
8 a 2
how is it done internally?
pandas sets up dummy columns in each DataFrame with a value of 1 and performs an inner merge:
def _create_cross_configuration(
self, left: DataFrame, right: DataFrame
) -> tuple[DataFrame, DataFrame, str, str]:
"""
Creates the configuration to dispatch the cross operation to inner join,
e.g. adding a join column and resetting parameters. Join column is added
to a new object, no inplace modification
Parameters
----------
left : DataFrame
right : DataFrame
Returns
-------
a tuple (left, right, how, cross_col) representing the adjusted
DataFrames with cross_col, the merge operation set to inner and the column
to join over.
"""
cross_col = f"_cross_{uuid.uuid4()}"
how = "inner"
return (
left.assign(**{cross_col: 1}),
right.assign(**{cross_col: 1}),
how,
cross_col,
)