How to perform split/merge/melt with Python and polars?-CodePudding

I have a data transformation problem where the original data consists of "blocks" of three rows of data, where the first row denotes a 'parent' and the two others are related children. A minimum working example looks like this:

import polars as pl
df_original = pl.DataFrame(
    {
        'Order ID': ['A', 'foo', 'bar'],
        'Parent Order ID': [None, 'A', 'A'],
        'Direction': ["Buy", "Buy", "Sell"],
        'Price': [1.21003, None, 1.21003],
        'Some Value': [4, 4, 4],
        'Name Provider 1': ['P8', 'P8', 'P8'],
        'Quote Provider 1': [None, 1.1, 1.3],
        'Name Provider 2': ['P2', 'P2', 'P2'],
        'Quote Provider 2': [None, 1.15, 1.25],
        'Name Provider 3': ['P1', 'P1', 'P1'],
        'Quote Provider 3': [None, 1.0, 1.4],
        'Name Provider 4': ['P5', 'P5', 'P5'],
        'Quote Provider 4': [None, 1.0, 1.4]
    }
)

In reality, there are up to 15 Providers (so up to 30 columns), but they are not necessary for the example.

We would like to transform this into a format where each row represents both the Buy and Sell quote of a single provider for that parent. The desired result is as follows:

df_desired = pl.DataFrame(
    {
        'Order ID': ['A', 'A', 'A', 'A'],
        'Parent Direction': ['Buy', 'Buy', 'Buy', 'Buy'],
        'Price': [1.21003, 1.21003, 1.21003, 1.21003],
        'Some Value': [4, 4, 4, 4],
        'Name Provider': ['P8', 'P2', 'P1', 'P5'],
        'Quote Buy': [1.1, 1.15, 1.0, 1.0],
        'Quote Sell': [1.3, 1.25, 1.4, 1.4],
    }
)
df_desired

However, I'm having a hard time doing this in polars.

My first approach was splitting the data into parents and children, then joining them together on the respective ids:

df_parents = (
    df_original
    .filter(pl.col("Parent Order ID").is_null())
    .drop(columns=['Parent Order ID'])
)
df_ch =  (
    df_original
    .filter(pl.col("Parent Order ID").is_not_null())
    .drop(columns=['Price', 'Some Value'])
)

ch_buy = df_ch.filter(pl.col("Direction") == 'Buy').drop(columns=['Direction'])
ch_sell = df_ch.filter(pl.col("Direction") == 'Sell').drop(columns=['Direction'])

df_joined = (
    df_parents
    .join(ch_buy, left_on='Order ID', right_on='Parent Order ID', suffix="_Buy")
    .join(ch_sell, left_on='Order ID', right_on='Parent Order ID', suffix="_Sell")
    # The Name and Quote columns in the parent are all empty, so they can go, buy they had to be there for the suffix to work for the first join
    .drop(columns=[f'Name Provider {i}' for i in range(1, 5)])
    .drop(columns=[f'Quote Provider {i}' for i in range(1, 5)])
)

But this still leaves you with a mess where you somehow have to split this into four rows - not eight, as you could easily do with .melt(). Any tips on how to best approach this? Am I missing some obivous method here?

CodePudding user response：

Here's how I've attempted it:

fill the nulls in the Parent Order ID column and use that to .groupby()

>>> columns = ["Order ID", "Direction", "Price", "Some Value"]
... (
...    df_original
...    .with_column(pl.col("Parent Order ID").backward_fill())
...    .groupby("Parent Order ID")
...    .agg([
...       pl.col(columns).first(),                                    # Take values from first ("parent") row
...       pl.concat_list(pl.col("^Name .*$").first()).alias("Name"),  # Put all names into single column:  ["Name1", "Name2", ...]
...       pl.col("^Quote .*$").slice(1).list()                        # Create list for each quote column: [1.1, 1.3], [1.15, 1.25], ...
...    ])
...    .with_column(                                                  # Put all quote lists into single column:
...       pl.concat_list(pl.col("^Quote .*$").list()).alias("Quote")) # [[1.1, 1.3], [1.15, 1.25], ...]]
...    .select(columns   ["Name", "Quote"])                           # Remove no longer used Quote/Name columns:
...    .explode(["Name", "Quote"])                                    # explode Name/Quote lists into rows
...    .with_columns([                                                # split each Quote into individual columns
...       pl.col("Quote").arr.first().alias("Quote Buy"), 
...       pl.col("Quote").arr. last().alias("Quote Sell"),
...    ])
...    .drop("Quote")                                                 # Remove no longer used Quote column
... )
shape: (4, 7)
┌──────────┬───────────┬─────────┬────────────┬──────┬───────────┬────────────┐
│ Order ID | Direction | Price   | Some Value | Name | Quote Buy | Quote Sell │
│ ---      | ---       | ---     | ---        | ---  | ---       | ---        │
│ str      | str       | f64     | i64        | str  | f64       | f64        │
╞══════════╪═══════════╪═════════╪════════════╪══════╪═══════════╪════════════╡
│ A        | Buy       | 1.21003 | 4          | P8   | 1.1       | 1.3        │
├──────────┼───────────┼─────────┼────────────┼──────┼───────────┼────────────┤
│ A        | Buy       | 1.21003 | 4          | P2   | 1.15      | 1.25       │
├──────────┼───────────┼─────────┼────────────┼──────┼───────────┼────────────┤
│ A        | Buy       | 1.21003 | 4          | P1   | 1.0       | 1.4        │
├──────────┼───────────┼─────────┼────────────┼──────┼───────────┼────────────┤
│ A        | Buy       | 1.21003 | 4          | P5   | 1.0       | 1.4        │
└─//───────┴─//────────┴─//──────┴─//─────────┴─//───┴─//────────┴─//─────────┘