Selecting with Indexing is an anti-pattern in Polars: How to parse and transform (select/filter?) a-CodePudding

I would like to read the following (quite broken, IMO) CSV with Pola-rs, coming from a Rigol MSO5000 oscilloscope:

D7-D0,D15-D8,t0 = -25.01s, tInc = 2e-06,
 2.470000E 02,2.000000E 00,,
 1.590000E 02,1.600000E 01,,
 2.400000E 02,2.000000E 00,,
 2.470000E 02, 1.300000E 02,,
 1.590000E 02,1.800000E 01,,
 2.470000E 02, 1.300000E 02,,
9.500000E 01,1.800000E 01,,
9.500000E 01,1.800000E 01,,
 2.400000E 02,0.000000E 00,,
(...)

Here's my current Jupyter Notebook iteration/attempt before finding out that Selecting with indexing is discouraged in Pola-rs:

import polars as pl

df = pl.read_csv("normal0.csv")

# Grab initial condition and increments
t0 = df.columns[2]; assert "t0" in t0; t0 = float(t0.split('=')[1].replace('s', '').strip())
tinc = df.columns[3]; assert "tInc" in tinc; tinc = float(tinc.split('=')[1].strip())

# TODO: Generate Series() from t0 tinc and include it in the final DataFrame

# Reshape and cleanup
probes = df.with_column(df["D7-D0"].cast(pl.Float32).alias("D0-D7 floats")) \
             .with_column(df["D15-D8"].cast(pl.Float32).alias("D15-D8 floats")) \
             .drop(df.columns[2]) \
             .drop(df.columns[3])
(...)

def split_probes(probes: pl.Series):
    ''' Splits bundles of probe cables such as D0-D7 or D15-D8 into individual dataframe columns
    '''
    out = pl.DataFrame(columns=["D" str(line) for line in range(0,16)])
#    for row in range(probes.height):
#        for probe in range(0, 7):
#            out["D" str(probe)].with_columns(probes["D0-D7 floats"][row % (probe   1)])
#         for probe in reversed(range(9, 16)): # TODO: Fix future captures or parametrise this
#             outprobes["D15-D8 floats"][row % probe]

And here's my lower-level CSV parsing Rust pseudocode approach when I was told on Polars Discord that this problem might not be optimally solvable with dataframe libraries:

use csv;
use std::error::Error;
use std::io;
use std::process;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct OrigOscilloscope {
    #[serde(rename = "D7-D0")]
    d7_d0: String, // TODO: Unfortunately those fields are "user-flippable" in order from the scope, i.e: d0_d7 vs d7_d0
    #[serde(rename = "D15-D8")]
    d15_d8: String,
    // Do not even register those on Serde as they are empty rows anyway
    // t0: Option<String>,
    // t_inc: Option<String>
}

#[derive(Debug, Deserialize)]
struct LAProbesOscilloscopeState {
    //Vec<d0...d15>: Vec<Vec<16*f32>>, // TODO: More appropriate struct representation for the target dataframe
    d0: f32,
    d1: f32,
    d2: f32,
    d3: f32,
    d4: f32,
    d5: f32,
    d6: f32,
    d7: f32,
    d8: f32,
    d9: f32,
    d10: f32,
    d11: f32,
    d12: f32,
    d13: f32,
    d14: f32,
    d15: f32,
    timestamp: f32
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true) // ignore broken header
        .from_reader(io::stdin());

    // Get timeseries information from header... :facepalm: rigol
    // Initial timestamp...
    let header = rdr.headers()?.clone();
    let t0_header: Vec<&str> = header[2].split('=').collect();
    let t0 = t0_header[1].trim_start().replace('s', "").parse::<f32>()?;
    // ...and increments
    let tinc_header: Vec<&str> = header[3].split('=').collect();
    let tinc = tinc_header[1].trim_start().parse::<f32>()?;
    println!("Initial timestamp {t0} with increments of {tinc} seconds");

    // Now do the splitting of wires from its Dx-Dy "bundles"
    let mut timestamp = t0;
    for result in rdr.deserialize().skip(1) {
        let row: OrigOscilloscope = result?;

        // if rdr.position().line().rem_euclid(8) {
        // // Read D0-D15 field(s), expanding them into the current row, matching its column
        // }
        // println!("{:#?}", row.d7_d0.parse::<f32>()?);

        // update timestamp for this row
        timestamp = timestamp   tinc;
    }
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

I hope that it's clear that the target dataframe desired would look like this:

d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15, timestamp
95, 95, 247, 159, 247, 240, 159, 247 (...)                            -25.01 000000.2
(...)

And the resulting code should (ideally?) be using Polars efficiently. Also, as mentioned above, avoiding "Selecting with Indexing" since it might be completely deprecated in the future in Polars.

CodePudding user response：

Perhaps you are looking for unstack (which is incredibly performant).

Let's start with this data (which simply replicates values that you provided). I've also changed the column names, just to make things easier to inspect:

import polars as pl
from io import StringIO

normal_csv = """D7-D0,D15-D8,t0 = -25.01s, tInc = 2e-06,
 2.470000E 02,2.000000E 00,,
 1.590000E 02,1.600000E 01,,
 2.400000E 02,2.000000E 00,,
 2.470000E 02, 1.300000E 02,,
 1.590000E 02,1.800000E 01,,
 2.470000E 02, 1.300000E 02,,
9.500000E 01,1.800000E 01,,
9.500000E 01,1.800000E 01,,
 2.470000E 02,2.000000E 00,,
 1.590000E 02,1.600000E 01,,
 2.400000E 02,2.000000E 00,,
 2.470000E 02, 1.300000E 02,,
 1.590000E 02,1.800000E 01,,
 2.470000E 02, 1.300000E 02,,
9.500000E 01,1.800000E 01,,
9.500000E 01,1.800000E 01,,"""


df = pl.read_csv(file=StringIO(normal_csv),
                 new_columns=('D', 'E', 't0', 'tInc'))
df

shape: (16, 4)
┌───────┬───────┬──────┬──────┐
│ D     ┆ E     ┆ t0   ┆ tInc │
│ ---   ┆ ---   ┆ ---  ┆ ---  │
│ f64   ┆ f64   ┆ str  ┆ str  │
╞═══════╪═══════╪══════╪══════╡
│ 247.0 ┆ 2.0   ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 16.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 240.0 ┆ 2.0   ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 18.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0  ┆ 18.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0  ┆ 18.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 2.0   ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 16.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 240.0 ┆ 2.0   ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 18.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0  ┆ 18.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0  ┆ 18.0  ┆ null ┆ null │
└───────┴───────┴──────┴──────┘

Since I'm not familiar with your data source, I'm going to assume that your input data has a regularized pattern -- specifically that D7-D0 are always provided (i.e., no skipped lines in the csv file).

If so, here's some highly performant code that should get the ball rolling...

(
    df
    .reverse()
    .unstack(step=8,
             how='horizontal',
             columns=('D', 'E'),
             )
)

shape: (2, 16)
┌──────┬──────┬───────┬───────┬───────┬───────┬───────┬───────┬──────┬──────┬───────┬──────┬───────┬─────┬──────┬─────┐
│ D_0  ┆ D_1  ┆ D_2   ┆ D_3   ┆ D_4   ┆ D_5   ┆ D_6   ┆ D_7   ┆ E_0  ┆ E_1  ┆ E_2   ┆ E_3  ┆ E_4   ┆ E_5 ┆ E_6  ┆ E_7 │
│ ---  ┆ ---  ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---  ┆ ---  ┆ ---   ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ --- │
│ f64  ┆ f64  ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64  ┆ f64  ┆ f64   ┆ f64  ┆ f64   ┆ f64 ┆ f64  ┆ f64 │
╞══════╪══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪══════╪══════╪═══════╪══════╪═══════╪═════╪══════╪═════╡
│ 95.0 ┆ 95.0 ┆ 247.0 ┆ 159.0 ┆ 247.0 ┆ 240.0 ┆ 159.0 ┆ 247.0 ┆ 18.0 ┆ 18.0 ┆ 130.0 ┆ 18.0 ┆ 130.0 ┆ 2.0 ┆ 16.0 ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 95.0 ┆ 95.0 ┆ 247.0 ┆ 159.0 ┆ 247.0 ┆ 240.0 ┆ 159.0 ┆ 247.0 ┆ 18.0 ┆ 18.0 ┆ 130.0 ┆ 18.0 ┆ 130.0 ┆ 2.0 ┆ 16.0 ┆ 2.0 │
└──────┴──────┴───────┴───────┴───────┴───────┴───────┴───────┴──────┴──────┴───────┴──────┴───────┴─────┴──────┴─────┘

This doesn't have your desired column names (which you can change). Nor does it contain the timing information. But it may help you get started in the right direction.