Is there a more dense method to conditionally separate data?

Time:12-18

I'm wondering if there is a more dense way to do the following, which essentially splits column-separated data by row into one of three categories, depending on the final entry of the row:

xi_test_0 = [xi_test_sc[i] for i in range(len(xi_test_sc)) if y_test[i] == 0]
xii_test_0 = [xii_test_sc[i] for i in range(len(xii_test_sc)) if y_test[i] == 0]
y_test_0 = [y_test[i] for i in range(len(y_test)) if y_test[i] == 0]

xi_test_1 = [xi_test_sc[i] for i in range(len(xi_test_sc)) if y_test[i] == 1]
xii_test_1 = [xii_test_sc[i] for i in range(len(xii_test_sc)) if y_test[i] == 1]
y_test_1 = [y_test[i] for i in range(len(y_test)) if y_test[i] == 1]

xi_test_2 = [xi_test_sc[i] for i in range(len(xi_test_sc)) if y_test[i] == 2]
xii_test_2 = [xii_test_sc[i] for i in range(len(xii_test_sc)) if y_test[i] == 2]
y_test_2 = [y_test[i] for i in range(len(y_test)) if y_test[i] == 2]

CodePudding user response:

Make a boolean array for each condition you want to split on (this assumes y_test is a NumPy array):

split_index = {i: y_test == i for i in range(3)}

Break your data into groups based on this boolean index.

xi_test_split = {i: xi_test_sc[idx, :] for i, idx in split_index.items()}
xii_test_split = {i: xii_test_sc[idx, :] for i, idx in split_index.items()}
y_test_split = {i: y_test[idx] for i, idx in split_index.items()}
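As a minimal sketch of this approach, assuming xi_test_sc, xii_test_sc and y_test are NumPy arrays with one row (or entry) per sample: the sample values below are invented purely for illustration. Because y_test is 1-D here, it is indexed with the mask alone.

```python
import numpy as np

# Hypothetical sample data: two feature arrays and integer labels in {0, 1, 2}.
xi_test_sc = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
xii_test_sc = np.array([[1.0], [2.0], [3.0], [4.0]])
y_test = np.array([0, 2, 0, 1])

# One boolean mask per label value.
split_index = {i: y_test == i for i in range(3)}

# Select the rows belonging to each label.
xi_test_split = {i: xi_test_sc[idx] for i, idx in split_index.items()}
xii_test_split = {i: xii_test_sc[idx] for i, idx in split_index.items()}
y_test_split = {i: y_test[idx] for i, idx in split_index.items()}
```

With this layout, xi_test_split[0] plays the role of xi_test_0 from the question, and so on for the other labels.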

CodePudding user response:

xi_test_0, xi_test_1, etc. are all initialized by expressions that differ only by a numerical constant, so let's start by turning those into lists:

xi_test = [
    [xi_test_sc[i] for i in range(len(xi_test_sc)) if y_test[i] == j]
    for j in range(3)
]
xii_test = [
    [xii_test_sc[i] for i in range(len(xii_test_sc)) if y_test[i] == j] 
    for j in range(3)
]
y_test_n = [
    [y_test[i] for i in range(len(y_test)) if y_test[i] == j] 
    for j in range(3)
]

We can also simplify the individual list comprehensions by zipping the two lists we're iterating over instead of using range:

xi_test = [
    [a for a, b in zip(xi_test_sc, y_test) if b == j]
    for j in range(3)
]
xii_test = [
    [a for a, b in zip(xii_test_sc, y_test) if b == j]
    for j in range(3)
]
y_test_n = [
    [a for a, b in zip(y_test, y_test) if b == j]
    for j in range(3)
]

Now that we have less code to look at, it's easy to see that these also differ by a single value -- so let's maybe turn the whole thing into a dict:

test_data = {
    name: [[a for a, b in zip(sc_data, y_test) if b == j] for j in range(3)]
    for name, sc_data in (
        ('xi', xi_test_sc), ('xii', xii_test_sc), ('y', y_test)
    )
}

This is the same data produced by your original code, without all the copying and pasting. xi_test_0 is now test_data['xi'][0], y_test_1 is now test_data['y'][1], et cetera.

Or, if you prefer still having each list assigned to a different named variable rather than making the names keys in a dict:

xi, xii, y = (
    [[a for a, b in zip(sc_data, y_test) if b == j] for j in range(3)] 
    for sc_data in (xi_test_sc, xii_test_sc, y_test)
)
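If the nested comprehension feels too dense, the same logic could be pulled into a small helper. This is only a sketch: split_by_label is a hypothetical name, n_classes=3 mirrors the three categories in the question, and the sample lists are made up.

```python
def split_by_label(data, labels, n_classes=3):
    """Return a list of n_classes lists; entry j holds the items whose label is j."""
    return [[a for a, b in zip(data, labels) if b == j] for j in range(n_classes)]


# Hypothetical sample data standing in for the real test sets.
xi_test_sc = [10, 11, 12, 13]
xii_test_sc = [20, 21, 22, 23]
y_test = [0, 2, 0, 1]

test_data = {
    name: split_by_label(sc_data, y_test)
    for name, sc_data in (("xi", xi_test_sc), ("xii", xii_test_sc), ("y", y_test))
}

print(test_data["xi"][0])  # [10, 12]
```

This works on plain lists as well as NumPy arrays, since it only iterates and compares element by element.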

CodePudding user response:

If those are all numpy arrays, you could use a mask:

xi_test_0 = xi_test_sc[y_test == 0]
y_test_0 = y_test[y_test == 0]

xi_test_1 = xi_test_sc[y_test == 1]
xii_test_1 = xii_test_sc[y_test == 1]

... and so on ...

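Or, if you want a single container whose 0, 1, 2 indexes correspond to your _0, _1, _2 variable name suffixes, build one mask row per label and index with each row. The groups may have different sizes, so the result is a list of arrays rather than a rectangular 2D matrix:

mask = y_test == np.arange(3)[:, None]  # shape (3, len(y_test))
xi_test_n = [xi_test_sc[m] for m in mask]
xii_test_n = [xii_test_sc[m] for m in mask]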
...
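A runnable sketch of the masking idea, assuming NumPy arrays with one entry per sample (the values are invented for illustration). The broadcast comparison yields one mask row per label; each row is applied separately because the per-label groups can differ in length.

```python
import numpy as np

# Hypothetical sample data.
xi_test_sc = np.array([10.0, 11.0, 12.0, 13.0])
y_test = np.array([0, 2, 0, 1])

# A single mask selects one category.
xi_test_0 = xi_test_sc[y_test == 0]

# Broadcasting (3, 1) against (4,) gives a (3, 4) boolean array:
# row j is True where the label equals j.
mask = y_test == np.arange(3)[:, None]

# Apply each row on its own; the groups may have different sizes.
xi_test_n = [xi_test_sc[m] for m in mask]
```

Here xi_test_n[0] matches xi_test_0, and xi_test_n[1] and xi_test_n[2] hold the entries for labels 1 and 2.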