Why doesn't pandas offer a built-in Rubik's-cube transformation?


I'm trying to achieve this weird transformation:

[Image: scattered, coloured 6-cell blocks reshaped into a two-column REF A / REF B table]

The input is made of blocks of 6 cells each, placed at random positions (coloured in the image above).

The row order in the output does not matter! Here is the DataFrame used:

import pandas as pd
from numpy import nan

dico = {0: ['aa', 'bb', 'ee', nan, 'cc', nan, nan, nan, nan, 'REF A', 'REF B'],
 1: ['dd', 'REF A', 'REF B', nan, nan, nan, nan, 'jj', nan, nan, nan],
 2: ['ff', nan, nan, nan, nan, 'REF A', 'REF B', nan, nan, 'val7', 9393],
 3: [nan, 'val1', 123, 'kk', nan, nan, nan, nan, nan, 'ff', nan],
 4: [nan, nan, nan, nan, nan, 'val2', 4194, 'hh', nan, 'REF A', 'REF B'],
 5: ['REF A', 'REF B', 'REF A', 'REF B', nan, nan, nan, nan, nan, nan, nan],
 6: [nan, nan, nan, nan, nan, 'REF A', 'REF B', nan, nan, 'val6', 113],
 7: ['val3', 2334, 'val4', 4488, nan, nan, nan, nan, nan, nan, nan],
 8: [nan, nan, 'gg', nan, nan, 'val5', 773, nan, nan, 'ii', nan]}

df = pd.DataFrame(dico)
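
For reference, here is the expected output, built by hand for illustration (the row order is irrelevant):

expected = pd.DataFrame({
    'REF A': ['val1', 'val2', 'val3', 'val4', 'val5', 'val6', 'val7'],
    'REF B': [123, 4194, 2334, 4488, 773, 113, 9393],
})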

Do you have any suggestions? Thank you so much!

CodePudding user response:

Your Excel file has colours and other metadata that show how the REF labels, the val strings, and the numbers are connected. Let's take advantage of that with xlsx_cells from pyjanitor. Under the hood it uses openpyxl to generate a pandas DataFrame with all the metadata, which we can then reshape into the desired form:

# pip install pyjanitor
import numpy as np  # needed for np.where below
import pandas as pd
from janitor import xlsx_cells

path = 'Downloads/rubic.xlsx'
df = xlsx_cells(path, include_blank_cells=False, sheetnames=0, fill=True)
target = df.loc[df.fill.str['fgColor'].str['tint'].ne(0), ['value']]
cond = target.value.str.startswith('REF', na=False)
other = np.where(~cond, target.value, np.nan)
# fill the nulls above with the values below
target = target.assign(other = other).bfill()
target = target.loc[target.value != target.other]
index = target.groupby('value').cumcount()
(target
.assign(index = index)
.pivot(index = 'index', columns = 'value', values = 'other')
.rename_axis(index=None, columns=None)
)

  REF A REF B
0  val3  2334
1  val1   123
2  val4  4488
3  val2  4194
4  val5   773
5  val7  9393
6  val6   113

BREAKDOWN:

First, we read in the data - xlsx_cells puts every cell into its own row, along with some relevant metadata:
path = 'Downloads/rubic.xlsx'
df = xlsx_cells(path, include_blank_cells=False, sheetnames=0, fill=True)
df.head()

  value internal_value coordinate  row  column data_type  is_date number_format                                               fill
0     0              0         A1    1       1         n    False       General  {'patternType': None, 'fgColor': {'rgb': '0000...
1     1              1         B1    1       2         n    False       General  {'patternType': None, 'fgColor': {'rgb': '0000...
2     2              2         C1    1       3         n    False       General  {'patternType': None, 'fgColor': {'rgb': '0000...
3     3              3         D1    1       4         n    False       General  {'patternType': None, 'fgColor': {'rgb': '0000...
4     4              4         E1    1       5         n    False       General  {'patternType': None, 'fgColor': {'rgb': '0000...

We know the related cells are connected by colour; you can access the fill column to see the sub-keys - internally, fill is a dictionary per row:

df.fill[0]
{'patternType': None,
 'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0},
 'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}

The useful metadata available here is the tint - for the relevant rows that are coloured, the tint is not zero - let's use that to filter just for the rows we are interested in:

target = df.loc[df.fill.str['fgColor'].str['tint'].ne(0), ['value']]
target.head()
    value
12  REF A
13   val3
15  REF A
16   val1
17  REF B

We also know from the data that after each REF comes either val or a number; we'll use that pattern to reshape accordingly:

cond = target.value.str.startswith('REF', na=False)
other = np.where(~cond, target.value, np.nan)
# fill the nulls above with the values below
target = target.assign(other = other).bfill()
target = target.loc[target.value != target.other]
target.head()

    value other
12  REF A  val3
15  REF A  val1
17  REF B  2334
20  REF B   123
22  REF A  val4

All that's left is to pivot - we create a unique column with groupby.cumcount before pivoting:

index = target.groupby('value').cumcount()
(target
.assign(index = index)
.pivot(index = 'index', columns = 'value', values = 'other')
.rename_axis(index=None, columns=None)
)

  REF A REF B
0  val3  2334
1  val1   123
2  val4  4488
3  val2  4194
4  val5   773
5  val7  9393
6  val6   113

This is specific to the data you shared; for other layouts you might have to do more or less wrangling, and it depends on being able to identify the right pattern.
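
For instance, if the blocks in another file were keyed by a concrete fill colour rather than a non-zero tint, the filter could match on the rgb sub-key instead - a minimal sketch, assuming a hypothetical colour code 'FFFFFF00':

# hypothetical: match a specific fill colour instead of a non-zero tint
target = df.loc[df.fill.str['fgColor'].str['rgb'].eq('FFFFFF00'), ['value']]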

Another possible solution is with numpy, working directly on the question's original df (the one built from dico); again, this is specific to the data shared.

There is a pattern: REF A is paired with val*, while REF B is paired with the numbers. We also know that each number sits immediately below its val. Let's use that pattern to generate the dataframe:

import numpy as np

# get booleans for cells that start with `val`,
# convert to numpy and index within numpy space
bools = df.apply(lambda col: col.str.startswith("val")).to_numpy(na_value=False)
bools = bools.astype(bool)  # np.bool8 is deprecated in recent numpy
arr = df.to_numpy()
vals = arr[bools]
rows, cols = bools.nonzero()
# we know that each number sits in the row immediately
# underneath its val, hence the +1 when indexing the rows
numbers = arr[rows + 1, cols]
pd.DataFrame({'REF A': vals, 'REF B': numbers})
  REF A REF B
0  val3  2334
1  val1   123
2  val4  4488
3  val2  4194
4  val5   773
5  val7  9393
6  val6   113
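
Since the row order is irrelevant, a quick sanity check that both routes agree (hypothetical names - store each result in a variable first, e.g. res_janitor and res_numpy):

# sort both results by 'REF A' before comparing, since the row order differs
same = (res_janitor.sort_values('REF A', ignore_index=True)
                   .equals(res_numpy.sort_values('REF A', ignore_index=True)))
print(same)  # expect True for this dataset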

CodePudding user response:

I came up with this solution (which admittedly may not be the best one):

First of all, fill the gaps just to make sure that every block has both values for REF A and REF B:

df.fillna("Missing Value", inplace=True)

Then,

# --- Masking any value outside the 4 relevant cells of each block
m1 = ~(df[df.columns].isin(['REF A', 'REF B']))
m2 = ~(df[df.columns].isin(['REF A', 'REF B'])).shift(2, axis=1).fillna(False)

out1 = (
        pd.DataFrame(df.mask(m1 & m2)
                       .stack()
                       .reset_index()
                       .groupby("level_1").agg(list)
                       .pop(0).tolist())
          .stack()
       )
#<class 'pandas.core.series.Series'>
# --- Getting rid of any row containing the expected column names
out2 = (
        out1.loc[~out1.str.startswith("REF").fillna(False)]
            .reset_index(level=0)
            .drop(columns="level_0")
            .reset_index(drop=True)
       )
#<class 'pandas.core.frame.DataFrame'>
# --- Making the final dataframe by joining every row with the next one
out3 = (
            out2.rename(columns={0: "REF A"})
                .join(out2.shift(-1).rename(columns={0: 'REF B'}))
                .iloc[::2]
                .sort_values(by="REF A") #optionally
                .reset_index(drop=True)
       )
#<class 'pandas.core.frame.DataFrame'>

Output:

print(out3)

  REF A REF B
0  val1   123
1  val2  4194
2  val3  2334
3  val4  4488
4  val5   773
5  val6   113
6  val7  9393
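
If a block were actually missing one of its two cells, the 'Missing Value' placeholder from the fillna step would surface in out3; an illustrative follow-up to drop such incomplete pairs:

# drop any pair where either side still holds the placeholder
out3 = out3.loc[~out3.isin(['Missing Value']).any(axis=1)].reset_index(drop=True)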