Split pandas dataframe based on column value

Time:08-14

Given the following data:

test_data = pd.DataFrame({
    "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk",
            "mug", "floor"],
    })

I would like to split this into three dataframes (two, in the edge cases where the matching value starts or ends the frame), based on a particular value within a given column (in this case col).

For example, if the value col = "lamp" were given, I would expect:

df 1
| col   |
|:------|
| wall  |
| wall  |

df 2
| col   |
|:------|
| lamp  |
| lamp  |

df 3
| col   |
|:------|
| desk  |
| desk  |
| desk  |
| mug   |
| floor |

I've tried using the following:

match_str = "lamp"

match_start, match_end = (
    test_data["col"].eq(match_str).loc[lambda x: x].index.min(),
    test_data["col"].eq(match_str).loc[lambda x: x].index.max(),
)

df1_filt = pd.Series(test_data.index).lt(match_start)
df2_filt = pd.Series(test_data.index).between(match_start, match_end)
df3_filt = pd.Series(test_data.index).gt(match_end)

df1, df2, df3 = (
    test_data.loc[df1_filt],
    test_data.loc[df2_filt],
    test_data.loc[df3_filt],
)

This seems to handle the requirement. It assumes col is ordered (i.e. the matching rows form one contiguous block), but if it weren't ordered, the operation wouldn't make sense anyway.
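For reference, the same positional split can be sketched more compactly; this assumes, as above, that the matching rows form one contiguous run:

```python
import numpy as np
import pandas as pd

test_data = pd.DataFrame({
    "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk",
            "mug", "floor"],
})

match_str = "lamp"
# Positional indices of the matching rows (assumed to be one contiguous run).
matches = np.flatnonzero(test_data["col"].eq(match_str))

# Slice before the first match, across the run, and after the last match.
i0, i1 = matches[0], matches[-1] + 1
df1, df2, df3 = test_data.iloc[:i0], test_data.iloc[i0:i1], test_data.iloc[i1:]
```

If the match value starts or ends the frame, the corresponding outer slice is simply empty.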

CodePudding user response:

This is like the behaviour of itertools.groupby, right? We need to group adjacent rows according to whether they equal the search value. The pandas imitation of Python's groupby is the "diff-ne(0)-cumsum" idiom, so here we go:

In [301]: df
Out[301]:
     col
0   wall
1   wall
2   lamp
3   lamp
4   desk
5   desk
6   desk
7    mug
8  floor

In [302]: [sub_frame
           for _, sub_frame in df.groupby(df.col.eq("lamp").diff().ne(0).cumsum())]
Out[302]:
[    col
 0  wall
 1  wall,
     col
 2  lamp
 3  lamp,
      col
 4   desk
 5   desk
 6   desk
 7    mug
 8  floor]

It gives a list of three dataframes: before the "lamp" run, during it, and after. The edge cases are respected as well: if the run starts or ends the frame, the list simply contains fewer dataframes.
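To see that edge-case behaviour concretely, here is a small sketch searching for a value that opens the frame; the same expression just yields two groups instead of three:

```python
import pandas as pd

df = pd.DataFrame({"col": ["wall", "wall", "lamp", "lamp", "desk",
                           "desk", "desk", "mug", "floor"]})

# "wall" starts the frame, so there is no "before" group.
parts = [g for _, g in df.groupby(df.col.eq("wall").diff().ne(0).cumsum())]
```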

CodePudding user response:

Any time you find yourself trying to split something into an unknown number of variables dynamically, it should set off a red flag. I would suggest creating a group flag in the dataset, then using that to group by or iterate over.

import pandas as pd
test_data = pd.DataFrame(
    {
        "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk", "mug", "floor"],
    }
)

test_data['group'] = test_data['col'].eq('mug').diff().ne(0).cumsum()
print(test_data)

Output

     col  group
0   wall      1
1   wall      1
2   lamp      1
3   lamp      1
4   desk      1
5   desk      1
6   desk      1
7    mug      2
8  floor      3

If you MUST split them for whatever reason, at least use a dictionary to store them so you can handle various numbers of dataframes coming back.

import pandas as pd

test_data = pd.DataFrame(
    {
        "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk", "mug", "floor"],
    }
)

grouper = test_data['col'].eq('mug').diff().ne(0).cumsum()
output = {group: data for group, data in test_data.groupby(grouper)}

print(output[2])

Results

   col
7  mug
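If the goal is specifically the block that matches, the group key can be found instead of hard-coded. A sketch using the same idiom (`match_groups` is an illustrative name, not from the answer above):

```python
import pandas as pd

test_data = pd.DataFrame(
    {"col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk", "mug", "floor"]}
)

grouper = test_data["col"].eq("mug").diff().ne(0).cumsum()
output = {group: data for group, data in test_data.groupby(grouper)}

# Keep only the group(s) whose rows actually equal the search value.
match_groups = {k: v for k, v in output.items() if v["col"].eq("mug").all()}
```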

CodePudding user response:

match_str = 'lamp'
# breaking point: index of the matching rows
bp = test_data.loc[test_data['col'] == match_str, :].index

# before bp (smaller than bp's head)
b_bp = test_data.index < bp[0]

# after bp (greater than bp's tail)
a_bp = test_data.index > bp[-1]

df1 = test_data.loc[b_bp]
df1
###
    col
0  wall
1  wall
df2 = test_data.loc[bp]
df2
###
    col
2  lamp
3  lamp
df3 = test_data.loc[a_bp]
df3
###
     col
4   desk
5   desk
6   desk
7    mug
8  floor
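One caveat with this approach (an addition, not part of the answer above): if the match value is absent, bp is empty and bp[0] raises an IndexError. A small guard helps, with split_on as an illustrative helper name:

```python
import pandas as pd

test_data = pd.DataFrame({"col": ["wall", "wall", "lamp", "lamp", "desk",
                                  "desk", "desk", "mug", "floor"]})

def split_on(df, column, value):
    """Split df into (before, during, after) around a contiguous run of value."""
    bp = df.index[df[column] == value]
    if bp.empty:
        # No match: everything is "before", the other parts are empty.
        return df, df.iloc[0:0], df.iloc[0:0]
    return df.loc[df.index < bp[0]], df.loc[bp], df.loc[df.index > bp[-1]]

before, during, after = split_on(test_data, "col", "floor")
```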