Selecting rows containing specific int stored in set column using pandas-CodePudding

I have tried and searched on web and am giving up... I would like to ask you for help with following (I am using Python 3 and Pandas):

Imagine df with the VALUES column containing integers in the sets:

In:

            VALUES
0           {2624, 1383, 2855, 1871, 16784, 9811, 2970}
1           {2898, 12, 1871, 223}
2           {2624, 2855, 519, 15662, 1230, 1871, 2970}
3          {2624, 12, 55, 519, 313, 1230, 10588, 2970}
4           {1230, 55, 89564, 1247}

Out desired (only the rows containing specific value e.g. 12 in sets in VALUES, everything else dropped):

            VALUES
1           {2898, 12, 1871, 223}
3          {2624, 12, 55, 519, 313, 1230, 10588, 2970}

In next step, I will need to move int 12 to new separate column. Also, I could not find a way how to convert these sets in lists.

               VALUES                                  NEW_COL
1           [2898, 1871, 223]                             12
3          [2624, 55, 519, 313, 1230, 10588, 2970]        12

Thank you.

CodePudding user response：

Filter for matches :

booleans = [12 in val for val in df.VALUES]
out = df.loc[booleans]

Remove 12 from VALUES to get final output:

out.assign(VALUES = out.VALUES - {12}, NEW_COL = 12)

                                    VALUES  NEW_COL
1                        {2898, 1871, 223}       12
3  {2624, 519, 1230, 55, 313, 2970, 10588}       12

if you want to convert to list:

out.assign(VALUES = (out.VALUES - {12}).map(list), NEW_COL = 12)
                                    VALUES  NEW_COL
1                        [2898, 1871, 223]       12
3  [2624, 519, 1230, 55, 313, 2970, 10588]       12

I'm not sure what you are working on, but performance wise, it is better to deal with structures such as list/sets/tuples... outside Pandas

CodePudding user response：

The key here is to break down your problem into smaller, easier steps:

1. select rows with the desired values
1. create a new dataframe with the selected rows
1. add a new column with the value
1. filter the "VALUES" column to remove the value you are looking for

Taking your example:

import pandas as pd

df = pd.DataFrame({
    "VALUES": [
    (2624, 1383, 2855, 1871, 16784, 9811, 2970),
    (2898, 12, 1871, 223),
    (2624, 2855, 519, 15662, 1230, 1871, 2970),
    (2624, 12, 55, 519, 313, 1230, 10588, 2970),
    (1230, 55, 89564, 1247)
] 
})

select the rows with the desired value val:

val = 12

b = df["VALUES"].apply(lambda tup: val in tup)
df2 = df.loc[b].copy()

create a new column "NEW_COL" and filter out val from the "VALUES" column:

df2["NEW_COL"] = val
df2["VALUES"] = df2["VALUES"].apply(lambda values: [v for v in values if v != val])

wrapping it all in 1 function

def create_df(df: pd.DataFrame, val: int, col_name_value = "VALUES", col_name_new = "NEW_COL"):
    b = df[col_name_value].apply(lambda tup: val in tup)
    df2 = df.loc[b].copy()
    df2[col_name_new] = val
    df2[col_name_value] = df2[col_name_value].apply(lambda values: [v for v in values if v != val])
    return df2

You get the desired output:

create_df(df, 12)

create_df(df, 2624)