I have tried and searched on web and am giving up... I would like to ask you for help with following (I am using Python 3 and Pandas):
Imagine df with the VALUES column containing integers in the sets:
In:
VALUES
0 {2624, 1383, 2855, 1871, 16784, 9811, 2970}
1 {2898, 12, 1871, 223}
2 {2624, 2855, 519, 15662, 1230, 1871, 2970}
3 {2624, 12, 55, 519, 313, 1230, 10588, 2970}
4 {1230, 55, 89564, 1247}
Out desired (only the rows containing specific value e.g. 12 in sets in VALUES, everything else dropped):
VALUES
1 {2898, 12, 1871, 223}
3 {2624, 12, 55, 519, 313, 1230, 10588, 2970}
In next step, I will need to move int 12 to new separate column. Also, I could not find a way how to convert these sets in lists.
VALUES NEW_COL
1 [2898, 1871, 223] 12
3 [2624, 55, 519, 313, 1230, 10588, 2970] 12
Thank you.
CodePudding user response:
Filter for matches :
booleans = [12 in val for val in df.VALUES]
out = df.loc[booleans]
Remove 12 from VALUES to get final output:
out.assign(VALUES = out.VALUES - {12}, NEW_COL = 12)
VALUES NEW_COL
1 {2898, 1871, 223} 12
3 {2624, 519, 1230, 55, 313, 2970, 10588} 12
if you want to convert to list:
out.assign(VALUES = (out.VALUES - {12}).map(list), NEW_COL = 12)
VALUES NEW_COL
1 [2898, 1871, 223] 12
3 [2624, 519, 1230, 55, 313, 2970, 10588] 12
I'm not sure what you are working on, but performance wise, it is better to deal with structures such as list/sets/tuples... outside Pandas
CodePudding user response:
The key here is to break down your problem into smaller, easier steps:
-
- select rows with the desired values
-
- create a new dataframe with the selected rows
-
- add a new column with the value
-
- filter the "VALUES" column to remove the value you are looking for
Taking your example:
import pandas as pd
df = pd.DataFrame({
"VALUES": [
(2624, 1383, 2855, 1871, 16784, 9811, 2970),
(2898, 12, 1871, 223),
(2624, 2855, 519, 15662, 1230, 1871, 2970),
(2624, 12, 55, 519, 313, 1230, 10588, 2970),
(1230, 55, 89564, 1247)
]
})
select the rows with the desired value val
:
val = 12
b = df["VALUES"].apply(lambda tup: val in tup)
df2 = df.loc[b].copy()
create a new column "NEW_COL"
and filter out val
from the "VALUES"
column:
df2["NEW_COL"] = val
df2["VALUES"] = df2["VALUES"].apply(lambda values: [v for v in values if v != val])
wrapping it all in 1 function
def create_df(df: pd.DataFrame, val: int, col_name_value = "VALUES", col_name_new = "NEW_COL"):
b = df[col_name_value].apply(lambda tup: val in tup)
df2 = df.loc[b].copy()
df2[col_name_new] = val
df2[col_name_value] = df2[col_name_value].apply(lambda values: [v for v in values if v != val])
return df2
You get the desired output:
create_df(df, 12)
create_df(df, 2624)