Get the intersection within a single column of a pandas DataFrame

Time:07-30

I have a dataframe like this:

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "Date": [
...         pd.to_datetime("2022-07-01"),
...         pd.to_datetime("2020-07-02"),
...         pd.to_datetime("2020-07-03"),
...     ],
...     "Price": [
...         {24.9, 23.0, 22.5, 23.5},
...         {24.9, 25.0, 26.5, 23.7},
...         {25.2, 24.5, 23.6, 23.8},
...     ],
... })

>>> df
        Date                     Price
0 2022-07-01  {24.9, 23.5, 22.5, 23.0}
1 2020-07-02  {24.9, 25.0, 26.5, 23.7}
2 2020-07-03  {24.5, 25.2, 23.8, 23.6}

I want to add a new column 'intersec' that holds the intersection of each row's Price set with the previous row's Price set (that is, the shifted column). But when I use

df['intersec'] = df.Price[1:] & df.Price.shift()[1:]

it doesn't work; I get the following error:

TypeError: unsupported operand type(s) for &: 'set' and 'bool'

What should I do? My expected result is:

>>> df
       Date                     Price  intersec
0  2022/7/1  {24.9, 23.5, 22.5, 23.0}       NaN
1  2020/7/2  {24.9, 25.0, 26.5, 23.7}      24.9
2  2020/7/3  {24.5, 25.2, 23.8, 23.6}       NaN

CodePudding user response:

Shift the price:

df["shifted_price"] = df.Price.shift()

Find the intersection:

df["intersec"] = df[1:].apply(lambda x: list(set(x["Price"]) & set(x["shifted_price"])), axis=1)

Sample output:

        Date                     Price             shifted_price intersec
0 2022-07-01  {24.9, 23.5, 22.5, 23.0}                       NaN      NaN
1 2020-07-02  {24.9, 25.0, 26.5, 23.7}  {24.9, 23.5, 22.5, 23.0}   [24.9]
2 2020-07-03  {24.5, 25.2, 23.8, 23.6}  {24.9, 25.0, 26.5, 23.7}       []

CodePudding user response:

One way to do this would be:

import numpy as np
df["intersec"] = [np.nan] + [p.intersection(pp) if p.intersection(pp) else np.nan
                             for p, pp in zip(df["Price"][1:], df["Price"])]

CodePudding user response:

ans = []
for a, b in zip(df['Price'].values, df['Price'].shift().values):
    try:
        intersection = a & b
    except TypeError:
        # The first shifted value is NaN (a float), and set & float raises TypeError
        intersection = None
    ans.append(intersection)

After the loop, ans is [None, {24.9}, set()]

CodePudding user response:

Since the values of the cells are sets, there is no vectorised operation for this, so I suggest you use a list comprehension (the := operator used here requires Python 3.8 or later):

import numpy as np
df["intersec"] = [i if pd.notna(q) and (i := p & q) else np.nan for p, q in zip(df["Price"], df["Price"].shift())]
print(df)

Output

        Date                     Price intersec
0 2022-07-01  {24.9, 23.0, 22.5, 23.5}      NaN
1 2020-07-02  {24.9, 25.0, 26.5, 23.7}   {24.9}
2 2020-07-03  {24.5, 25.2, 23.6, 23.8}      NaN
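
If the walrus operator is not available, or the one-liner is hard to read, the same pairwise logic can probably be written with a small helper. This is only a minimal sketch: the name intersect_or_nan is hypothetical (it is not part of pandas), and it assumes every non-missing cell in the Price column is a set.

```python
import numpy as np
import pandas as pd


def intersect_or_nan(current, previous):
    """Hypothetical helper: return current & previous, or NaN when either
    value is not a set or when the two sets share no element."""
    if not isinstance(current, (set, frozenset)):
        return np.nan
    if not isinstance(previous, (set, frozenset)):
        # Covers the NaN that shift() places in the first row
        return np.nan
    common = current & previous
    return common if common else np.nan


# Same sample data as in the question
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-07-01", "2020-07-02", "2020-07-03"]),
    "Price": [
        {24.9, 23.0, 22.5, 23.5},
        {24.9, 25.0, 26.5, 23.7},
        {25.2, 24.5, 23.6, 23.8},
    ],
})

# Pair each row's set with the previous row's set and intersect them
df["intersec"] = [
    intersect_or_nan(p, q) for p, q in zip(df["Price"], df["Price"].shift())
]
```

With the sample data this should leave {24.9} in row 1 and NaN in rows 0 and 2, the same as the output above.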