Home > Software engineering >  How to apply a function to dataframe with data clusters/neighborhoods seperately?
How to apply a function to dataframe with data clusters/neighborhoods seperately?

Time:10-16

Consider the following table. The first column, Data1, contains data values that are clustered in groups: there are values around 100 and 200. I am wondering how I can apply a function that deals with each data grouping separately, perhaps by applying an if statement that excludes data points with values too far apart to be considered a neighboring data point.

Data1      Value1
99         1     
100        2
101        3
102        4
199        5
200        6
201        7
...        ...

For example, if I want to generate a third column called "Result1" that adds every Data1 cluster's corresponding Value1 together. The result would look something like this, where 1 2 3 4=10 and 5 6 7=18:

Data1      Value1      Result1
99         1           10
100        2           10
101        3           10
102        4           10
199        5           18
200        6           18
201        7           18
...        ...         ...

CodePudding user response:

Try merge_asof:

data = [100,200]
labels = pd.merge_asof(df, pd.DataFrame({'label':data}), 
                       left_on='Data1', right_on='label',
                       direction='nearest')['label']

df['Result1'] = df.groupby(labels)['Value1'].transform('sum')

Output:

   Data1  Value1  Result1
0     99       1       10
1    100       2       10
2    101       3       10
3    102       4       10
4    199       5       18
5    200       6       18
6    201       7       18

CodePudding user response:

In your case, a simple mask aught to do.

mask = df[“Data1”]<150
df.loc[mask,”Result1”] = df.loc[mask,”Value1”].sum()
df.loc[~mask,”Result1”] = ”df.loc[~mask,”Value1”].sum()
  • Related