Home > Software engineering >  Group by date and divide columns by ID variable in python
Group by date and divide columns by ID variable in python

Time:06-25

I have the a data frame like this:

data = {'Result1':[10, 11, 12, 44],
'Result2':[6, 7, 55, 114],
    'ID':['A', 'B', 'A', 'B'],
    'anoter_id':["one", "one", "two", "two"],
    'date' : ['2022-06-06', '2022-06-06', '2022-06-01', '2022-06-01']
    }

df = pd.DataFrame(data)

And I want to group by date and anoter_id and keep the lines where Result1 and Result2 of id==B is 10% bigger or lower than Result1andResult2of id==A`.

I also want to have a column with this difference, like "Result1_B/Result1_A" and "Result2_B/Result2_A"


My initial idea was to do this:

df.query("ID == 'A'").sort_values('date').Result1/df.query("ID == 'B'").sort_values('date').Result1

Moreover is super verbose and I have to write on line for each column on my dataset.

Any ideas?

EDIT: An example of output is below. On Result1_B/Result1_A_perDate the results come from 11/10 and 7/6, that corresponds to Result1 of id B divided by Result1 of id A for anoter_id 'one' . The column inside_10%_limit is TRUE if both columns Result1_B/Result1_A_perDate and Result2_B/Result2_A_perDate are less or more than maximum 10%

output = {'Result1_B/Result1_A_perDate':[1.1, 1.16],
    'Result2_B/Result2_A_perDate':[3.6, 2.0],
    'anoter_id':["one", "two"],
    'date' : ['2022-06-06', '2022-06-01'],
      'inside_10%_limit': ['True', 'False']
    }

pd.DataFrame(output)

CodePudding user response:

The logic is still not fully clear, but here is my assumption:

# pivot DataFrame
df2 = df.pivot(index=['anoter_id', 'date'], columns='ID')

# Use MultiIndex selection to divide B by A
out = df2.xs('B', level=1, axis=1).div(df2.xs('A', level=1, axis=1)).add_suffix('_B/A')

# get top  10% percentile
thresh = out.stack().quantile(0.9, interpolation='nearest')
# 3.6666

# identify rows with both values above or below threshold
out['10%lim'] = out.gt(thresh).all(1) | out.lt(thresh).all(1)

# index as columns
out = out.reset_index()

output:

  anoter_id        date  Result1_B/A  Result2_B/A  10%lim
0       one  2022-06-06     1.100000     1.166667    True
1       two  2022-06-01     3.666667     2.072727   False
  • Related