I have two dataframes, dfSpecs and dfData. dfSpecs contains the spec limits for the parameter columns in dfData (LSL = lower spec limit, USL = upper spec limit). The parameter rows in dfSpecs are not necessarily in the same order as the parameter columns in dfData.
import pandas as pd

dfSpecs = pd.DataFrame(
    {'param': ['parameter1', 'parameter2', 'parameter3'],
     'LSL': [0, 10, 200],
     'USL': [1, 20, 300]}
)
print(dfSpecs)
Output:
param LSL USL
0 parameter1 0 1
1 parameter2 10 20
2 parameter3 200 300
dfData = pd.DataFrame(
    {'id': ['foo', 'foo', 'bar', 'bar'],
     'x': [1, 1, 1, 1],
     'y': [1, 2, 1, 2],
     'parameter1': [0.5, 0.6, 1.8, 0.4],
     'parameter2': [12, 14, 21, 15],
     'parameter3': [199, 301, 256, 234]}
)
print(dfData)
Output:
id x y parameter1 parameter2 parameter3
0 foo 1 1 0.5 12 199
1 foo 1 2 0.6 14 301
2 bar 1 1 1.8 21 256
3 bar 1 2 0.4 15 234
Now I want to get a new dataframe containing the count of rows outside the spec limits, per parameter and id.
The expected output looks as follows:
id param fail_count
0 foo parameter1 0
1 foo parameter2 0
2 foo parameter3 2
3 bar parameter1 1
4 bar parameter2 1
5 bar parameter3 0
The following code works, but is very slow (up to several dozens of minutes, depending on the dataframe size):
fails = []
for id, df in dfData.groupby('id'):
    for i, spec in dfSpecs.iterrows():
        fail_count = len(df[spec['param']].loc[(df[spec['param']] < spec['LSL']) | (df[spec['param']] > spec['USL'])])
        fails.append([id, spec['param'], fail_count])
dfFails = pd.DataFrame(fails, columns=['id', 'parameter', 'fail_count'])
Is there a simpler, faster way to achieve this? I suppose there are better ways in pandas.
Edit: rephrasing and adding example output
CodePudding user response:
Thank you for the response; based on your edits, I would suggest the following:
# Collect result rows in a list; appending to a DataFrame in a loop is slow,
# and DataFrame.append is deprecated (removed in pandas 2.0)
rows = []
# Get a list of unique ids in the 'dfData' dataframe
unique_ids = dfData['id'].unique()
# Iterate through the unique ids
for id in unique_ids:
    # Filter the 'dfData' dataframe for the current id
    df_id = dfData[dfData['id'] == id]
    # Get the parameter columns in the filtered dataframe
    unique_params = df_id.columns.difference(['id', 'x', 'y'])
    # Iterate through the parameters
    for param in unique_params:
        # Get the LSL and USL values for the current parameter
        LSL = dfSpecs[dfSpecs['param'] == param]['LSL'].values[0]
        USL = dfSpecs[dfSpecs['param'] == param]['USL'].values[0]
        # Count the rows outside the limits for the current parameter and id
        fail_count = df_id[(df_id[param] < LSL) | (df_id[param] > USL)].shape[0]
        # Store the current id, parameter, and fail_count
        rows.append({'id': id, 'param': param, 'fail_count': fail_count})
# Build the 'dfResults' dataframe once at the end
dfResults = pd.DataFrame(rows, columns=['id', 'param', 'fail_count'])
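This still filters dfSpecs inside the inner loop. As a minimal sketch of a cheaper lookup (assuming the 'param' values in dfSpecs are unique), you could index the specs once and read the limits by label:
# Build the limit lookup once (assumption: 'param' is unique in dfSpecs)
limits = dfSpecs.set_index('param')

rows = []
for id, df_id in dfData.groupby('id'):
    for param in limits.index:
        lsl = limits.at[param, 'LSL']
        usl = limits.at[param, 'USL']
        # Summing a boolean mask counts the out-of-spec rows
        fail_count = int(((df_id[param] < lsl) | (df_id[param] > usl)).sum())
        rows.append({'id': id, 'param': param, 'fail_count': fail_count})
dfResults = pd.DataFrame(rows)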
CodePudding user response:
First you can melt the dfData dataframe, then use dfSpecs as a lookup for the left and right bounds of pd.Series.between inside a transform, assign the result as a new column fail_count with assign, and finally groupby.sum() over id, param to get the count.
specs = dfSpecs.set_index("param")

# Reshape to long format: one row per (id, param, value)
t = dfData.melt(
    value_vars=["parameter1", "parameter2", "parameter3"],
    id_vars=["id"],
    var_name="param"
)

# Per param group, flag values outside [LSL, USL], then count per (id, param)
out = t.assign(
    fail_count=t.groupby(["param"])["value"].transform(
        lambda x: ~x.between(specs.loc[x.name, "LSL"], specs.loc[x.name, "USL"])
    )
).groupby(["id", "param"])["fail_count"].sum().reset_index()
print(out)
id param fail_count
0 bar parameter1 1
1 bar parameter2 1
2 bar parameter3 0
3 foo parameter1 0
4 foo parameter2 0
5 foo parameter3 2
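For reference, a merge-based variant of the same idea that avoids the Python-level lambda in transform (a sketch, assuming the frames defined in the question):
# Melt to long format, attach the limits by merging on 'param',
# flag out-of-spec values, then count them per (id, param)
melted = dfData.melt(id_vars=["id"],
                     value_vars=dfSpecs["param"].tolist(),
                     var_name="param")
merged = melted.merge(dfSpecs, on="param")
merged["fail"] = ~merged["value"].between(merged["LSL"], merged["USL"])
out = (merged.groupby(["id", "param"])["fail"]
             .sum()
             .reset_index(name="fail_count"))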