I have the following Dataframe:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067
13 23 NaN NaN NaN NaN NaN NaN NaN NaN 983.5 BQ0067
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067
17 11 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 BQ0068
18 21 4.83 11.9 28.1 44.2 54.63 16.76 6.70 0.19 953.7 BQ0068
19 22 4.40 10.7 26.3 43.4 57.55 19.85 8.59 0.53 974.9 BQ0068
20 23 17.61 43.8 67.9 122.6 221.20 0.75 0.33 58.27 974.9 BQ0068
21 31 15.09 22.3 33.3 45.6 59.45 0.98 0.38 0.73 1773.7 BQ0068
I wish to do the following things: Steps:
Whenever the
TEST_NUMBER 11
is NaN(null values), I need to remove all rows of perticularPRODUCT_NO
. For example, in the given dataframe,PRODUCT_NO.
BQ0068
hasTEST_NUMBER 11
withNaN
values, hence all rows ofBQ0068
should be removed.If any
TEST_NUMBER
other thanTEST_NUMBER 11
hasNaN
values, then only that perticularTEST_NUMBER's
row should be removed. For example,PRODUCT_NO BQ0067
has row ofTEST_NUMBER 23
withNaN
values. Hence only that perticular row ofTEST_NUMBER 23
should be removed.After doing the above steps, I need to the computation, for example, for
PRODUCT_NO BQ0066
, I need compute the the difference between rows in following way,TEST_NUMBER 21
-TEST_NUMBER 11
,TEST_NUMBER 22
-TEST_NUMBER 11
,TEST_NUMBER 23
-TEST_NUMBER 11
,TEST_NUMBER 24
-TEST_NUMBER 25
,TEST_NUMBER 21
-TEST_NUMBER 11
. And thenTEST_NUMBER 31
-TEST_NUMBER 25
,TEST_NUMBER 32
-TEST_NUMBER 25
,TEST_NUMBER 33
-TEST_NUMBER 25
,TEST_NUMBER 34
-TEST_NUMBER 25
. And carry on the same procedure for successivePRODUCT_NO
. As you can seeTEST_NUMBERS
frequency is different for eachPRODUCT_NO
. But in all cases, everyPRODUCT_NO
will have only oneTEST_NUMBER 11
and the other TEST_NUMBERS will be in range of21 to 29
i.e.21, 22, 23, 24, 25, 26, 27, 28, 29
and31, 32, 33 ,34, 35, 36, 37, 38, 39
PYTHON CODE
def pick_closest_sample(sample_list, sample_no):
sample_list = sorted(sample_list)
buffer = []
for number in sample_list:
if sample_no // 10 == number// 10:
break
else:
buffer.append(number)
if len(buffer) > 0:
return buffer[-1]
return sample_no
def add_closest_sample_df(df):
unique_product_nos = list(df['PRODUCT_NO'].unique())
out = []
for product_no in unique_product_nos:
subset = df[df['PRODUCT_NO'] == product_no]
if subset.iloc[0].isnull().sum() == 0:
subset.dropna(inplace = True)
sample_list = subset['TEST_NUMBER'].to_list()
subset['target_sample'] = subset['TEST_NUMBER'].apply(lambda x: pick_closest_sample(sample_list,x))
out.append(subset)
if len(out)>0:
out = pd.concat(out)
out.dropna(inplace=True)
return out
Output of above two functions:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO target_sample
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066 11
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066 11
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066 11
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066 11
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066 11
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066 11
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066 25
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066 25
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066 25
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066 25
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067 11
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067 11
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067 11
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067 22
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067 22
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067 22
As you can see, all rows of PRODUCT_NO BQ0068
are removed as TEST_NUMBER 11
had NaN
values. Also only row of TEST_NUMBER 23
of PRODUCT_NO BQ0067
is removed as it had NaN values. So the requirements mentioned in the first two steps are met. Now the computation for PRODUCT_NO BQ0067
will be like TEST_NUMBER 31
- TEST_NUMBER 22
, TEST_NUMBER 32
- TEST_NUMBER 22
, TEST_NUMBER 33
- TEST_NUMBER 22
PYTHON CODE
def compute_df(df):
unique_product_nos = list(df['PRODUCT_NO'].unique())
out = []
for product_no in unique_product_nos:
subset = df[df['PRODUCT_NO'] == product_no]
target_list = list(subset['target_sample'].unique())
for target in target_list:
target_df = subset[subset['target_sample'] == target]
target_subset = [subset[subset['TEST_NUMBER'] == target]]*len(target_df)
target_subset = pd.concat(target_subset)
if len(target_subset)> 0:
target_subset.index = target_df.index
diff_cols = ['D1','D10','D50','D90','D99','Q3_15','Q3_10','l-Q3_63','RPM']
for col in diff_cols:
target_df[col '_diff'] = target_df[col] - target_subset[col]
out.append(target_df)
if len(out)>0:
out = pd.concat(out)
return out
Output of the above function:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM ... target_sample D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 ... 11 -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 ... 11 -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 ... 11 -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 ... 11 -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 ... 11 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 ... 25 -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 ... 25 -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 ... 25 -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 ... 25 -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 ... 11 -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 ... 11 -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 ... 22 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 ... 22 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 ... 22 -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Kindly help me optimize the code of three functions I posted, so I could write them in more pythonic way.
CodePudding user response:
Points 1. and 2. can be achieved in a few line with pandas functions.
You can then calculate "target_sample" and your diff_col in the same loop using groupby:
# 1. Whenever TEST_NUMBER == 11 has D1 value NaN, remove all rows with this PRODUCT_NO
drop_prod_no = df[(df.TEST_NUMBER==11) & (df.D1.isna())]["PRODUCT_NO"]
df.drop(df[df.PRODUCT_NO.isin(drop_prod_no)].index, axis=0, inplace=True)
# 2. Drop remaining rows with NaN values
df.dropna(inplace=True)
# 3. set column "target_sample" and calculate diffs
new_df = pd.DataFrame()
diff_cols = ['D1','D10','D50','D90','D99','Q3_15','Q3_10','l-Q3_63','RPM']
for _, subset in df.groupby("PRODUCT_NO"):
closest_sample = last_sample = 11
for index, row in subset.iterrows():
if row.TEST_NUMBER // 10 > closest_sample // 10 1:
closest_sample = last_sample
subset.at[index, "target_sample"] = closest_sample
last_sample = row.TEST_NUMBER
for col in diff_cols:
subset.at[index, col "_diff"] = subset.at[index, col] - float(subset[subset.TEST_NUMBER==closest_sample][col])
new_df = pd.concat([new_df, subset])
print(new_df)
Output:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 ... D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 ... -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 ... -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 ... -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 ... -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 ... 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 ... -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 ... -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 ... -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 ... -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 ... -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 ... -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 ... 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 ... 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 ... -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 79