I have been trying to select the most similar molecule in the data below using Python. Since I'm new to Python programming, I couldn't do more than plotting. How can I take all the factors into account, such as surface area, volume, and ovality, when choosing the best molecule? The most similar molecule should replicate the drug V0L in all aspects. V0L is the actual drug (the last row); the rest are the candidate molecules.
Mol Su Vol Su/Vol PSA Ov D A Mw Vina
1. 1 357.18 333.9 1.069721473 143.239 1.53 5 10 369.35 -8.3
2. 2 510.31 496.15 1.028539756 137.388 1.68 6 12 562.522 -8.8
3. 3 507.07 449.84 1.127223013 161.116 1.68 6 12 516.527 -9.0
4. 4 536.54 524.75 1.022467842 172.004 1.71 7 13 555.564 -9.8
5. 5 513.67 499.05 1.029295662 180.428 1.69 7 13 532.526 -8.9
6. 6 391.19 371.71 1.052406446 152.437 1.56 6 11 408.387 -8.9
7. 7 540.01 528.8 1.021198941 149.769 1.71 7 13 565.559 -9.4
8. 8 534.81 525.99 1.01676838 174.741 1.7 7 13 555.564 -9.3
9. 9 533.42 520.67 1.024487679 181.606 1.7 7 14 566.547 -9.7
10. 10 532.52 529.47 1.005760477 179.053 1.68 8 14 571.563 -9.4
11. 11 366.72 345.89 1.060221458 159.973 1.54 6 11 385.349 -8.2
12. 12 520.75 504.36 1.032496629 168.866 1.7 6 13 542.521 -8.7
13. 13 512.69 499 1.02743487 179.477 1.69 7 13 532.526 -8.6
14. 14 542.78 531.52 1.021184527 189.293 1.71 7 14 571.563 -9.6
15. 15 519.04 505.7 1.026379276 196.982 1.69 8 14 548.525 -8.8
16. 16 328.95 314.03 1.047511384 125.069 1.47 4 9 339.324 -6.9
17. 17 451.68 444.63 1.01585588 118.025 1.6 5 10 466.47 -9.4
18. 18 469.67 466.11 1.007637682 130.99 1.62 5 11 486.501 -8.3
19. 19 500.79 498.09 1.005420707 146.805 1.65 6 12 525.538 -9.8
20. 20 476.59 473.03 1.00752595 149.821 1.62 6 12 502.5 -8.4
21. 21 357.84 347.14 1.030823299 138.147 1.5 5 10 378.361 -8.6
22. 22 484.15 477.28 1.014394066 129.93 1.64 6 11 505.507 -10.2
23. 23 502.15 498.71 1.006897796 142.918 1.65 6 12 525.538 -9.3
24. 24 526.73 530.31 0.993249232 154.106 1.66 7 13 564.575 -9.9
25. 25 509.34 505.64 1.007317459 161.844 1.66 7 13 541.537 -9.2
26. 26 337.53 320.98 1.051560845 144.797 1.49 5 10 355.323 -7.1
27. 27 460.25 451.58 1.019199256 137.732 1.62 5 11 482.469 -9.6
28. 28 478.4 473.25 1.010882198 155.442 1.63 6 12 502.5 -8.9
29. 29 507.62 505.68 1.003836418 161.884 1.65 6 13 541.537 -9.2
30. 30 482.27 479.07 1.006679608 171.298 1.63 7 13 518.499 -9.1
31. V0L 355.19 333.42 1.065293024 59.105 1.530 0 9 345.37 -10.4
- Su = Surface Area in squared angstrom
- Vol = Volume in cubic angstrom
- PSA = Polar Surface Area in squared angstrom
- Ov = Ovality
- D = Number of hydrogen bond donating groups
- A = Number of hydrogen bond accepting groups
- Vina = Binding affinity (lower is better)
- Mw = Molecular Weight
- Mol = The number of molecule candidate
CodePudding user response:
I have done that and made some basic plots, but I want the program to consider all the factors, give me a plot or some other form of output, and pick the most similar molecule.
CodePudding user response:
To find the most similar molecule, we can compute the Euclidean distance between every candidate row and the last row (V0L), and pick the row with the smallest distance:
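# Make the last row (V0L, the reference drug) a one-row dataframe named `df1`
df1 = df[30:31]
# Put the 30 candidate rows in another dataframe, leaving out the V0L row
# (otherwise V0L matches itself with distance 0)
df2 = df[0:30]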
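Then use the scipy.spatial package (this assumes the dataframe holds only numeric columns):
import scipy.spatial
# cdist returns a 2-D array with one distance per row of df2
ary = scipy.spatial.distance.cdist(df2, df1, metric='euclidean')
# Flatten to 1-D so it can be used as a boolean row mask on df2
df2[ary.ravel() == ary.min()]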
Output
This output was produced with the earlier version of the dataframe, before the question was edited:
Molecule SurfaceAr Volume PSA Ovality HBD HBA Mw Vina BA Su/Vol
15 RiboseGly 1.047511 314.03 125.069 1.47 4 9 339.324 -6.9 0.003336
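One caveat with a plain Euclidean distance is that columns on a large scale (Su, Vol, PSA, Mw) dominate columns on a small scale (Ov, D, A, Vina). The sketch below is only an illustration, not a definitive method: it rebuilds a small dataframe from six of the candidate rows in the question plus V0L, then compares the raw distance with a distance computed after z-scoring each column. The choice of z-scoring, the choice of columns (Su/Vol is left out, Vina is kept), and the variable names are all assumptions made for this example.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Subset of the question's table (six candidates plus the reference drug V0L).
cols = ["Mol", "Su", "Vol", "PSA", "Ov", "D", "A", "Mw", "Vina"]
rows = [
    ("1", 357.18, 333.90, 143.239, 1.53, 5, 10, 369.350, -8.3),
    ("6", 391.19, 371.71, 152.437, 1.56, 6, 11, 408.387, -8.9),
    ("11", 366.72, 345.89, 159.973, 1.54, 6, 11, 385.349, -8.2),
    ("16", 328.95, 314.03, 125.069, 1.47, 4, 9, 339.324, -6.9),
    ("21", 357.84, 347.14, 138.147, 1.50, 5, 10, 378.361, -8.6),
    ("26", 337.53, 320.98, 144.797, 1.49, 5, 10, 355.323, -7.1),
    ("V0L", 355.19, 333.42, 59.105, 1.53, 0, 9, 345.370, -10.4),
]
df = pd.DataFrame(rows, columns=cols).set_index("Mol")

# Split the reference drug from the candidates so V0L is never compared to itself.
reference = df.loc[["V0L"]]
candidates = df.drop(index="V0L")

# Raw Euclidean distance: large-valued columns dominate the result.
raw = cdist(
    candidates.to_numpy(dtype=float),
    reference.to_numpy(dtype=float),
    metric="euclidean",
).ravel()

# Assumption: z-score every column using the candidates' mean and standard
# deviation, so each property contributes on a comparable scale.
mean = candidates.mean()
std = candidates.std(ddof=0)
z_candidates = (candidates - mean) / std
z_reference = (reference - mean) / std
scaled = cdist(
    z_candidates.to_numpy(dtype=float),
    z_reference.to_numpy(dtype=float),
    metric="euclidean",
).ravel()

# Collect both distances per candidate and pick the smallest of each.
result = pd.DataFrame({"raw": raw, "scaled": scaled}, index=candidates.index)
raw_best = result["raw"].idxmin()
scaled_best = result["scaled"].idxmin()

print(result.sort_values("scaled"))
print("closest (raw):", raw_best, "| closest (scaled):", scaled_best)
```

If some properties matter more than others for your purpose, you could multiply the z-scored columns by weights before calling cdist, or drop columns you do not want to influence the ranking.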