ACN58946.1|FEATURES|NCBI|macrolide-lincosamide-streptogramin|macB|antibiotic target protection|0\sMAEPVLSVKDLDIRFTTPDGNVHAVKKVSFDIAPGECLGVVGESGSGKSQLFMACIGLLAGNGKATGSVTYRGQELLGQPAAKLNAIRGAKITMIFQDPLTSLTPHMRIGDQIVESLRTHSKLSKGEAEKRAIQALELVRIPEAKRRMRQYPHELSGGMRQRVMIAMATACGPDLLIADEPTTALDVTVQAQILDIMRDLRKELGTSIALISHDMGVIASICDRVQVMRYGEFVETGPADDIFYHPQHPYTRMLLEAMPRIDQPVREGRAALKPLAPQEARTTLLEVDNVKVHFPIQMGGVFFGKYKPLRAVDGVSFTLHQGETIGIVGESGCGKSTLARAVLELLPKTTGGVVWMGRDLGALPPAELRRARKDFQIVFQDPLASLDPRMTIGQSIAEPLQSLEPELSKHEVQSRVRAIMEKVGLDPDWINRYPHEFSGGQNQRVGIARAMILKPKLIVCDEAVSALDVSIQAQIVDLILSLQAEFGMSIIFISHDLSVVRQVSHRVMVLYLGRVVELASRDAIYEDARHPYTKALISAVTVPDPRAERLKKRRELPGELPSPLDTRSALMFLKSKRIDDPDAEQYVPKLIEVAPGHFVAEHDPFEVVEMTG\e>ACN58871.1|FEATURES|NCBI|macrolide-lincosamide-streptogramin|macB|antibiotic efflux|0\sMADYLLEMKNIVKEFGGVRALNGIDIKLKAGECAGLCGENGAGKSTLMKVLSAVYPHGTWDGEILWDGKPLRAQSIRETEAAGIVIIHQELMLVPELSVAENIFLGNEIKLPGGRMDYAAMNRRAEELLAELDIRDVNVVLPVKQYGGGYQQLIEIAKALNKNARLLILDEPSSSLTASEIKVLLRIIHSLKAKGVTCVYISHKLDEVADICDTIVVIRDGQHIATTPMADMNIERIIAQMVGREMNQLYPERSHVPGEVIFEARNVSCYDADNPQRKRVDNISFKLRKGEILGIAGLVGAGRTELVSALFGAYPGPSEAEVWLNGVKLDTRTPLKAIRAGLAMVPEDRKQHGIVPDLGVGHNMTLAVLNDFVRATRIDQQAELATIHKEIKSVKLKTATPFLPITSLSGGNQQKAVLSKMLLTKPKILILDEPTRGVDVGAKFEIYQLMFDLAAQGMSIIMVSSELAEVLGISDRVLVVGEGKLRGDFVNDNLSQETVLAAALDHTQPALH\e>ACN58991.1|FEATURES|NCBI|multidrug|cmeB|antibiotic efflux|0\sMKNDRGEMVPFSAFMTIKKKQGANEINRYNMYNTAAIRGGPATGYSSGEAIKAVQEVAAKNLPNGFDIDWAALSYDETRRGNEAVYIFLIVLAFVYLVLAAQYESFIIPLAVVFSLPAGVFGSFLLIKGMGLANDIYAQVGLVMLVGLLGKNAVLIVEFAVQKQQQGATVFEAAIEGARVRFRPILMTSFAFIAGLIPLVFAHGAGAIGNKTIGSSALGGMFFGTVFGVIVVPGLYYVFGSWAEGRKLIRGEDHDPLTENLVHQMDNFPQSDDK\e>ACN58776.1|FEATURES|NCBI|macrolide-lincosamide-streptogramin|macA|antibiotic 
efflux|0\sMGNLPRPTLSPSLSGIRPTMNRETTTRVDSSTPAARLGMRVPSTSRAALVGVAALVVILGGWYGIKRWRAHVASEGQYIFAAIQKGDIEDLVTATGSLQPRDYVDVGAQVSGQLDKILVEVGSDVKEGDLLAEIDADVAAARVDASRAQLRSQQAQLVQQQANLTKAERDLTRQQNLMKEDATTAEQVQNAETTLDTTKAQINALKAQMEQLRASMRVDESNLNYTKILAPMSGTVVSISAKQGQTLNTNQQAPTILRIADLSTMTVQTQVSEADVSKLRSGMQAYFTTLGSAGKRWYGQLKKIEPTPTVTNNVVLYNALFEVPNDNKQLLPQMTAQVFFVAAAAHDVLVVPMSAVSLQRTPPGGIPNAAAAQAAGARGAGAQGAGAQGAQGASAQGAGAQSGQGGQGAAALTPEQIARREARRQQRMQSNGGSATGGAIEGGPPRGGFGASMAARGPRHATVRVQAADGKIEERQITIGVTNRVHAEVLSGLKEGERVVAGTKEPEKAPATAGGQQGAGGQRNNIGGFPGGGLGGGFGR\e
I am working with protein sequences right now. I have this string data, and I want to one-hot encode parts of it. The parts I want to convert start after '\s' and end before '\e', and I want to do this for the whole string. Since I have thousands of datasets, pure Python code seems impractical because it would take a long time to finish. Is there a machine learning library for this problem?
Thank you for your help in advance!
CodePudding user response:
Once the records are loaded into a DataFrame, you can use sklearn.preprocessing.OneHotEncoder or pandas.get_dummies to encode your protein sequences. The pandas version looks like this:
df = pd.get_dummies(data=df, columns=[6])
This example assumes your data is available in a variable. If it is in a text file, the io methods are not needed and should be replaced by loading the text file. Also be aware that "\e>" should be replaced by a newline character, e.g. "\r" or "\n", to get the samples separated.
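If what you need is a per-residue encoding (one row per amino acid rather than one category per whole sequence), a minimal NumPy sketch could look like the following. It is only an illustration under a few assumptions: that '\s' and '\e' appear in your data as literal backslash-letter pairs, that the alphabet is the 20 standard amino acids, and that any other letter should become an all-zero row. The names AMINO_ACIDS, extract_sequences and one_hot are made up for this example.

```python
import re
import numpy as np

# Assumed alphabet: the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Lookup table from ASCII code to alphabet index; -1 marks unknown letters.
_LOOKUP = np.full(256, -1, dtype=np.int64)
for i, aa in enumerate(AMINO_ACIDS):
    _LOOKUP[ord(aa)] = i


def extract_sequences(raw):
    """Return every non-empty substring between a literal '\\s' and '\\e'."""
    matches = re.findall(r"\\s(.+?)\\e", raw, flags=re.DOTALL)
    # Drop any whitespace (e.g. line wraps) that ended up inside a sequence.
    return ["".join(m.split()) for m in matches]


def one_hot(seq):
    """Encode one sequence as a (len(seq), 20) array of 0/1 values."""
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    idx = _LOOKUP[codes]
    out = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.uint8)
    known = idx >= 0
    # Set a single 1 per known residue; unknown residues stay all-zero.
    out[np.nonzero(known)[0], idx[known]] = 1
    return out


if __name__ == "__main__":
    raw = r"id1|a|0\sMAE\e>id2|b|0\sACX\e"
    for seq in extract_sequences(raw):
        print(seq, one_hot(seq).shape)
```

Because the encoding is done by array indexing instead of a Python loop over residues, it should stay reasonably fast for thousands of sequences. Note that the sequences have different lengths, so the resulting arrays have different shapes; if your model needs a fixed size you would still have to pad or truncate them.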