1. About This Work

Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, "non-classical" secreted proteins are difficult to identify as they lack discernable signal peptide sequences and can make use of diverse secretion pathways. Currently, several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of "non-classical" secreted proteins from sequence data.

In this work, we first constructed a high-quality dataset of experimentally verified "non-classical" secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer LightGBM ensemble model that integrates several single-feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, was used as the machine learning approach, and the necessary parameter optimization was performed by a particle swarm optimization (PSO) strategy. All single-feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. The final ensemble model achieved an Accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803, and an area under the curve value of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors.
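The threshold-based metrics quoted above (Accuracy, F-value, and Matthews correlation coefficient) can all be derived from a binary confusion matrix. The following is a minimal sketch using their standard definitions; the function name and the example counts are hypothetical and are not results from this work. The area under the curve is not included because it requires the ranked prediction scores rather than counts.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, f_value, mcc) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-value: harmonic mean of precision and recall.
    f_value = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f_value, mcc

# Hypothetical counts for illustration only (not the paper's results).
acc, f1, mcc = binary_metrics(tp=8, fp=2, tn=8, fn=2)
```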

2. Architecture

The overall workflow of the PeNGaRoo methodology is illustrated in the following figure. There exist five major stages in the development of PeNGaRoo: (1) Data collection and preprocessing, (2) Feature extraction, (3) Parameterization and ensemble model construction, (4) Performance assessment, and (5) Web server development.

1. Training Dataset

The construction of a high-quality benchmark dataset for training and validating the prediction model is a prerequisite of successful machine learning approaches. In this study, we used experimentally validated non-classically secreted proteins of Gram-positive bacteria, which were obtained from a recent work (Wang, et al., 2016). Specifically, 253 non-classically secreted protein sequences were extracted from the literature, which have been identified by at least three different research groups in at least three different bacterial species (Wang, et al., 2016) (Table S1). For the negative training set, we chose the entire set of 1,084 proteins (Firmicutes, annotated to be localized in the cytoplasm) in the work of Bendtsen et al. (Bendtsen, et al., 2005). Subsequently, CD-HIT (Huang, et al., 2010) was applied to the initial dataset to remove any redundancy at the cutoff threshold of 80% sequence identity to avoid any potential bias. We obtained 157 positive samples (Table S1) and 446 negative samples. In view of the scarcity of the positive samples, we partitioned the dataset into training and independent test datasets by adopting the following procedure: nine tenths of the above positive samples were used as the training dataset, while the remaining one tenth were used as the independent dataset. As a result, the training dataset included 141 positive and 446 negative sequences.
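The nine-tenths partition of the positive samples can be sketched as follows. This is only an illustration: the text does not state how the held-out tenth was drawn, so the seeded random shuffle, the rounding down, and the placeholder identifiers are all assumptions. Rounding down is at least consistent with the 141 training positives reported above.

```python
import random

def split_positives(positives, train_fraction=0.9, seed=0):
    """Shuffle positive samples and split them into training and held-out subsets."""
    shuffled = list(positives)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder identifiers standing in for the 157 non-redundant positive sequences.
positives = [f"pos_{i}" for i in range(157)]
train_pos, held_out_pos = split_positives(positives)
```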

2. Independent Test Datasets

In order to objectively evaluate the predictive performance of the proposed method, we further constructed an independent test dataset. For the positive samples, we included experimentally validated non-classical secreted proteins collected by previous studies and ours. For the negative samples, we collected proteins from UniProt (UniProt, 2015) by extracting those entries whose annotations contained the key words "cytoplasm" or "cytoplasmic" but did not have any annotations of "secreted". After removing the overlapping sequences in the training dataset, we finally obtained 34 positive samples and 34 negative samples as the independent test dataset.
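The keyword-based selection of negative samples described above can be sketched as a simple filter. This is a hedged illustration, not the authors' pipeline: the tuple layout, the function names, the plain substring matching, and the example entries are all hypothetical.

```python
def is_cytoplasmic_negative(annotation):
    """True if the annotation mentions cytoplasm/cytoplasmic but not secreted."""
    text = annotation.lower()
    has_keyword = any(key in text for key in ("cytoplasm", "cytoplasmic"))
    return has_keyword and "secreted" not in text

def select_negatives(entries, training_sequences):
    """Keep accessions that pass the keyword filter and do not overlap the training set.

    entries: iterable of (accession, annotation, sequence) tuples (assumed layout).
    """
    seen = set(training_sequences)
    return [accession for accession, annotation, sequence in entries
            if is_cytoplasmic_negative(annotation) and sequence not in seen]

# Made-up entries for illustration only.
entries = [
    ("P1", "Subcellular location: Cytoplasm", "MKV"),
    ("P2", "Cytoplasm; Secreted", "MAA"),
    ("P3", "Cell membrane", "MLL"),
    ("P4", "cytoplasmic protein", "MTT"),
]
negatives = select_negatives(entries, training_sequences={"MTT"})
```

In this example P2 is dropped because it carries a "secreted" annotation, P3 because it has no cytoplasm keyword, and P4 because its sequence already appears in the training set.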

1. PeNGaRoo

To maximize convenience for biomedical scientists and biotechnologists, a publicly accessible web server has been established so that the wider research community can predict novel putative non-classical secreted proteins in Gram-positive bacteria. PeNGaRoo is a user-friendly and effective platform hosted by the Monash University cloud computing facility, freely accessible at http://pengaroo.erc.monash.edu/.

1.1 Important settings of the PeNGaRoo server

- The PSSM-based features generated by PeNGaRoo are derived from PSSM profiles, which are obtained by searching each target protein sequence against the uniref50 database;

- The ensemble model currently used in the PeNGaRoo server is trained on the training dataset described in the accompanying paper. It will be updated by re-training the prediction model with all experimentally validated non-classical secreted proteins as the positive dataset. This update will be repeated on a regular basis as new experimentally validated non-classical secreted proteins are retrieved.

2. Using the PeNGaRoo web server

PeNGaRoo is an online server with a user-friendly interface. Users can submit query sequence data in either of two ways: (i) fill in the sequence input form, or (ii) upload a query sequence file in raw or FASTA format. After job submission, the web server provides a unique URL at which the results can be viewed. Users may also supply an email address, to which the URL for retrieving the prediction output will be sent upon completion of the submitted task.

2.1 Input Formats

Two types of input are accepted by PeNGaRoo: sequences in FASTA format (strongly recommended) and raw sequences.

In the case of input sequences in the FASTA format, you can prepare and input them as follows:

>sp|Q81LS2_p|DNAK_BACAN Chaperone protein DnaK OS=Bacillus anthracis GN=dnaK PE=3 SV=1
MSKIIGIDLGTTNSCVAVMEGGEPKVIPNPEGNRTTPSVVAFKNEERQVGEVAKRQAITNPNTIMSVKRHMGTDYKVEVEGKDYTPQEISAIILQNLKASAEAYLGETVTKAVITVPAYFNDAERQATKDAGRIAGLEVERIINEPTAAALAYGLEKQDEEQKILVYDLGGGTFDVSILELADGTFEVISTAGDNRLGGDDFDQVIIDHLVAEFKKENNIDLSQDKMALQRLKDAAEKAKKDLSGVTQTQISLPFISAGAAGPLHLELTLTRAKFEELSAGLVERTLEPTRRALKDAGFAPSELDKVILVGGSTRIPAVQEAIKRETGKEPYKGVNPDEVVALGAAVQGGVLTGDVEGVLLLDVTPLSLGIETMGGVFTKLIERNTTIPTSKSQVFSTAADNQPAVDIHVLQGERPMSADNKTLGRFQLTDLPPAPRGIPQIEVTFDIDANGIVNVRAKDLGTSKEQAITIQSSSGLSDEEVERMVQEAEANADADQKRKEEVELRNEAD
>sp|P9WNK5|ESXB_MYCTU ESAT-6-like protein EsxB OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) OX=83332 GN=esxB PE=1 SV=1
MAEMKTDAATLAQEAGNFERISGDLKTQIDQVESTAGSLQGQWRGAAGTAAQAAVVRFQEAANKQKQELDEISTNIRQAGVQYSRADEEQQQALSSQMGF
>sp|P50846
MESKVVENRLKEAKLIAVIRSKDKQEACQQIESLLDKGIRAVEVTYTTPGASDIIESFRNREDILIGAGTVISAQQAGEAAKAGAQFIVSPGFSADLAEHLSFVKTHYIPGVLTPSEIMEALTFGFTTLKLFPSGVFGIPFMKNLAGPFPQVTFIPTGGIHPSEVPDWLRAGAGAVGVGSQLGSCSKEDLQAVFQV
>sp|O53078
MNNERLRRTMMFVPGNNPAMVKDAGIFGADSIMFDLEDAVSLAEKDSARYLVYEALQTVDYGSSELVVRINGLDTPFYKNDIKAMVKAGIDVIRLPKVETAAMMHELESLITDAEKEFGRPVGTTHMMAAIESALGVVNAVEIANASDRMIGIALSAEDYTTDMKTHRYPDGQELLYARNVILHAARAAGIAAFDTVFTNLNDEEGFYRETQLIHQLGFDGKSLINPRQIEMVNKVYAPTEKEINNAQNVIAAIEEAKQKGSGVISMNGQMVDRPVVLRAQRVMKLANANHLVDSEGNYIEK

In addition, the following input sequences, which are in the original format downloadable from the UniProt database:

>gi|16421415|gb|AAL21748.1| putative cytoplasmic protein [Salmonella enterica subsp. enterica serovar Typhimurium str. LT2] GN=OrgC |1|validated|14573697|
MIPGTIPTSYLVPTADTEATGVVSLSARAAMLNNMDSAPLSNGGDVDLYDAFYQRLLALPESASSETLKDSIYQEMNA
FKDPNSGDSAFVSFEQQTAMLQNMLAKVEPGTHLYEALNGVLVGSMNAQSQMTSWMQEIILSGGENKEAIDW
>tr|O30783|O30783_CHLCIInclusion membrane protein C OS=Chlamydia caviae GN=IncC PE=4 SV=1|19390696|
MTSVRTDLTPGDTSLQSSLLNPSDLTTQLSNLQTVLAGIQQQHPLNGGWPQHHPTGAADQNYLMRLMQSHMAS
TVSAVSELRTEVTAIKTKLHGLSTPANVCSGPMALAAFLLAISLVAIIIIVLASLGLAGILPQAAAILVNTANSIWA
IVSASIVTVICLISVLCITLIRHHKPLPIETRPTGH
>gi|56416452|ref|YP_153526.1| ribonuclease H [Anaplasma marginale str. St. Maries]|-1
MSLYYVRYWNTIKNDGRMVLMGKSRVAIYTDGACSGNPGPGGWGAVLRFGDGGERRISGGSDDTTNN
RMELTAVIMALAALSGPCSVCVNTDSTYVKNGITEWIRKWKLNGWRTSNKSAVKNVDLWVELERLTLLHSIEWRWVKAH
AGNEYNEEADMLARGEVERRMVIPK
>sp|P37033|Y1689_LEGPHUncharacterized protein lpg1689 OS=Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513) GN=lpg1689 PE=4 SV=1|24064423|
MYHYLFSCHKSQESIDGLIEQVKQLLNHVEMEQKAYFLNLLTARVAEFQNELKSEASNTINK
QQILIQYEKFAKTLLICIKQPERTSYAIHNYHKGFYYPVAIHDKIKPDPTIENAAIATLGVSLAL
LLGSIPTFIFNPLFGVIMVSLAVTLLLPSGFYLLIPDSPDTTSKKEEEKRIFMEGAKIINPDVRIEEFDEQPYLSSSLIKT

will be reformatted (to remove the line breaks within each sequence) as follows:

>gi|16421415|gb|AAL21748.1| putative cytoplasmic protein [Salmonella enterica subsp. enterica serovar Typhimurium str. LT2] GN=OrgC |1|validated|14573697|
MIPGTIPTSYLVPTADTEATGVVSLSARAAMLNNMDSAPLSNGGDVDLYDAFYQRLLALPESASSETLKDSIYQEMNAFKDPNSGDSAFVSFEQQTAMLQNMLAKVEPGTHLYEALNGVLVGSMNAQSQMTSWMQEIILSGGENKEAIDW
>tr|O30783|O30783_CHLCIInclusion membrane protein C OS=Chlamydia caviae GN=IncC PE=4 SV=1|19390696|
MTSVRTDLTPGDTSLQSSLLNPSDLTTQLSNLQTVLAGIQQQHPLNGGWPQHHPTGAADQNYLMRLMQSHMASTVSAVSELRTEVTAIKTKLHGLSTPANVCSGPMALAAFLLAISLVAIIIIVLASLGLAGILPQAAAILVNTANSIWAIVSASIVTVICLISVLCITLIRHHKPLPIETRPTGH
>gi|56416452|ref|YP_153526.1| ribonuclease H [Anaplasma marginale str. St. Maries]|-1
MSLYYVRYWNTIKNDGRMVLMGKSRVAIYTDGACSGNPGPGGWGAVLRFGDGGERRISGGSDDTTNNRMELTAVIMALAALSGPCSVCVNTDSTYVKNGITEWIRKWKLNGWRTSNKSAVKNVDLWVELERLTLLHSIEWRWVKAHAGNEYNEEADMLARGEVERRMVIPK
>sp|P37033|Y1689_LEGPHUncharacterized protein lpg1689 OS=Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513) GN=lpg1689 PE=4 SV=1|24064423|
MYHYLFSCHKSQESIDGLIEQVKQLLNHVEMEQKAYFLNLLTARVAEFQNELKSEASNTINKQQILIQYEKFAKTLLICIKQPERTSYAIHNYHKGFYYPVAIHDKIKPDPTIENAAIATLGVSLALLLGSIPTFIFNPLFGVIMVSLAVTLLLPSGFYLLIPDSPDTTSKKEEEKRIFMEGAKIINPDVRIEEFDEQPYLSSSLIKT
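
The line-break removal shown above can be sketched in a few lines of Python. This is an illustrative sketch, not the server's actual code: the function name is hypothetical, and it assumes that every record starts with a ">" header line and that any sequence text before the first header can be ignored.

```python
def unwrap_fasta(text):
    """Join wrapped sequence lines so each record is a header plus one sequence line."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence fragment belonging to the current header
    if header is not None:
        records.append((header, "".join(chunks)))
    return "\n".join(f"{h}\n{s}" for h, s in records)

# Short made-up records for illustration.
wrapped = ">seqA\nMKVL\nAAGT\n>seqB\nMLL\n"
unwrapped = unwrap_fasta(wrapped)
```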

In the case of raw sequences, you can input them as follows:

MIPGTIPTSYLVPTADTEATGVVSLSARAAMLNNMDSAPLSNGGDVDLYDAFYQRLLALPESASSETLKDSIYQEMNAFKDPNSGDSAFVSFEQQTAMLQNMLAKVEPGTHLYEALNGVLVGSMNAQSQMTSWMQEIILSGGENKEAIDW
MTSVRTDLTPGDTSLQSSLLNPSDLTTQLSNLQTVLAGIQQQHPLNGGWPQHHPTGAADQNYLMRLMQSHMASTVSAVSELRTEVTAIKTKLHGLSTPANVCSGPMALAAFLLAISLVAIIIIVLASLGLAGILPQAAAILVNTANSIWAIVSASIVTVICLISVLCITLIRHHKPLPIETRPTGH
MSLYYVRYWNTIKNDGRMVLMGKSRVAIYTDGACSGNPGPGGWGAVLRFGDGGERRISGGSDDTTNNRMELTAVIMALAALSGPCSVCVNTDSTYVKNGITEWIRKWKLNGWRTSNKSAVKNVDLWVELERLTLLHSIEWRWVKAHAGNEYNEEADMLARGEVERRMVIPK
MYHYLFSCHKSQESIDGLIEQVKQLLNHVEMEQKAYFLNLLTARVAEFQNELKSEASNTINKQQILIQYEKFAKTLLICIKQPERTSYAIHNYHKGFYYPVAIHDKIKPDPTIENAAIATLGVSLALLLGSIPTFIFNPLFGVIMVSLAVTLLLPSGFYLLIPDSPDTTSKKEEEKRIFMEGAKIINPDVRIEEFDEQPYLSSSLIKT

which will be formatted by PeNGaRoo as follows:

>input1
MIPGTIPTSYLVPTADTEATGVVSLSARAAMLNNMDSAPLSNGGDVDLYDAFYQRLLALPESASSETLKDSIYQEMNAFKDPNSGDSAFVSFEQQTAMLQNMLAKVEPGTHLYEALNGVLVGSMNAQSQMTSWMQEIILSGGENKEAIDW
>input2
MTSVRTDLTPGDTSLQSSLLNPSDLTTQLSNLQTVLAGIQQQHPLNGGWPQHHPTGAADQNYLMRLMQSHMASTVSAVSELRTEVTAIKTKLHGLSTPANVCSGPMALAAFLLAISLVAIIIIVLASLGLAGILPQAAAILVNTANSIWAIVSASIVTVICLISVLCITLIRHHKPLPIETRPTGH
>input3
MSLYYVRYWNTIKNDGRMVLMGKSRVAIYTDGACSGNPGPGGWGAVLRFGDGGERRISGGSDDTTNNRMELTAVIMALAALSGPCSVCVNTDSTYVKNGITEWIRKWKLNGWRTSNKSAVKNVDLWVELERLTLLHSIEWRWVKAHAGNEYNEEADMLARGEVERRMVIPK
>input4
MYHYLFSCHKSQESIDGLIEQVKQLLNHVEMEQKAYFLNLLTARVAEFQNELKSEASNTINKQQILIQYEKFAKTLLICIKQPERTSYAIHNYHKGFYYPVAIHDKIKPDPTIENAAIATLGVSLALLLGSIPTFIFNPLFGVIMVSLAVTLLLPSGFYLLIPDSPDTTSKKEEEKRIFMEGAKIINPDVRIEEFDEQPYLSSSLIKT
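
The conversion of raw sequences into numbered FASTA records can be sketched as below. The sketch assumes one sequence per non-empty line, as in the example above; the function name and the configurable prefix are hypothetical and do not describe the server's internal implementation.

```python
def raw_to_fasta(text, prefix="input"):
    """Treat every non-empty line as one sequence and number the headers from 1."""
    sequences = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(f">{prefix}{i}\n{seq}"
                     for i, seq in enumerate(sequences, start=1))

# Short made-up sequences for illustration.
fasta_text = raw_to_fasta("MKVLAAGT\nMLLSTP\n")
```
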

2.2 Input sequence limits

- The length of each submitted sequence should be between 50 and 5000 residues.

- Because prediction is time-consuming (especially PSSM profile generation), the PeNGaRoo server currently accepts no more than 500 sequences per submission.
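
Before submitting a large batch, the limits above can be checked locally. The following is a minimal sketch: the constant and function names are hypothetical, and treating the length bounds as inclusive is an assumption, since the text does not say whether 50 and 5000 themselves are accepted.

```python
MIN_LENGTH = 50       # shortest accepted sequence (assumed inclusive)
MAX_LENGTH = 5000     # longest accepted sequence (assumed inclusive)
MAX_SEQUENCES = 500   # maximum number of sequences per submission

def check_submission(sequences):
    """Return a list of problems; an empty list means the batch is within the limits."""
    problems = []
    if len(sequences) > MAX_SEQUENCES:
        problems.append(f"too many sequences: {len(sequences)} > {MAX_SEQUENCES}")
    for index, seq in enumerate(sequences, start=1):
        if not MIN_LENGTH <= len(seq) <= MAX_LENGTH:
            problems.append(
                f"sequence {index} has length {len(seq)}, "
                f"outside {MIN_LENGTH}-{MAX_LENGTH}"
            )
    return problems

# Made-up sequences: one of valid length, one too short.
issues = check_submission(["M" * 60, "M" * 10])
```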

3. PeNGaRoo Prediction Result Instructions

PeNGaRoo contains a built-in list of known non-classical secreted proteins, which is continuously updated to keep pace with newly validated experimental findings. After a job is processed, this list is used to annotate the prediction results so that known non-classical secreted proteins can be distinguished from computationally predicted ones.

For a computationally predicted non-classical secreted protein, the result is marked as Pred, and the detailed prediction results (including those from the single method-based models and from the final ensemble model) are also presented to users.

For a known non-classical secreted protein, the result is marked as Exp (an example is provided in the following figure).
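
The Exp/Pred annotation can be pictured as a lookup against the built-in list. The sketch below is only an illustration of that idea: matching by exact sequence identity, the function name, and the example data are assumptions, and the server may match entries differently.

```python
def label_result(sequence, known_sequences):
    """Return "Exp" for a sequence found in the known list, otherwise "Pred"."""
    return "Exp" if sequence in known_sequences else "Pred"

# Made-up known list and queries for illustration.
known = {"MKVLAAGT"}
labels = [label_result(seq, known) for seq in ("MKVLAAGT", "MLLSTP")]
```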