How many clusters yang




















Figure 3. (B) Results of the final cluster number C on 6 benchmark data sets, with Q ranging from 15 to …

The aim of ensemble clustering is to improve the stability, robustness, and accuracy of the final results by integrating multiple clustering results.

A limitation of many ensemble clustering methods is that the final clustering result depends on the choice of base clustering methods. To validate the robustness of EC-PGMGR, we run comparative experiments from two aspects. First, we consider various numbers of base clustering methods.

As depicted in Figure 4A, the clustering performance does not change much as the number of base clustering methods increases.

For example, on Baron-mouse1, we include the best base clustering method in each combination used for ensemble clustering with EC-PGMGR, and the result stays stable as the number of base methods increases. Secondly, because the results of SC3 are convenient to adjust by setting different parameters, we run experiments on all data sets with the initial cluster number k of SC3 set from 15 to 20 in steps of 1 for Baron-human1, Baron-human2, Baron-human3, and Baron-human4, and from 10 to 15 in steps of 1 for Baron-mouse1 and Baron-mouse2, to observe whether the differences generated by SC3 affect the results of EC-PGMGR.

The results on the other data sets are shown in Supplementary Figure 3. The algorithm balances the base clustering results and the original data through graph Laplacian regularization, which keeps it robust. The final results validate the effectiveness of the regularization process.

Figure 4. (A) The performance of different combinations of base clusterings on Baron-mouse1. Different k will influence the results of SC3.

In addition, we apply the different methods to a large PBMC data set. The comparison results are presented in Figure 7. We further count the clusters produced by the different clustering methods on the PBMC data set, as shown in Table 2. The SC3 method does not perform well here. With the default parameters, SC3 divides the cells into clusters and this influences the performance of SAFE. With these settings, we find that even when the base clustering results are poor, our method can still achieve good performance.
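The comparisons in this section are scored with ARI and NMI. As a reference for how the first of these is computed, the following is a minimal sketch of the standard adjusted Rand index in plain Python. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions and are not taken from the EC-PGMGR implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two cells: the two partitions trivially agree.
        return 1.0

    # Contingency counts: cells sharing each (true, predicted) label pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every cell in one cluster.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Toy example (hypothetical labels): the same partition with renamed clusters.
reference = [0, 0, 1, 1]
renamed = [1, 1, 0, 0]
print(adjusted_rand_index(reference, renamed))
```

The index depends only on which cells are grouped together, so renaming cluster labels does not change the score, which is why it suits comparisons between methods that number their clusters differently.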

Figure 5. Performance of different methods on different data sets in terms of ARI.

Figure 6. Performance of different methods on different data sets in terms of NMI.

Figure 7. The performance of different methods on the PBMC data set. (A) The ARI performance. (B) The NMI performance.

To evaluate biological significance, we perform correlation analysis among marker genes of the cells.

In terms of visual quality, the UMAP algorithm is competitive with t-SNE, while retaining more global structure and offering superior run-time performance and better scalability. Figure 8 shows the visualization of the three ensemble methods and the true labels. We can see that our method achieves a better result than the other two methods on Baron-human1. All three ensemble methods achieve good clustering performance with respect to the true labels.

In addition, Figure 9 shows the heat map of the 50 genes with the highest standard deviation in the Baron-human1 experimental results. Clearly highly expressed genes can be seen in the clusters we obtained. The results illustrate that our method achieves good clustering performance. The other experiments are listed in Supplementary Figures 4 and 5, respectively.

Figure 8. The visualization of the performance of different methods on Baron-human1.

(A) True label. (C) SAFE. (D) SAME.

Figure 9. The heat map shows the 50 genes with the highest standard deviation in the Baron-human1 experimental results.

Each row represents a gene and each column represents a cell.

Unlike conventional ensemble clustering algorithms that treat each base clustering result equally, EC-PGMGR is a weighted ensemble clustering algorithm that automatically assigns weights to the different base clustering results through a pre-learning process. Therefore, base clustering results that obtain higher weights may be more reliable and can be regarded as active clustering methods.

In contrast, base clustering results with lower weights may be less reliable and may be far from the real case. Our EC-PGMGR method can integrate different kinds of single-cell clustering results and obtain an optimal consensus clustering result.
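To illustrate the general idea of weighting base clustering results, the sketch below builds a weighted co-association matrix, a common building block in ensemble clustering. It is not the EC-PGMGR pre-learning process: the weights here are supplied by hand, and the function name `weighted_coassociation` and the toy labelings are assumptions made for illustration.

```python
def weighted_coassociation(base_labelings, weights):
    """Weighted co-association matrix for several base clusterings.

    Entry [i][j] is the weight-normalized share of base clusterings that
    place cells i and j in the same cluster.
    """
    if not base_labelings:
        raise ValueError("need at least one base clustering")
    if len(base_labelings) != len(weights):
        raise ValueError("need exactly one weight per base clustering")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(base_labelings[0])
    if any(len(labels) != n for labels in base_labelings):
        raise ValueError("all base clusterings must label the same cells")

    matrix = [[0.0] * n for _ in range(n)]
    for labels, weight in zip(base_labelings, weights):
        share = weight / total
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += share
    return matrix


# Hypothetical example: two base clusterings of four cells, the first one
# trusted three times as much as the second (weights chosen by hand).
base = [[0, 0, 1, 1], [0, 1, 1, 1]]
consensus = weighted_coassociation(base, weights=[3.0, 1.0])
for row in consensus:
    print(row)
```

A base clustering with a small weight contributes little to each entry, which mirrors the intuition that less reliable base results should have less influence on the consensus.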

Considering that many single-cell clustering algorithms require the number of clusters as input, EC-PGMGR can effectively and adaptively optimize the number of clusters, which is more reasonable for practical scientific research. To avoid an undesirable ensemble result caused by the base clustering results, graph Laplacian regularization is used in EC-PGMGR to preserve the information in the original data. It balances the base results against the original information and reduces the effect of inactive base clustering results.

We conduct experiments on seven single-cell data sets that differ in size, species, and platform. The ARI and NMI show that our method outperforms the other comparative methods, including individual and ensemble clustering methods, on different data sets. We find that some of the experimental results on Baron-mouse2 are consistently unsatisfying. Because the data we use are inDrop data, there are many zeros in the expression matrix for technical reasons.

Although some of the zeros reflect the true expression of cells, others do not reflect the real expression level (van Dijk David et al.). Zero-inflated data will influence the final clustering results because the data are partly inaccurate. We calculate the ratio of zero values in each data set to find out whether it is related to the poor performance.

The results show that the ratio of zeros in Baron-human4 and Baron-mouse2 is higher than in the others (Baron-human1: 0. …). This may explain why the performance on Baron-mouse2 is not very good.
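The zero ratio discussed here is simply the fraction of entries in the expression matrix that equal zero. A minimal sketch, assuming the matrix is available as a list of rows; the function name `zero_ratio` and the toy matrix are hypothetical:

```python
def zero_ratio(expression_matrix):
    """Fraction of entries in an expression matrix that are exactly zero."""
    total = sum(len(row) for row in expression_matrix)
    if total == 0:
        raise ValueError("expression matrix is empty")
    zeros = sum(1 for row in expression_matrix for value in row if value == 0)
    return zeros / total


# Toy matrix with made-up counts (rows and columns are arbitrary here).
toy_counts = [
    [0, 2, 0],
    [1, 0, 0],
]
print(zero_ratio(toy_counts))
```

Computing this value per data set gives a quick way to compare sparsity before interpreting differences in clustering performance.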

Too many missing values in the original data will influence the base clustering results and the graph regularization term. Further research could integrate more single-cell clustering methods, perform preliminary screening of the base clustering methods, and then perform the integrated analysis. Missing scRNA-seq data should be imputed first so that the downstream analysis is more accurate and reasonable.

In addition, we estimate the overall time cost of the updating process in Equations 7 and 8. The time cost for updating H is $O(n^2 Q)$, where n is the number of cells and Q is the number of initial clusters. Since H is sparse, the real time cost is much smaller than $O(n^2 Q T)$.

In addition, before running our ensemble algorithm, we need to compute the Laplacian matrix L, which is time consuming.
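To show where this cost comes from, the sketch below builds an unnormalized graph Laplacian L = D - W from a symmetric k-nearest-neighbor graph. The graph construction is an assumption made for illustration (binary edge weights, Euclidean distance, and the hypothetical function name `knn_graph_laplacian`); the source does not state how EC-PGMGR builds its graph.

```python
from math import dist


def knn_graph_laplacian(points, k):
    """Unnormalized Laplacian L = D - W of a symmetric k-NN graph.

    Assumes binary edge weights and Euclidean distance between points.
    """
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")

    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Every point is compared with every other point, so the cost
        # grows at least quadratically with the number of cells.
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )[:k]
        for j in neighbors:
            # Symmetrize: connect i and j if either is a neighbor of the other.
            weights[i][j] = 1.0
            weights[j][i] = 1.0

    laplacian = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        laplacian[i][i] = sum(weights[i])  # node degree on the diagonal
    return laplacian


# Toy example: two well-separated pairs of points (made-up coordinates).
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
for row in knn_graph_laplacian(toy_points, k=1):
    print(row)
```

The all-pairs distance step dominates for large data sets, which is one reason approximate nearest-neighbor search is often used in practice.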

This can be improved with computational techniques in future studies. As an ensemble clustering algorithm, our model is flexible. It would be of great interest to use this model for other clustering-based tasks, such as exploring modules in gene regulatory networks and cell signaling networks.

YZ and D-XZ conceived and designed the work and wrote the original manuscript.

D-XZ carried out computer implementation and data analysis. MY supervised the project. All authors contributed to the article and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Baron, M. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst.

Deng, C. Graph regularized nonnegative matrix factorization for data representation.



