A Two-Phase Recall-and-Select Framework for Fast Model Selection

Jianwei Cui, Wenhang Shi, Honglin Tao, Wei Lu, Xiaoyong Du*
*Corresponding author
Renmin University of China, Beijing, China
{cuijianwei, wenhangshi, honglintao, lu-wei, duyong}@ruc.edu.cn
Abstract

As deep learning becomes ubiquitous across machine learning applications, a proliferation of neural network models has been trained and shared on public model repositories. For a given machine learning task, starting from a well-suited source model typically outperforms training from scratch, particularly when training data is limited. Although numerous model selection strategies have been investigated in prior work, the process remains time-consuming, especially given the ever-increasing scale of model repositories. In this paper, we propose a two-phase (coarse-recall and fine-selection) model selection framework, aiming to enhance the efficiency of selecting a robust model by leveraging the models' training performances on benchmark datasets. Specifically, the coarse-recall phase clusters models exhibiting similar training performances on benchmark datasets in an offline manner. A lightweight proxy score is then computed between these model clusters and the target dataset, which serves to quickly recall a significantly smaller subset of candidate models. In the subsequent fine-selection phase, the final model is chosen by fine-tuning the recalled models on the target dataset with successive halving. To accelerate this process, the final fine-tuning performance of each candidate model is predicted by mining the model's convergence trend on the benchmark datasets, which helps filter out lower-performing models earlier during fine-tuning. Through extensive experiments on natural language processing and computer vision tasks, we demonstrate that the proposed method selects a high-performing model about 3x faster than conventional baseline methods. Our code is available at https://github.com/plasware/two-phase-selection.

Index Terms:
model selection, model clustering

I Introduction

Nowadays, a plethora of neural networks, meticulously trained in diverse fields such as natural language processing and computer vision, are readily available. These models are commonly hosted on public repositories or model hubs [1][2]. Given the wide range of these models’ training data, it is plausible that for any specific downstream task, there exists a trained model whose domain distribution of the training dataset is well-transferable for the target task. Employing such a pre-trained model for parameter initialization, followed by fine-tuning on the target dataset often leads to an enhanced performance. This is attributable to the effective transfer and adaptation of the knowledge garnered from the original model to the target task. Therefore, how to select the optimal pre-trained model from a vast collection is crucial for achieving superior training results in a new task, especially when the training data is limited [3][4].



Figure 1: Fine-tuning performance of 44 and 25 pre-trained models on the NLP task MNLI [5] and the CV task CC6204-Hackaton-Cub [6]. The x- and y-axes show the pre-trained models' IDs and their performances on the dataset, respectively. Note that for each dataset, the IDs are sorted by model accuracy in descending order.

The principal objective of model selection is to efficiently identify a well-suited pre-trained model from a repository for a novel machine learning task. However, the growing repository, while offering greater potential for good task initialization, escalates the challenge of pinpointing the optimal model. Fig. 1 illustrates the fine-tuning results of different models on two distinct machine learning tasks. Although the model pool contains a few models that exhibit commendable performance on the target task, they are markedly outnumbered by models that perform poorly. This discrepancy underscores the increasing complexity of model selection as the volume of available models surges.

The current body of research on this challenge can be divided into two main categories. The first category centers on developing lightweight proxy tasks to predict the post-fine-tuning performance of pre-trained models. While these methods offer computational efficiency by obviating the need for direct fine-tuning, they tend to be more prone to selecting sub-optimal models [7]. Conversely, the second category employs a model selection strategy during fine-tuning on the target dataset, using a successive-halving approach to retain only high-performing models at each training iteration [3][4]. However, as illustrated in Fig. 1, only a minor fraction of the models in a repository are appropriate for a specific downstream task. Therefore, even with the successive halving strategy, it is computationally inefficient to fine-tune all models in the repository, given that each model needs to be loaded and trained for at least one iteration. Additionally, the efficacy and efficiency of these existing methods decline as the number of pre-trained models continues to expand.

Hence, we propose a two-phase model selection framework, a hybrid approach that combines the advantages of both aforementioned categories, enabling the efficient selection of a suitable pre-trained model for a novel task. We split model selection into two phases: the first phase, referred to as the coarse-recall phase, is designed to identify a handful of promising model candidates based on lightweight proxy tasks. The second, fine-selection phase then only needs to fine-tune the models recalled from the first phase to identify the optimal model. This method significantly improves the efficiency of selecting a suitable model from a large repository, as fine-tuning is only carried out on a substantially reduced subset of models.

The coarse-recall phase computes a light proxy task for each model on the target dataset, keeping only the models with high scores for fast filtering. Although the use of proxy tasks avoids fine-tuning, computing a score for each model still requires model loading and inference, which becomes inefficient as the number of models increases. To speed this up, we propose to cluster similar models based on their performances on a set of benchmark datasets. This is inspired by the fact that there is overlap in the training data of public models, so models that perform similarly on the benchmark datasets will also perform similarly on a new dataset. Specifically, we construct a performance matrix by training each model offline on all benchmark datasets and saving the corresponding performances. We then cluster the models based on their performance vectors, and each time a new task arrives, we only compute scores for the clusters' representative models. By mining the similarity among models' training behavior, we avoid repeated online computation of proxy scores for similar models and make model selection more efficient.

As fine-tuning is more time-consuming, the second fine-selection phase needs a more efficient filtering strategy that avoids wasting time on low-performing models at early training steps. This is motivated by the consistency of model performance between the beginning and the end of training [8]. In this paper, we also apply the successive halving algorithm to filter out at least half of the lower-performing models at each training iteration. Meanwhile, as illustrated in Fig. 2(b), it is plausible to filter out more than half of the models if we can predict the final training performance. Again, we resort to mining the convergence processes between a pre-trained model and the benchmark datasets, as illustrated in Fig. 2(b). Specifically, for every recalled model, we collect its training processes on the benchmark datasets and cluster training processes with similar validation accuracy to form a convergence trend, which can predict the final test performance range at each iteration step. Then, after the model is fine-tuned on the target dataset for a few steps, we assign a convergence trend to the model if its current training performance is close to the training performance of that convergence trend at the current step. In this way, the final training performance of a pre-trained model on the target dataset can be predicted more accurately, which helps filter out more models at early steps.



Figure 2: The framework of two-phase model selection: (a) performance matrix, (b) model clustering based on the performance matrix, and convergence trend mining by clustering convergence processes of a pre-trained model on benchmark datasets, (c) coarse-recall phase running the recall strategy based on the proxy score computed between a model cluster and the target dataset, and (d) fine-selection phase fine-tuning the recalled models and filtering poorly-performing models according to the convergence trends. Both (a) and (b) are maintained offline and can be reused for any new task.

In summary, the contributions of this paper are as follows:

  • We propose a two-phase model selection framework. The first coarse-recall phase employs a lightweight proxy score to identify a considerably smaller set of promising model candidates. Subsequently, the second fine-selection phase exclusively fine-tunes and filters models from this refined set.

  • To further accelerate the process, we propose mining the training performances of pre-trained models on a collection of benchmark datasets and clustering similar models. Through clustering, we avoid computing proxy scores online for each model in the first phase and improve the accuracy of performance predictions for the remaining models in the second phase, thereby achieving more efficient and precise selection.

  • We conduct extensive experiments on a substantial variety of pre-trained models, covering both the natural language processing and computer vision domains. The results demonstrate that our proposed framework effectively identifies superior-performing models with greater efficiency, about 3x faster than successive halving and 5x faster than brute-force methods.

II The Framework

II-A Preliminaries

Model Repository. The model repository is a set of pre-trained models, denoted as $M=\{m_1, m_2, ..., m_n\}$. Here, a pre-trained model $m_j$ is a neural network already trained on an upstream dataset with a particular learning method, such as masked language modeling in natural language processing [9] or image classification in computer vision.

Benchmark Datasets. The benchmark datasets comprise representative datasets from the respective domain, such as GLUE [5] for natural language processing and various subsets of ImageNet [10] for computer vision. We use $D=\{d_1, d_2, ..., d_m\}$ to denote the benchmark datasets.

Performance Matrix. The performance matrix records the test results of pre-trained models fine-tuned on the benchmark datasets, denoted as $Matrix(D,M)$, where the value $Matrix(D,M)[i][j]$ is the training performance of the pre-trained model $m_j$ on the benchmark dataset $d_i$, also denoted as $p(d_i|m_j)$. The training performance can be measured by different metrics for different tasks, such as accuracy for classification tasks.

Model Cluster. A model cluster contains a group of models having similar training performances on the benchmark datasets. Fig. 2(b) illustrates two model clusters, where $C_1$ contains three models $\{m_i, m_j, m_k\}$ and $C_2$ contains two models $\{m_x, m_y\}$.

Convergence Trend. A convergence trend clusters datasets into classes on which the model has a similar training process. We use $CT(m_j)_t$ to denote the class of convergence trend to which a dataset belongs at the $t$-th validation for model $m_j$. Fig. 2(b) illustrates two convergence trends for $m_j$. The first convergence trend, $CT(m_j)_t[0]$, represents a convergence process that achieves relatively higher final training performance, while the second, $CT(m_j)_t[1]$, achieves lower final training performance. Based on the mined convergence trends, the final performance after training can be predicted more accurately at early training steps.

Proxy Score. The proxy score is computed from a proxy task which predicts the training performance $p(d_i|m_j)$ without actually fine-tuning $m_j$ on $d_i$, denoted as $proxy\_score(d_i|m_j)$. Several proxy tasks have been developed to predict $p(d_i|m_j)$, such as LEEP [11] and KNN classification [12]. In this paper, we apply the LEEP score as the $proxy\_score$, which is the average log-likelihood of the expected empirical predictor of a source model on the target dataset. The advantage of the LEEP score lies in two aspects. Firstly, LEEP can be applied to heterogeneous target tasks that have different label spaces from the pre-trained models. Secondly, computing the LEEP score does not need extra training, which is more efficient than the KNN method.
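To make the proxy computation concrete, below is a minimal sketch of a LEEP-style score computed from a source model's soft predictions on the target dataset; the function name, the NumPy-based layout of the inputs, and the smoothing constant are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def leep_score(source_probs: np.ndarray, target_labels: np.ndarray) -> float:
    """LEEP-style proxy score (a sketch following Nguyen et al. [11]).

    source_probs : (N, Z) soft predictions of the source model over its own
                   label space Z, computed on N target examples (assumed given).
    target_labels: (N,) integer labels of the target task with Y classes.
    """
    n, _ = source_probs.shape
    num_y = int(target_labels.max()) + 1

    # Empirical joint distribution P(y, z) over target labels and source labels.
    joint = np.zeros((num_y, source_probs.shape[1]))
    for probs, y in zip(source_probs, target_labels):
        joint[y] += probs
    joint /= n

    # Conditional distribution P(y | z) of target labels given source labels.
    p_y_given_z = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)

    # Expected Empirical Predictor: marginalize the source predictions through
    # P(y | z); LEEP is the mean log-likelihood at the true target label.
    eep = source_probs @ p_y_given_z.T                      # shape (N, Y)
    return float(np.mean(np.log(eep[np.arange(n), target_labels] + 1e-12)))
```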

Target Task. We use $T$ to denote the target task and $d(T)$ to denote the training dataset of $T$. The proxy score for model $m_j$ on the target dataset is denoted as $proxy\_score(d(T)|m_j)$, or $proxy\_score(T|m_j)$.

II-B Framework Overview

Fig. 2 illustrates the whole process of the two-phase framework, which consists of two computation parts:

Offline. The construction of the performance matrix constitutes the core offline process, wherein we fine-tune each model in $M$ on the benchmark datasets and record the validation and test results throughout the training process. Subsequently, we execute model clustering based on this matrix, resulting in $MC=\{C_1, C_2, ..., C_p\}$. Although this offline computation is time-consuming, it needs to be performed only once, and the generated model clusters and intermediate results can then be used directly in the online computation.

Online. This part implements the two-phase model selection for a new task $T$. The coarse-recall phase, denoted as $CR = coarse\_recall(M|MC, d(T))$, returns $K$ candidate models from the model set $M$ that are likely to achieve high training performance on $d(T)$, by computing the proxy score between the model clusters $MC$ and $d(T)$. Then, the fine-selection phase, denoted as $fine\_selection(CR|CT, d(T))$, returns the final selected model from the recalled models $CR$ using the convergence trends $CT$.

III Coarse Recall

The coarse-recall phase aims to efficiently identify a much smaller set of candidate models that tend to achieve good results on the target task. Fig. 2(c) illustrates the overall steps of the coarse-recall phase. We first present the model clustering process, and then introduce the proxy score computation between model clusters and the target dataset used to return the recalled models.

III-A Model Clustering

The computation of $proxy\_score(T|m_j)$ needs to load the model into memory and run inference on the target dataset. The loading and inference steps may take dozens of seconds for a pre-trained model with millions of parameters and a target dataset with hundreds of data items, and are consequently still time-consuming. To accelerate the coarse-recall phase, a natural approach is to group pre-trained models into clusters by measuring the similarity between them, so that the proxy score only needs to be computed for the representative model of each cluster. This reduces the time complexity of the coarse-recall phase from $O(|M|)$ to $O(|MC|)$.

The model similarity measures how likely two pre-trained models are to have similar training performance on a target dataset. Training performance can be related to various factors, such as the training data domain and quality, model architecture, and parameter size. As these factors are heterogeneous and not always available, it may be infeasible to combine them explicitly to compute model similarity. In this work, we propose to measure the similarity between models in a data-driven way, motivated by the observation that models having similar training performances on benchmark datasets also tend to have similar training performance on a new task.

Fig. 2 (a) and (b) illustrate the model clustering process. Based on the performance matrix $Matrix(D,M)$, each model $m_j$ can be represented as a $|D|$-dimensional vector $vec(m_j) = (p(d_1|m_j), p(d_2|m_j), ..., p(d_m|m_j))$. Model clustering can then be conducted with any clustering algorithm based on the models' distance $dis(m_{j1}, m_{j2})$ measured from $vec(m_{j1})$ and $vec(m_{j2})$. As the benchmark datasets cover a group of representative tasks for a machine learning application, the training performances on such datasets reflect both the feature extraction capability and the domain characteristics of a pre-trained model. Therefore, for a target task $T$, there are likely benchmark tasks that share similar feature extraction requirements or training data domains. Models that are similar as measured by $vec(m_{j1})$ and $vec(m_{j2})$ on the benchmark datasets thus also tend to have similar performance on the new task $T$. Here, we measure model similarity through the average accuracy difference on the $k$ benchmark datasets where two models have the maximum accuracy differences:

$sim(m_{j1}, m_{j2}) = 1 - avg(top_k\,|vec(m_{j1}) - vec(m_{j2})|)$   (1)

For the performance matrix $Matrix(D,M)$, although there are $m \cdot n$ elements requiring $m \cdot n$ training runs, it is not necessary to train a pre-trained model on the entire training set of each benchmark dataset, since only the top accuracy differences are used to measure model similarity. In practice, the training performance on a relatively small subset of the training data is sufficient.

Based on the above model similarity measure, we can adopt state-of-the-art clustering algorithms to group the models in the repository, such as K-means [13] or hierarchical clustering [14]. After model clustering, for a cluster $C_i$, the model in $C_i$ with the maximum average training performance on the benchmark datasets is selected as the representative model, denoted as $m(C_i)$. Then, the proxy score between the target dataset $d(T)$ and model cluster $C_i$ can be calculated as $proxy\_score(T|m(C_i))$, which avoids computing the proxy score online for all models on $d(T)$ in the coarse-recall phase.
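As an illustration of this step, the following sketch computes the pairwise similarity of Eq. 1 from the performance matrix and feeds the resulting distances into an off-the-shelf hierarchical clustering routine; the use of SciPy, the average linkage, the distance threshold, and the default value of $k$ are assumptions made for the example (the paper determines $k$ in Appendix D).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def top_k_similarity(vec_a: np.ndarray, vec_b: np.ndarray, k: int = 5) -> float:
    """Eq. 1: 1 minus the average of the k largest per-dataset accuracy gaps."""
    diffs = np.abs(vec_a - vec_b)
    return 1.0 - np.sort(diffs)[-k:].mean()

def cluster_models(perf_matrix: np.ndarray, k: int = 5, threshold: float = 0.1):
    """perf_matrix: (n_models, n_benchmarks) array of accuracies p(d_i | m_j)."""
    n = perf_matrix.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - top_k_similarity(perf_matrix[i], perf_matrix[j], k)
            dist[i, j] = dist[j, i] = d
    # Agglomerative clustering on the precomputed pairwise distances.
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=threshold, criterion="distance")
    # Representative model of each cluster: highest average benchmark accuracy.
    reps = {c: max(np.where(labels == c)[0], key=lambda m: perf_matrix[m].mean())
            for c in np.unique(labels)}
    return labels, reps
```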

III-B Model Recall

Based on the training performance matrix and the model clustering result, a $recall\_score$ is computed as in Eq. 2, where $acc(m_j)$ denotes the average accuracy of $m_j$ on the benchmark datasets $D$. The $recall\_score(T|m_j)$ contains two parts: $acc(m_j)$ captures the prior capability of model $m_j$ on any new task, and $proxy\_score(T|m_j)$ represents how well the model $m_j$ matches the specific task $T$. In this paper, we adopt LEEP [11] to compute the $proxy\_score$ and normalize the score to [0, 1]. Combining these two scores, we sort the models in descending order and retain the top $K$ models as the result of the coarse-recall phase.

$recall\_score(T|m_j) = acc(m_j) \cdot proxy\_score(T|m_j)$   (2)

As introduced in the previous section, to speed up the coarse-recall phase we only compute the $proxy\_score$ between the target dataset and the representative model of each cluster; therefore, $proxy\_score(T|m_j)$ can be rewritten as $proxy\_score(T|m(c(m_j)))$, where $c(m_j)$ denotes the cluster to which model $m_j$ belongs. Meanwhile, as there may be a number of singleton model clusters ($|C_i|=1$) after model clustering, the $proxy\_score$ is only computed between the target task $T$ and the non-singleton clusters for efficiency. Therefore, for models in non-singleton clusters, the $recall\_score$ is computed as:

$recall\_score(T|m_j) = acc(m_j) \cdot proxy\_score(T|m(c(m_j)))$, for $|c(m_j)| > 1$   (3)

As we do not compute the $proxy\_score$ directly for singleton model clusters, the $recall\_score$ for models in singleton clusters is computed as in Eq. 4, where the $proxy\_score$ is propagated from the representative models of the non-singleton clusters (denoted as $C_{non}$) and decayed by the model similarity $sim(m_j, m(C_k))$:

$recall\_score(T|m_j) = acc(m_j) \cdot \frac{1}{|C_{non}|} \sum_{k=1}^{|C_{non}|} sim(m_j, m(C_k)) \cdot proxy\_score(T|m(C_k))$   (4)

Combining Eq. 3 and Eq. 4, we can compute the $recall\_score$ for all models in $M$ and return the top $K$ models to the fine-selection phase.
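A minimal sketch of how Eqs. 2-4 could be assembled, assuming the cluster labels, representative models, the Eq. 1 similarity function, and the per-cluster proxy scores (e.g., from the LEEP sketch above) are already available; the data layout and names are illustrative.

```python
import numpy as np

def recall_scores(perf_matrix, labels, reps, proxy, sim):
    """perf_matrix: (n_models, n_benchmarks) accuracies; labels[j]: cluster of m_j;
    reps[c]: representative model of cluster c; proxy[c]: proxy_score(T | m(C_c)),
    available only for non-singleton clusters; sim(i, j): Eq. 1 similarity."""
    acc = perf_matrix.mean(axis=1)                     # prior capability acc(m_j)
    sizes = {c: int(np.sum(labels == c)) for c in set(labels)}
    non_singleton = [c for c, s in sizes.items() if s > 1]

    scores = np.zeros(len(labels))
    for j, c in enumerate(labels):
        if sizes[c] > 1:
            # Eq. 3: inherit the proxy score of the cluster's representative model.
            scores[j] = acc[j] * proxy[c]
        else:
            # Eq. 4: propagate proxy scores from non-singleton representatives,
            # decayed by the similarity to each representative model.
            propagated = np.mean([sim(j, reps[c2]) * proxy[c2]
                                  for c2 in non_singleton])
            scores[j] = acc[j] * propagated
    return scores   # sort in descending order and keep the top-K models
```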

IV Fine-Selection

After the initial coarse-recall phase, the number of candidate pre-trained models is reduced from $O(|M|)$ to $O(|MC|)$. Next, based on successive halving, we utilize the convergence information from fine-tuning the models on the benchmark datasets to select a good model more quickly and accurately.

IV-A Early Stopping

Fine-tuning models is a time-consuming process. The key to improving efficiency is the ability to filter out poorly-performing models at earlier training steps. Therefore, we are interested in whether there is a strong and prevalent correlation between the initial validation performance and the final test performance during the fine-tuning of pre-trained models. Such a correlation would allow us to filter out more models at an earlier stage of training. For each target dataset, we plot the validation performance during fine-tuning for the models that pass the initial screening. Fig. 3 illustrates the performance changes of 10 models screened for the MNLI dataset. It can be observed that the models performing well on the test set also exhibit better validation performance in the early stages of training, and only two of the ten models achieve much higher performance after the first training epoch. Therefore, we do not need to fine-tune all models to convergence. Instead, we can filter out under-performing models at an earlier stage, leading to a more efficient model selection process. Fig. 8 in Appendix A [15] shows the models' performances under another set of hyperparameters, illustrating the sensitivity of the training process to hyperparameters and the robustness of our method.



Figure 3: Top-10 models' validation and test results on the MNLI dataset. Model names omit their repository prefixes; the full names can be found in Table VIII in Appendix B [15].

IV-B Successive Halving

Based on the observation that models which perform well in early training iterations are likely to maintain superior performance when trained to full convergence, successive halving is the state-of-the-art method for speeding up the model selection process [3][4]. The successive halving algorithm operates iteratively in stages and halves the pool of considered models at each stage. Specifically, each model is trained for a fixed number of iterations at each stage and then its performance is validated. Models with poor performance are discarded, while the better-performing ones proceed to the next round of more intensive training. This process continues until the top models are identified. Assuming the initial number of models is $|M|$ and the training budget of each halving stage is $s$ batches, the total budget for identifying the highest-performing model would typically be approximately $|M| \cdot s \cdot \log_2(|M|)$ steps.
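For reference, a plain successive-halving loop (without convergence-trend prediction) might look like the sketch below, where train_steps and validate are placeholders for the actual fine-tuning and validation routines.

```python
def successive_halving(models, train_steps, validate, s):
    """Keep the better half after every s training steps until one model is left."""
    pool = list(models)
    while len(pool) > 1:
        for m in pool:
            train_steps(m, s)                      # fine-tune each survivor for s steps
        pool.sort(key=validate, reverse=True)      # rank survivors by validation accuracy
        pool = pool[: max(1, len(pool) // 2)]      # discard the worse half
    return pool[0]
```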

IV-C Fine-Selection Algorithm

While successive halving ensures that computational resources are largely devoted to the most promising models, filtering out only half of the models in each round limits further improvements in selection efficiency. Therefore, we propose a refinement called fine-selection, which builds on successive halving and further leverages the fine-tuning information of each model on the benchmark datasets. After observing the fine-tuning performance of a model on various datasets, we found that the performance curves of the same model across different datasets can be categorized into distinct clusters. As illustrated in Fig. 4, the fine-tuning performance of the BERT_base model on some benchmark datasets can be divided into four groups. Similar phenomena were observed across different models. Therefore, we propose mining the convergence trends of models from their fine-tuning performance on benchmark datasets to predict a model's final performance on the target dataset.

For a target dataset $d(T)$ and a given pre-trained model $m_j$, after training $m_j$ on $d(T)$ for every $s$ steps, we can compute the validation accuracy $val(T|m_j)_t$ at stage $t$ and predict the final training performance as follows. First, we generate a group of convergence trends for $m_j$, denoted as $\{CT(m_j)_t\}$. Specifically, we cluster the benchmark datasets into $c$ clusters based on the validation accuracy of $m_j$ on these datasets. Then, a convergence trend $CT(m_j)_t[x] = (\overline{val_x}, \overline{test_x})$ is computed, where $\overline{val_x}$ and $\overline{test_x}$ are the average validation and test accuracy of $m_j$ on the datasets in cluster $x$, respectively. Based on the generated convergence trends $CT(m_j)_t$, we can assign the best-matched convergence trend for $m_j$ trained on $d(T)$ after $s$ steps as in Eq. 5. The final training performance is then predicted as the test accuracy of the matched convergence trend, as in Eq. 6.

$matched(val(T|m_j)_t) = \arg\min_x |\overline{val_x} - val(T|m_j)_t|$   (5)
$pred(T|m_j)_t = CT(m_j)_t[matched(val(T|m_j)_t)]$   (6)
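A small sketch of Eqs. 5 and 6, assuming each convergence trend of $m_j$ at stage $t$ is summarized as a pair of its average validation accuracy and average final test accuracy over one cluster of benchmark datasets; the list-of-pairs layout is an assumption for illustration.

```python
import numpy as np

def match_and_predict(trends, val_t):
    """trends: list of (avg_val_t, avg_final_test) pairs, one per convergence
    trend of the model at stage t; val_t: validation accuracy of the model on
    the target dataset at stage t. Returns the predicted final test accuracy."""
    avg_vals = np.array([v for v, _ in trends])
    x = int(np.argmin(np.abs(avg_vals - val_t)))    # Eq. 5: nearest trend
    return trends[x][1]                             # Eq. 6: its average final test accuracy
```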



Figure 4: Validation and test performance of the DoyyingFace/bert-asian-hate-tweets-asian-unclean-freeze-4 model on 30 datasets.

Based on convergence trend mining, Algorithm 1 describes the proposed fine-selection algorithm. As in successive halving, once each remaining model has been fine-tuned for $s$ steps, if the number of remaining models is not 1, fine-selection is performed. The specific process is as follows:

  • Obtain: Collect the model's validation results at every stage and its final test results on all benchmark datasets.

  • Convergence Trend Mining and Matching: Cluster the validation results at the current stage and obtain the matched convergence trend by Eq. 5.

  • Predict: Use the mean final test performance of the matched convergence trend as the predicted result by Eq. 6.

  • Fine-Filter: Among the remaining models, starting from the model with the worst validation performance, remove a model if there exists another model with better validation performance whose predicted final performance is also better by a given threshold.

  • Halving: If the number of remaining models is still more than half of the number of models at the beginning of this stage, we repeatedly eliminate the model with the worst validation result until the number is reduced by half.

In summary, our fine-selection method ultimately yields a single model fully trained on the target dataset. It ensures that at least half of the models are filtered out at each step, thereby resulting in a selection efficiency significantly higher than that of successive halving.

Algorithm 1 Fine-Selection
Input: Recalled models $M_0$; validation and test results $val$ and $test$ of $M$ on $D$; total training steps $T$; validation interval $s$
Output: Final trained model
1:  for $t = 0$ to $\lfloor T/s \rfloor - 1$ do
2:     $M' \leftarrow$ train each model in $M_t$ for $s$ steps
3:     if $|M'| \neq 1$ then
4:        $v_j \leftarrow$ validate each model
5:        $x_j \leftarrow$ match convergence trends by Eq. 5
6:        $pred_j \leftarrow$ predict final performance by Eq. 6
7:        $M' \leftarrow$ remove models with lower $v$ and $pred$, where the difference in $pred$ exceeds the threshold
8:        while $|M'| > \lfloor |M_t|/2 \rfloor$ do
9:           $M' \leftarrow$ remove the model in $M'$ with the lowest $v$
10:       end while
11:    end if
12:    $M_{t+1} \leftarrow M'$
13: end for
14: return $M_{T/s}[0]$
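The following is a compact, hedged Python rendering of Algorithm 1, reusing the match_and_predict helper sketched above; the helper callables and the threshold value are illustrative assumptions rather than the exact implementation.

```python
def fine_selection(models, train_steps, validate, trends_at, total_steps, s,
                   threshold=0.02):
    """Sketch of Algorithm 1; trends_at(m, t) returns the convergence trends
    [(avg_val, avg_test), ...] of model m at stage t, mined offline."""
    pool = list(models)
    for t in range(total_steps // s):
        for m in pool:
            train_steps(m, s)                                   # line 2: train survivors
        if len(pool) == 1:
            continue
        vals = {m: validate(m) for m in pool}                   # line 4
        preds = {m: match_and_predict(trends_at(m, t), vals[m]) # lines 5-6
                 for m in pool}
        # Fine-filter (line 7): drop a model if some other survivor has better
        # validation accuracy and a predicted final accuracy better by more
        # than the threshold.
        survivors = [m for m in pool
                     if not any(vals[o] > vals[m] and
                                preds[o] - preds[m] > threshold
                                for o in pool if o is not m)]
        # Halving (lines 8-10): ensure at least half of the models are removed.
        target = max(1, len(pool) // 2)
        if len(survivors) > target:
            survivors.sort(key=lambda m: vals[m], reverse=True)
            survivors = survivors[:target]
        pool = survivors
    return pool[0]
```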

V Experiments

In this section, we present the model selection experiment results on natural language processing and computer vision tasks, and demonstrate the effectiveness and efficiency of the proposed methods.

V-A Experiment Setup

Pre-trained Models. The models are collected from HuggingFace [1]. For natural language tasks, we select 40 models as the model repository, containing both state-of-the-art pre-trained models (such as BERT [9] and RoBERTa [16]) and models fine-tuned on downstream tasks. For computer vision tasks, we select 30 models as the repository. The model architectures include ViT [17], BEiT [18], DeiT [19], PoolFormer [20], DiNAT [21], and Visual Attention Network [22]. Similar to the NLP models, these include state-of-the-art models and their fine-tuned versions. A complete list of models is given in Appendix B [15].

Datasets. The experiments are conducted on NLP and CV datasets which are divided into benchmark datasets and target datasets. The benchmark datasets are used to construct the performance matrix for model clustering, while the target datasets are used to evaluate the proposed two-phase model selection method. All datasets are classification tasks, and the two groups are completely disjoint, differing in sub-tasks, label numbers, and label distributions. It is worth noting that our datasets contain both common datasets such as GLUE [5] and cifar10 [23], as well as domain-specific datasets from fields such as finance and medicine. These datasets are available on HuggingFace and have been split into training and testing sets by their contributors. The datasets used for performance matrix construction and method evaluation are given below:

  • Benchmark Datasets: The benchmark datasets for NLP are mainly from GLUE [5] and SuperGLUE [24]: COLA, MRPC, QNLI, QQP, RTE, SST2, STSB, and WNLI from GLUE, and CB, COPA, and WIC from SuperGLUE. We also use some domain-specific tasks popular on HuggingFace: imdb [25], yelp_review_full [26], yahoo_answer_topic, dbpedia_14 [26], xnli [27], anli [28], app_reviews [29], trec [30][31], sick [32], and financial_phrasebank [33]. For the CV benchmark, we use image classification datasets from different domains: food101 [34], CC6204-Hackaton-Cub-Dataset [6], cats_vs_dogs [35], cifar10 [23], and MNIST [36].

  • Target Datasets: For NLP evaluation, four tasks are selected: tweet_eval [37], collected from Twitter reviews, MNLI [38] from the GLUE benchmark, and MultiRC [39] and BoolQ [40] from the SuperGLUE benchmark. For CV evaluation, four image classification datasets from different domains are selected: chest-xray-classification [41], MedMNIST [42], oxford-flowers [43], and beans [44].

Further descriptions of the datasets are given in Appendix C [15].

Performance Matrix. We build the performance matrix by fine-tuning all pre-trained models on the benchmark datasets, which amounts to $40 \times 24$ training runs for natural language processing and $30 \times 10$ runs for computer vision. Each run uses 5 epochs for natural language processing and 4 epochs for computer vision, which is enough to compare the relative accuracy of different runs and to support convergence trend mining for the fine-selection phase.
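The offline construction of the performance matrix reduces to a nested loop over models and benchmark datasets; a minimal sketch is given below, where fine_tune_and_eval is a hypothetical placeholder for the per-task fine-tuning routine (e.g., 5 epochs for NLP, 4 for CV) that returns the per-stage validation accuracies and the final test accuracy.

```python
import numpy as np

def build_performance_matrix(models, benchmarks, fine_tune_and_eval):
    """Returns Matrix(D, M) of final test accuracies plus the per-stage
    validation curves used later for convergence-trend mining."""
    matrix = np.zeros((len(benchmarks), len(models)))
    curves = {}                                    # (dataset, model) -> [val_t, ...]
    for i, d in enumerate(benchmarks):
        for j, m in enumerate(models):
            val_curve, test_acc = fine_tune_and_eval(m, d)
            matrix[i, j] = test_acc                # p(d_i | m_j)
            curves[(i, j)] = val_curve
    return matrix, curves
```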

TABLE I: Clustering Methods Comparison
Model Similarity     Hierarchical clustering      K-means
                     NLP        CV                NLP        CV
performance-based    0.505      0.806             0.466      0.702
text-based           0.476      0.696             0.453      0.732

V-B Experiment for Coarse-Recall Phase

For the coarse-recall phase, we first study the experimental results for model clustering and then study the results for model recall.

Model Clustering. We study the model clustering results by comparing different model similarity measures and clustering algorithms, with clustering quality measured by the silhouette coefficient [45]. For the similarity comparison, the performance-based similarity is calculated by Eq. 1, and we conduct an experiment in Appendix D [15] to determine the parameter $k$. The text-based similarity is calculated from the text of the corresponding model card (Appendix E [15] shows an example of a model card); we adopt SBERT [46] to encode the text into a vector so that cosine similarity can be computed. For the clustering algorithms, we compare two state-of-the-art methods, K-means and hierarchical clustering.
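For the text-based baseline, a minimal sketch using the sentence-transformers library is shown below; the specific SBERT checkpoint is an assumption for illustration.

```python
from sentence_transformers import SentenceTransformer, util

def model_card_similarity(card_text_a: str, card_text_b: str) -> float:
    """Cosine similarity between SBERT embeddings of two model cards."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint
    emb = encoder.encode([card_text_a, card_text_b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```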

Table I shows the comparison results. The performance-based similarity achieves a higher silhouette coefficient than the text-based similarity, demonstrating that the former generates a better clustering structure. For the clustering algorithms, hierarchical clustering outperforms K-means by a clear margin under the performance-based similarity, showing that the clusters generated by hierarchical clustering are better separated from each other and more cohesive internally. Furthermore, the clusters generated by hierarchical clustering are more reasonable than those generated by K-means, as discussed below. Therefore, we conduct the following experiments based on the results of hierarchical clustering.

TABLE II: Model Clustering Results
Model Clusters of Natural Language Processing
Cluster  Size  Pre-trained Models
C1       5     Jeevesh8/bert_ft_qqp-68, Jeevesh8/bert_ft_qqp-9, Jeevesh8/bert_ft_qqp-40, connectivity/bert_ft_qqp-1, connectivity/bert_ft_qqp-7
C2       7     Jeevesh8/512seq_len_6ep_bert_ft_cola-91, anirudh21/bert-base-uncased-finetuned-qnli, Jeevesh8/bert_ft_cola-88, manueltonneau/bert-twitter-en-is-hired, bert-base-uncased, aditeyabaral/finetuned-sail2017-xlm-roberta-base, DoyyingFace/bert-asian-hate-tweets-asian-unclean-freeze-4
C3       5     Jeevesh8/feather_berts_46, ishan/bert-base-uncased-mnli, roberta-base, Alireza1044/albert-base-v2-qnli, albert-base-v2
C4       2     CAMeL-Lab/bert-base-arabic-camelbert-mix-did-nadi, aliosm/sha3bor-metre-detector-arabertv2-base
C5       2     Splend1dchan/bert-base-uncased-slue-goldtrascription-e3-lr1e-4, aychang/bert-base-cased-trec-coarse
C6       3     aviator-neural/bert-base-uncased-sst2, distilbert-base-uncased, 18811449050/bert_finetuning_test
C7       4     Jeevesh8/init_bert_ft_qqp-33, Jeevesh8/init_bert_ft_qqp-24, connectivity/bert_ft_qqp-17, connectivity/bert_ft_qqp-96
C8       2     XSY/albert-base-v2-imdb-calssification, emrecan/bert-base-multilingual-cased-snli_tr
Model Clusters of Computer Vision
Cluster  Size  Pre-trained Models
C1       6     facebook/deit-base-patch16-224, facebook/deit-base-patch16-384, facebook/dino-vits16, facebook/vit-msn-base, facebook/vit-msn-small, Visual-Attention-Network/van-large
C2       2     facebook/deit-small-patch16-224, Visual-Attention-Network/van-base
C3       11    facebook/dino-vitb16, facebook/dino-vitb8, google/vit-base-patch16-224, google/vit-base-patch16-384, lixiqi/beit-base-patch16-224-pt22k-ft22k-finetuned-FER2013-6e-05, lixiqi/beit-base-patch16-224-pt22k-ft22k-finetuned-FER2013-7e-05, lixiqi/beit-base-patch16-224-pt22k-ft22k-finetuned-FER-5e-05-3, microsoft/beit-base-patch16-224, microsoft/beit-base-patch16-224-pt22k-ft22k, microsoft/beit-base-patch16-384, nateraw/vit-age-classifier
C4       2     shi-labs/dinat-large-in22k-in1k-224, shi-labs/dinat-large-in22k-in1k-384
C5       2     sail/poolformer-m36, sail/poolformer-m48
C6       2     shi-labs/dinat-base-in1k-224, microsoft/beit-large-patch16-224-pt22k
Table II shows the clustering results for the non-singleton clusters in natural language processing and computer vision. For natural language processing, there are 8 non-singleton clusters containing 30 of the 40 models. For computer vision, there are 6 non-singleton clusters containing almost all of the models. We examine some clusters in detail. For the natural language processing models, as introduction information is unavailable for many models, we infer the training process from the model name. Clusters $C_1$ and $C_2$ mainly contain pre-trained models fine-tuned on the qqp [47] and cola [48] datasets respectively, demonstrating that models fine-tuned on the same downstream task can be grouped together by model clustering. On the other hand, there are also models whose names contain qqp that are not clustered into $C_1$, such as those in $C_7$, showing that the performance of models with similar names may still vary, possibly due to different training setups. Cluster $C_3$ groups models whose names contain mnli and feather_berts together, which is reasonable since [49] shows that the feather_berts models are also BERT models fine-tuned on the MNLI dataset. This further demonstrates the effectiveness of the model clustering. The computer vision clusters exhibit a similar phenomenon. Cluster $C_1$ mainly contains base-size DeiT models [19], small-size ViT models using DINO [50], and ViT models using MSN [51]. Looking further into these models, we find that all three kinds are pre-trained or fine-tuned on ImageNet-1k [52]. Cluster $C_3$ mainly contains base-size ViT models using DINO, ViT base models, and BEiT base models. Similar to $C_1$, these models, except the DINO ViTs, are all pre-trained or fine-tuned on ImageNet-21k [53]. On the other hand, there are also models that are not grouped into one cluster even though they share similar names or training datasets, like the DINO ViT models in $C_1$ and $C_3$, which again demonstrates that models with similar names may have different performance. The result of the K-means clustering method is given in Appendix F [15]; generally speaking, some of its clusters contain models of different structures and/or training datasets.

We also study the performance of models in non-singleton clusters in Table III. The average accuracy of models in non-singleton clusters is significantly higher than that of models in singleton clusters for both natural language processing and computer vision tasks. We also count, for each benchmark dataset, how many models achieving the maximum accuracy fall into each cluster type; the result is similar, with models in non-singleton clusters contributing almost all of the best models for the benchmark datasets. Table III thus shows that models in non-singleton clusters tend to achieve higher training performance. A possible explanation is that high-quality models reach similar performance, bounded by the state of the art, whereas poorly performing models may differ widely across datasets. This phenomenon supports the proxy_score computation strategy in the coarse-recall phase, in which the proxy_score is computed only between the target dataset and the representative models of non-singleton clusters and then propagated to the models in singleton clusters.
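As an illustration of this scoring strategy, the sketch below computes the proxy score only for one representative per non-singleton cluster and propagates it; how singleton models receive a propagated score is not specified above, so assigning them the score of the closest representative in the performance matrix is an assumption of the sketch, as is the choice of the first member as representative.

```python
import numpy as np


def coarse_recall_scores(perf, clusters, proxy_score_fn):
    """Sketch of the coarse-recall scoring strategy.

    perf: dict model name -> benchmark accuracy vector (np.ndarray).
    clusters: dict cluster id -> list of model names.
    proxy_score_fn: model name -> light-weight proxy score (e.g. LEEP)
        against the target dataset."""
    rep_scores, scores = {}, {}
    for members in clusters.values():
        if len(members) > 1:
            rep = members[0]  # representative choice is illustrative
            rep_scores[rep] = proxy_score_fn(rep)
            for m in members:
                scores[m] = rep_scores[rep]
    for members in clusters.values():
        if len(members) == 1:
            m = members[0]
            # Assumption: singleton models inherit the score of the
            # representative closest to them in the performance matrix.
            nearest = min(rep_scores,
                          key=lambda r: np.linalg.norm(perf[m] - perf[r]))
            scores[m] = rep_scores[nearest]
    return scores
```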

TABLE III: Performance comparison of models in singleton and non-singleton clusters
Task Type  Cluster Type   Avg(Acc)  No. of Max(Acc) Models
NLP        Non-Singleton  0.67      22
           Singleton      0.61      2
CV         Non-Singleton  0.84      10
           Singleton      0.73      0

Figure 5: The average accuracy comparison of recalled models

Figure 6: Clustering performance based on the first validation results. The blue part compares random clustering with clustering based on validation performance, using the silhouette score as the metric; the higher the silhouette score, the better the clustering. The red part compares predicting the test performance with the clustering method against predicting with the mean of all historical test performances; the metric is the absolute difference between the predicted and actual test performance divided by the actual test performance, averaged with each dataset taken as the target dataset, so smaller values indicate more accurate predictions. Full model names can be found in Table VIII in Appendix B [15].

Model Recall. To evaluate the effectiveness of the coarse-recall phase, we fine-tune all the models on the corresponding target datasets to obtain their actual training performance, and then compare the average training accuracy of the top K recalled models on each target dataset. Fig. 5 compares the average accuracy of coarse-recall and random-recall, where random-recall returns K models drawn uniformly at random from the model repository. Coarse-recall achieves higher accuracy than random-recall on all eight target datasets. Moreover, for coarse-recall, the average accuracy at smaller values of K is higher than at larger values, indicating that the top models recalled by coarse-recall tend to achieve higher training performance than models with lower recall scores. We also find that the top 5 models recalled by coarse-recall already contain the model achieving the maximum training performance for most target datasets, while for the Tweet, Beans, and MedMNIST datasets, the number of recalled models needed to contain the best model is 10, 10, and 15, respectively. In the following experiments, we empirically set the number of recalled models to 10, which accounts for about 25% and 30% of the total models for natural language processing and computer vision tasks, respectively.
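The evaluation protocol above can be sketched as follows; the function and argument names are illustrative, and the fully fine-tuned accuracies are assumed to be available offline for evaluation purposes only.

```python
import numpy as np


def recall_quality(recall_scores, true_acc, ks=(5, 10, 15)):
    """Evaluate coarse-recall as in Fig. 5 (a sketch of the protocol).

    recall_scores: dict model -> proxy score from coarse-recall.
    true_acc:      dict model -> accuracy after fully fine-tuning the model
                   on the target dataset (measured offline for evaluation).
    For each K, report the average true accuracy of the top-K recalled
    models and whether the overall best model is among them."""
    ranked = sorted(recall_scores, key=recall_scores.get, reverse=True)
    best_model = max(true_acc, key=true_acc.get)
    report = {}
    for k in ks:
        top_k = ranked[:k]
        report[k] = {
            "avg_acc": float(np.mean([true_acc[m] for m in top_k])),
            "contains_best": best_model in top_k,
        }
    return report
```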

V-C Experiments for Fine-Selection Phase

Convergence Trend. In this section, we demonstrate the effectiveness of the convergence trend mined from a model's validation performance on the benchmark datasets. We apply clustering during the fine-selection phase, and usually a single round of filtering is sufficient to select the final model; therefore, we only consider clustering based on the first validation results. First, we directly measure the quality of the clustering with the silhouette coefficient [45], where a higher value indicates a better clustering. As shown in the blue part of Fig. 6, across all models, clustering based on validation results significantly outperforms random clustering, demonstrating the usefulness of validation results for clustering. In addition, treating each benchmark dataset in turn as the target dataset, we assess the feasibility of predicting a model's final test performance using the mean test performance of the models in the cluster to which it belongs, and compare this against predicting with the mean of all benchmark test performances. The metric is the absolute difference between the predicted and actual test performance divided by the actual test performance, averaged across all datasets. The red part of Fig. 6 shows that our method predicts the final test performance more accurately. This not only validates using the within-cluster mean test performance as a prediction but also further demonstrates the effectiveness of the clustering. In summary, it is feasible to select models by mining convergence trends from validation performance and predicting each model's final test performance.
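A minimal sketch of this procedure is given below, assuming hierarchical clustering over the first validation accuracies; the number of clusters and the assignment of the target run to the nearest cluster centroid are assumptions of the sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score


def predict_final_performance(first_val, final_test, target_first_val,
                              n_clusters=3):
    """Sketch of convergence-trend mining from first validation results.

    first_val:  (n_runs,) first-validation accuracies of historical
                fine-tuning runs on benchmark datasets.
    final_test: (n_runs,) their final test accuracies.
    target_first_val: first-validation accuracy of the model currently
                being fine-tuned on the target dataset.
    Returns the predicted final accuracy (within-cluster mean) and the
    silhouette score of the clustering."""
    X = np.asarray(first_val, dtype=float).reshape(-1, 1)
    labels = fcluster(linkage(X, method="average"),
                      t=n_clusters, criterion="maxclust")
    sil = silhouette_score(X, labels)
    # Assign the target run to the cluster with the closest centroid (assumption).
    centroids = {c: float(X[labels == c].mean()) for c in np.unique(labels)}
    c_star = min(centroids, key=lambda c: abs(centroids[c] - target_first_val))
    prediction = float(np.asarray(final_test)[labels == c_star].mean())
    return prediction, sil
```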

Filtering Threshold. Given validation fluctuations during training and potentially significant discrepancies between benchmark and target datasets, filtering models by convergence trend alone might eliminate models that would eventually perform well. We therefore introduce a threshold and stipulate that a model is filtered out only when there is another model with better validation performance whose predicted performance improvement exceeds this threshold; the threshold is expressed as a proportion of the difference between the predicted performances. Table IV shows the performance of fine-tuning under various threshold settings. It can be observed that the threshold ensures better-performing models are filtered out later, albeit at the expense of efficiency. To represent the efficiency of our method and the performance of the selected models more directly, we uniformly use a 0% threshold in subsequent experiments.
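The filtering rule can be sketched as follows; interpreting the threshold as a relative margin on the predicted performance is our reading of the rule above, not a definitive implementation.

```python
def filter_candidates(candidates, threshold=0.0):
    """Sketch of threshold-based filtering in fine-selection.

    candidates: list of dicts with keys 'name', 'val_acc' (current
    validation accuracy) and 'pred_acc' (predicted final accuracy).
    A model is dropped only if some other model has both a better current
    validation accuracy and a predicted improvement over it exceeding the
    threshold (a relative margin here, which is an assumption)."""
    kept = []
    for a in candidates:
        dominated = any(
            b["val_acc"] > a["val_acc"]
            and (b["pred_acc"] - a["pred_acc"]) > threshold * a["pred_acc"]
            for b in candidates
            if b is not a
        )
        if not dominated:
            kept.append(a)
    return kept
```

With threshold=0.0 (the default setting in Table IV), a model is dropped as soon as another candidate has both a better current validation accuracy and a higher predicted final accuracy.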

TABLE IV: Accuracy and runtime comparison among different filtering threshold settings in fine-selection. 0% is the original setting.
Dataset            0%     1%     5%     10%
MNLI     Accuracy  0.85   0.85   0.85   0.85
         Runtime   14     14     15     16
MultiRC  Accuracy  0.63   0.63   0.63   0.63
         Runtime   16     16     19     19
Flowers  Accuracy  0.985  0.985  0.986  0.986
         Runtime   15     15     18     18
X-Ray    Accuracy  0.966  0.966  0.969  0.969
         Runtime   14     15     18     18

Method Comparison. Having validated the effectiveness of mining convergence trends from model training performance on benchmark datasets, we further verify its value within complete model selection methods.

V-C1 Performance

Fig. 7 compares the final performance of the model selected by successive halving with that of our proposed fine-selection method, both when selecting from the top 10 models and when selecting from all models. We also report the best and worst performances among the top 10 models for the NLP and CV tasks. Fine-selection (FS) is always able to pick the optimal or a near-optimal model, whereas traditional successive halving does not necessarily select the best model.

Figure 7: Performance of the model selected by successive halving (SH) and our fine-selection (FS) method on the NLP datasets MNLI, Tweet, Boolq, and MultiRC and the CV datasets Beans, MedMNIST, Flowers, and X-Ray. 10, 30, and 40 denote the initial number of models for filtering. The best and worst performances among the top 10 models for each dataset are also provided.

V-C2 Time

In addition to the performance of the selected model, the speed of model selection is equally important. In Table V, we compare the speedups of the two methods relative to brute-force search (BF), i.e., fine-tuning all models for a fixed number of epochs. Since the training settings and hardware environment are consistent, we use the total number of fine-tuning epochs across all models to represent the model selection time. Our method shows a noticeable efficiency improvement over successive halving on all datasets and model counts. Moreover, since the wall-clock time per epoch is considerable, the time saved by our method in model selection is substantial.

TABLE V: Time comparison among different model selection methods. NLP tasks are trained for 5 epochs and CV tasks for 4 epochs. Runtime is the total number of training epochs. Results are provided for selecting from the top 10 models and from all models (40 for NLP, 30 for CV).
                         10 models               40 (NLP) / 30 (CV) models
            Method       Runtime    Speedup      Runtime    Speedup
                         (epochs)   (vs. BF)     (epochs)   (vs. BF)
NLP         BF           50         -            200        -
            SH           19         2.63x        77         2.60x
  Tweet     FS           14         3.57x        44         4.55x
  MNLI      FS           14         3.57x        44         4.55x
  MultiRC   FS           15         3.33x        46         4.35x
  Boolq     FS           16         3.13x        48         4.17x
CV          BF           40         -            120        -
            SH           18         2.22x        55         2.18x
  X-Ray     FS           13         3.08x        38         3.16x
  MedMNIST  FS           15         2.67x        37         3.24x
  Flowers   FS           15         2.67x        36         3.33x
  Beans     FS           17         2.35x        41         2.93x

V-C3 Scaling to more models

In addition to the 10 models obtained by coarse-recall, we further explore the performance of our method as the number of models increases. Fig. 7 and Table V provide performance and time comparisons for 40 natural language processing models and 30 computer vision models. As the number of models grows, our method not only maintains the quality of model selection but also saves significantly more time. This indicates that fine-selection is not only suitable for fine-grained selection after coarse-grained recall but also remains effective with a larger number of models. Our method can therefore scale to accommodate the ever-growing number of published pre-trained models.

V-D Overall Performance

We study end-to-end model selection effectiveness and efficiency in this section. The comparison methods are again brute-force search (BF) and successive halving (SH) from the previous section. The total number of models is 40 for NLP and 30 for CV, and 10 models are recalled. Table VI compares the accuracy and efficiency of the different methods, where CR+FS denotes the two-phase framework (coarse-recall and fine-selection) proposed in this paper. Both SH and the two-phase method achieve accuracy close to BF. Table VI also reports the time consumed by the three methods in terms of training epochs. For CR+FS, since the proxy_score must be computed in the coarse-recall phase, which requires inference on the target task, we count this computation as 0.5·|MC| epochs, because inference does not require computing gradients for back-propagation. From Table VI, the two-phase method is about 2x to 3x faster than SH and about 5x to 8x faster than BF. The speedup comes from both phases: the coarse-recall phase greatly reduces the number of models that need to be fine-tuned, and the fine-selection phase further reduces the fine-tuning time by exploiting convergence trends. These results demonstrate that the proposed model selection method achieves high training performance with significantly lower compute time, and is therefore better suited to large-scale model repositories.
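The epoch accounting behind Table VI can be sketched as follows; the helper names are illustrative and only reproduce the convention described above.

```python
def two_phase_runtime(num_scored, fine_selection_epochs):
    """Epoch accounting for CR+FS used in Table VI: proxy-score inference
    over the scored models/clusters MC is charged 0.5 epoch each (no
    back-propagation), plus the fine-tuning epochs spent in fine-selection."""
    return 0.5 * num_scored + fine_selection_epochs


def speedup(baseline_epochs, method_epochs):
    """Speedup relative to a baseline such as brute-force search."""
    return baseline_epochs / method_epochs


# For example, a 19-epoch CR+FS run against the 200-epoch NLP brute-force
# budget gives 200 / 19, roughly 10.5x, matching the MNLI row of Table VI.
```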

TABLE VI: End-to-end comparison of time and performance among different model selection methods. 2PH, BF, and SH are short for the two-phase method (coarse-recall + fine-selection), brute-force search, and successive halving, respectively. Acc is accuracy.
          Runtime(2PH)  Speedup(vs. BF)  Speedup(vs. SH)  Acc(BF)  Acc(SH)  Acc(2PH)
Tweet     19            10.53x           4.05x            0.650    0.60     0.650
MNLI      19            10.53x           4.05x            0.850    0.850    0.850
MultiRC   20            10.00x           3.85x            0.640    0.630    0.630
Boolq     21            9.52x            3.67x            0.720    0.720    0.720
X-Ray     18            6.67x            3.06x            0.969    0.962    0.962
MedMNIST  20            6.00x            2.75x            0.779    0.773    0.773
Flowers   20            6.00x            2.75x            0.986    0.986    0.985
Beans     22            5.45x            2.50x            0.968    0.961    0.961
TABLE VII: Case study of the final selected model after coarse-recall and fine-selection on four datasets. Acc and R are short for accuracy and rank. The ranks for CR are obtained by sorting the recalled models by proxy score. Avg_Acc is the average accuracy of the recalled models.
Dataset   Best_model              Acc    R@CR  Avg_Acc
MultiRC   albert-base-v2          0.630  5     0.574
Boolq     bert-base-uncased-mnli  0.720  0     0.635
MedMNIST  vit-base-patch16-384    0.773  1     0.768
Flowers   vit-base-patch16-224    0.985  9     0.891

V-E Generalization Study

The generalization capability of the proposed method to new target tasks rests on three aspects. First, we can address new tasks whose domain distributions and task types differ from the benchmark datasets, because the benchmark datasets are only used offline to measure model similarity and are used in neither the coarse-recall nor the fine-selection phase. Second, the LEEP score used in the coarse-recall phase has been shown to measure transferability between heterogeneous tasks. Third, the fine-selection phase selects models directly from the fine-tuning results at different validation intervals, which evaluates the final transferability of a source model more accurately.

Table VII shows the best selected models for four target tasks whose domain distributions and task types differ from the benchmark datasets. The best models are ranked near the top in the coarse-recall phase, and their accuracies are all higher than the corresponding average accuracy (Avg_Acc). Boolq is a yes/no question answering task over given passages, and the best model selected for it is bert-base-uncased-mnli, a BERT model fine-tuned on the MNLI dataset. Since Boolq and MNLI differ in both input format and output label space, this result demonstrates that the proposed method can capture latent transferability between heterogeneous tasks. For MultiRC, the selected model is albert-base-v2, a pre-trained model that has not been fine-tuned on any downstream dataset. For computer vision, both the Flowers and MedMNIST datasets select vit-base-patch16 models, which were trained on data with domain distributions different from the corresponding tasks, demonstrating the out-of-domain capability of the proposed method on CV tasks.

VI Related Work

Pre-training can be viewed as an application of deep transfer learning [54], where a model is pre-trained on an upstream dataset and then used as parameter initialization for continued training on the datasets of various downstream tasks. The upstream task can be unsupervised, such as masked language modeling [9][16] in natural language processing, or supervised, such as image classification in computer vision [55]. Pre-trained models help downstream tasks achieve better training results, especially when the downstream training data is limited. Meanwhile, compared with training datasets, pre-trained models are usually safer to publish. As a result, thousands of pre-trained models have been published on model hubs such as HuggingFace [1] and PyTorch-Hub [56]. On the other hand, previous work has shown that for the same downstream task, training performance can vary widely across different pre-trained models [3]. Hence, selecting a good pre-trained model from the repository is an important step toward achieving high training performance on the target task.

As model repositories grow, it becomes computationally infeasible to select a good model by fine-tuning every model on the target task, and several methods have been proposed to speed up model selection. We divide existing model selection methods into two categories: light-weight proxy score computation and model selection during fine-tuning. For light-weight proxy score computation, Task2Vec [57] embeds upstream and downstream tasks into the same vector space so that models trained on upstream tasks close to the downstream task can be selected directly; LEEP [11] measures the transferability of a pre-trained model to the target classification task; and [12] builds a KNN classifier on the hidden-layer outputs of pre-trained models to approximate the training effect after fine-tuning. For model selection during fine-tuning, Palette [3] adopts successive halving [56] to filter lower-performing models at early training steps, and Shift [4] builds a cost model to predict the training cost of successive halving and direct fine-tuning. Our two-phase model selection framework can be viewed as combining the advantages of both categories. Since the framework is flexible, other model selection methods can be plugged into the corresponding phase: for example, model performance on benchmark datasets could be measured as in [58] and combined with LEEP [11] in the coarse-recall phase, and multi-model selection methods [3][59][60][61] could be incorporated into the fine-selection phase to achieve high ensemble performance.

In this paper, we also propose clustering models and mining convergence trends to speed up the coarse-recall and fine-selection phases. Both the clustering and the mining depend on the training performance of pre-trained models on benchmark datasets, such as GLUE [5] for natural language processing. Since thousands of datasets are also published online, for example on HuggingFace Datasets [1], methods such as [57][62] could be applied to measure the similarity between task datasets and help build more effective benchmark datasets that cover a wider range of tasks for model selection.

VII Future Work

Future work can be summarized in three aspects. First, the coarse-recall phase aims to keep high-performing pre-trained models among the top-ranked candidates, but a single proxy-score measurement may not be sufficient to return high-performing models across different machine learning tasks; we plan to combine several light-weight proxy tasks to return a high-quality subset of models more robustly. Second, in this paper we build the benchmark datasets empirically. Given the large number of published datasets, we will study data-driven methods to build benchmark datasets that cover more types of machine learning tasks while remaining compact, so that the performance matrix can be maintained more cheaply. Third, since model selection is an important step in the machine learning pipeline, the ability to automatically select a high-performing model enables building the whole pipeline for a new task; we plan to build a data management system that stores and maintains pre-trained models and datasets and supports efficient automatic model selection to help users complete model training for new tasks.

VIII Conclusion

In this paper, we propose a two-phase framework for fast model selection, in which the coarse-recall phase uses light-weight proxy tasks to recall a much smaller number of candidate models and the fine-selection phase fine-tunes only these recalled models to select the best one. To speed up both phases, we build a performance matrix by fine-tuning pre-trained models on benchmark datasets, cluster models based on this matrix to avoid duplicated proxy-score computation in the coarse-recall phase, and mine convergence trends to filter poorly performing models earlier in the fine-selection phase. Experimental results on natural language processing and computer vision tasks demonstrate that the proposed method selects a good model for a new task much faster than baseline methods.

IX Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant No. 62072458 and No. 62062058.

References

  • [1] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Huggingface’s transformers: State-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, 2019.
  • [2] Z. Zhao, Y. Li, C. Hou, J. Zhao, R. Tian, W. Liu, Y. Chen, N. Sun, H. Liu, W. Mao et al., “Tencentpretrain: A scalable and flexible toolkit for pre-training models of different modalities,” arXiv preprint arXiv:2212.06385, 2022.
  • [3] Y. Li, Y. Shen, and L. Chen, “Palette: towards multi-source model selection and ensemble for reuse,” in 2021 IEEE 37th International Conference on Data Engineering (ICDE).   IEEE, 2021, pp. 2147–2152.
  • [4] C. Renggli, X. Yao, L. Kolar, L. Rimanic, A. Klimovic, and C. Zhang, “Shift: an efficient, flexible search engine for transfer learning,” arXiv preprint arXiv:2204.01457, 2022.
  • [5] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” arXiv preprint arXiv:1804.07461, 2018.
  • [6] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” California Institute of Technology, Tech. Rep. CNS-TR-2011-001, 2011.
  • [7] S. Kornblith, J. Shlens, and Q. V. Le, “Do better imagenet models transfer better?” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2661–2671.
  • [8] J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, and N. Smith, “Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping,” arXiv preprint arXiv:2002.06305, 2020.
  • [9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • [11] C. Nguyen, T. Hassner, M. Seeger, and C. Archambeau, “Leep: A new measure to evaluate transferability of learned representations,” in International Conference on Machine Learning.   PMLR, 2020, pp. 7294–7305.
  • [12] C. Renggli, A. S. Pinto, L. Rimanic, J. Puigcerver, C. Riquelme, C. Zhang, and M. Lučić, “Which model to transfer? finding the needle in the growing haystack,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9205–9214.
  • [13] J. A. Hartigan and M. A. Wong, “Algorithm as 136: A k-means clustering algorithm,” Journal of the royal statistical society. series c (applied statistics), vol. 28, no. 1, pp. 100–108, 1979.
  • [14] F. Murtagh and P. Contreras, “Algorithms for hierarchical clustering: an overview,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 86–97, 2012.
  • [15] J.-W. Cui, W.-H. Shi, H.-L. Tao, W. Lu, and X.-Y. Du, “technical report: https://github.com/plasware/two-phase-selection/blob/main/tech-report-137.pdf,” unpublished.
  • [16] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [17] B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, Z. Yan, M. Tomizuka, J. Gonzalez, K. Keutzer, and P. Vajda, “Visual transformers: Token-based image representation and processing for computer vision,” arXiv preprint arXiv:2006.03677, 2020.
  • [18] H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” arXiv preprint arXiv:2106.08254, 2021.
  • [19] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International conference on machine learning.   PMLR, 2021, pp. 10 347–10 357.
  • [20] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, “Metaformer is actually what you need for vision,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 819–10 829.
  • [21] A. Hassani and H. Shi, “Dilated neighborhood attention transformer,” arXiv preprint arXiv:2209.15001, 2022.
  • [22] M.-H. Guo, C.-Z. Lu, Z.-N. Liu, M.-M. Cheng, and S.-M. Hu, “Visual attention network,” arXiv preprint arXiv:2202.09741, 2022.
  • [23] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [24] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Superglue: A stickier benchmark for general-purpose language understanding systems,” arXiv preprint arXiv:1905.00537, 2019.
  • [25] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.   Portland, Oregon, USA: Association for Computational Linguistics, June 2011, pp. 142–150. [Online]. Available: http://www.aclweb.org/anthology/P11-1015
  • [26] Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), A. Moschitti, B. Pang, and W. Daelemans, Eds.   Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1746–1751. [Online]. Available: https://aclanthology.org/D14-1181
  • [27] A. Conneau, R. Rinott, G. Lample, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov, “Xnli: Evaluating cross-lingual sentence representations,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.   Association for Computational Linguistics, 2018.
  • [28] Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela, “Adversarial nli: A new benchmark for natural language understanding,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.   Association for Computational Linguistics, 2020.
  • [29] “Software applications user reviews,” 2017.
  • [30] X. Li and D. Roth, “Learning question classifiers,” in COLING 2002: The 19th International Conference on Computational Linguistics, 2002. [Online]. Available: https://www.aclweb.org/anthology/C02-1150
  • [31] E. Hovy, L. Gerber, U. Hermjakob, C.-Y. Lin, and D. Ravichandran, “Toward semantics-based answer pinpointing,” in Proceedings of the First International Conference on Human Language Technology Research, 2001. [Online]. Available: https://www.aclweb.org/anthology/H01-1069
  • [32] M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, and R. Zamparelli, “A SICK cure for the evaluation of compositional distributional semantic models,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14).   Reykjavik, Iceland: European Language Resources Association (ELRA), May 2014, pp. 216–223. [Online]. Available: http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf
  • [33] P. Malo, A. Sinha, P. Korhonen, J. Wallenius, and P. Takala, “Good debt or bad debt: Detecting semantic orientations in economic texts,” Journal of the Association for Information Science and Technology, vol. 65, 2014.
  • [34] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101 – mining discriminative components with random forests,” in European Conference on Computer Vision, 2014.
  • [35] J. Elson, J. J. Douceur, J. Howell, and J. Saul, “Asirra: A captcha that exploits interest-aligned manual image categorization,” in Proceedings of 14th ACM Conference on Computer and Communications Security (CCS).   Association for Computing Machinery, Inc., October 2007. [Online]. Available: https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/
  • [36] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [37] F. Barbieri, J. Camacho-Collados, L. Neves, and L. Espinosa-Anke, “Tweeteval: Unified benchmark and comparative evaluation for tweet classification,” arXiv preprint arXiv:2010.12421, 2020.
  • [38] A. Williams, N. Nangia, and S. R. Bowman, “A broad-coverage challenge corpus for sentence understanding through inference,” arXiv preprint arXiv:1704.05426, 2017.
  • [39] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth, “Looking beyond the surface: A challenge set for reading comprehension over multiple sentences,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 252–262.
  • [40] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,” in NAACL, 2019.
  • [41] D. S. Kermany, M. Goldbaum, W. Cai, C. C. Valentim, H. Liang, S. L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan et al., “Identifying medical diagnoses and treatable diseases by image-based deep learning,” cell, vol. 172, no. 5, pp. 1122–1131, 2018.
  • [42] J. Yang, R. Shi, D. Wei, Z. Liu, L. Zhao, B. Ke, H. Pfister, and B. Ni, “Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification,” Scientific Data, vol. 10, no. 1, p. 41, 2023.
  • [43] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian conference on computer vision, graphics & image processing.   IEEE, 2008, pp. 722–729.
  • [44] M. A. Lab. (2020, January) Bean disease dataset. [Online]. Available: https://github.com/AI-Lab-Makerere/ibean/
  • [45] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of computational and applied mathematics, vol. 20, pp. 53–65, 1987.
  • [46] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.   Association for Computational Linguistics, 11 2019. [Online]. Available: https://arxiv.org/abs/1908.10084
  • [47] L. Sharma, L. Graesser, N. Nangia, and U. Evci, “Natural language understanding with the quora question pairs dataset,” arXiv preprint arXiv:1907.01041, 2019.
  • [48] A. Warstadt, A. Singh, and S. R. Bowman, “Neural network acceptability judgments,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 625–641, 2019.
  • [49] R. T. McCoy, J. Min, and T. Linzen, “Berts of a feather do not generalize together: Large variability in generalization across models with similar test set performance,” arXiv preprint arXiv:1911.02969, 2019.
  • [50] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
  • [51] M. Assran, M. Caron, I. Misra, P. Bojanowski, F. Bordes, P. Vincent, A. Joulin, M. Rabbat, and N. Ballas, “Masked siamese networks for label-efficient learning,” in European Conference on Computer Vision.   Springer, 2022, pp. 456–473.
  • [52] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, pp. 211–252, 2015.
  • [53] T. Ridnik, E. Ben-Baruch, A. Noy, and L. Zelnik-Manor, “Imagenet-21k pretraining for the masses,” arXiv preprint arXiv:2104.10972, 2021.
  • [54] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, “A survey on deep transfer learning,” in Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part III 27.   Springer, 2018, pp. 270–279.
  • [55] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • [56] K. Jamieson and A. Talwalkar, “Non-stochastic best arm identification and hyperparameter optimization,” in Artificial intelligence and statistics.   PMLR, 2016, pp. 240–248.
  • [57] A. Achille, M. Lam, R. Tewari, A. Ravichandran, S. Maji, C. C. Fowlkes, S. Soatto, and P. Perona, “Task2vec: Task embedding for meta-learning,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6430–6439.
  • [58] X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, C. Riquelme, M. Lucic, J. Djolonga, A. S. Pinto, M. Neumann, A. Dosovitskiy et al., “The visual task adaptation benchmark,” 2019.
  • [59] J.-W. Cui, W. Lu, X. Zhao, and X.-Y. Du, “Efficient model store and reuse in an olml database system,” Journal of Computer Science and Technology, vol. 36, pp. 792–805, 2021.
  • [60] W. Zhang, J. Jiang, Y. Shao, and B. Cui, “Efficient diversity-driven ensemble for deep neural networks,” in 2020 IEEE 36th International Conference on Data Engineering (ICDE).   IEEE, 2020, pp. 73–84.
  • [61] Y.-X. Ding and Z.-H. Zhou, “Boosting-based reliable model reuse,” in Asian Conference on Machine Learning.   PMLR, 2020, pp. 145–160.
  • [62] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang, “Language-agnostic bert sentence embedding,” arXiv preprint arXiv:2007.01852, 2020.

-A MNLI Results

We provide different models' validation results on the MNLI dataset under a different learning rate, 1e-5, in Fig. 8, compared with 3e-5 in Fig. 3. Under the new hyperparameter setting, the training dynamics of the models change: the performance of the top two models no longer declines continuously with further training, suggesting a less severe overfitting issue. This indicates that model training is highly sensitive to hyperparameter settings. We also apply our two-phase model selection method under the new hyperparameters, and both performance and efficiency remain consistent. Although the training process changes, the variation in model performance is not large enough to affect the effectiveness of our method. Therefore, our approach is robust to different hyperparameter settings and applicable across various model training scenarios.

Figure 8: Top-10 models' validation and test results on the MNLI dataset with learning rate 1e-5, as opposed to 3e-5 in Fig. 3.

-B Model Details

The pre-trained models we use are all from HuggingFace's model hub (https://huggingface.co/models). We list the full names of all the NLP and CV models used in our work in Table VIII. Note that we sometimes use shortened model names in the main text to save space by removing the repository prefix. Even after removing the prefix, each name remains unique within our model list, so partial model names can still be used to pinpoint the corresponding model.

All NLP and CV models used in this paper are listed in Table VIII. Every model can be accessed by using "https://huggingface.co/" as the URL prefix.
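For convenience, a shortened name can be resolved back to its full identifier with a sketch like the following (the helper name is illustrative):

```python
def resolve_partial_name(partial, full_names):
    """Resolve a shortened model name from the main text (repository
    prefix removed) back to its full HuggingFace identifier, relying on
    the suffix being unique within our model list."""
    matches = [name for name in full_names
               if name == partial or name.split("/")[-1] == partial]
    assert len(matches) == 1, f"ambiguous or unknown name: {partial}"
    return matches[0]


# e.g. resolve_partial_name("vit-age-classifier", cv_models)
# -> "nateraw/vit-age-classifier"
```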

TABLE VIII: NLP and CV Models
NLP model name CV model name
18811449050/bert_finetuning_test facebook/deit-base-patch16-224
aditeyabaral/finetuned-sail2017-xlm-roberta-base facebook/deit-base-patch16-384
albert-base-v2 facebook/deit-small-patch16-224
aliosm/sha3bor-metre-detector-arabertv2-base facebook/dino-vitb16
Alireza1044/albert-base-v2-qnli facebook/dino-vitb8
anirudh21/bert-base-uncased-finetuned-qnli facebook/dino-vits16
aviator-neural/bert-base-uncased-sst2 facebook/vit-msn-base
aychang/bert-base-cased-trec-coarse facebook/vit-msn-small
bert-base-uncased google/vit-base-patch16-224
bondi/bert-semaphore-prediction-w4 google/vit-base-patch16-384
CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment google/vit-base-patch32-224-in21k
CAMeL-Lab–bert-base-arabic-camelbert-mix-did-nadi lixiqi/beit-base-patch16-224-pt22k-ft22k-finetuned-FER2013-6e-05
classla/bcms-bertic-parlasent-bcs-ter lixiqi/beit-base-patch16-224-pt22k-ft22k-finetuned-FER2013-7e-05
connectivity/bert_ft_qqp-1 lixiqi/beit-base-patch16-224-pt22k-ft22k-finetuned-FER-5e-05-3
connectivity/bert_ft_qqp-17 microsoft/beit-base-patch16-224
connectivity/bert_ft_qqp-7 microsoft/beit-base-patch16-224-pt22k
connectivity/bert_ft_qqp-96 microsoft/beit-base-patch16-224-pt22k-ft22k
dhimskyy/wiki-bert microsoft/beit-base-patch16-384
distilbert-base-uncased microsoft/beit-large-patch16-224-pt22k
DoyyingFace/bert-asian-hate-tweets-asian-unclean-freeze-4 mrgiraffe/vit-large-dataset-model-v3
emrecan/bert-base-multilingual-cased-snli_tr sail/poolformer_m36
gchhablani/bert-base-cased-finetuned-rte sail/poolformer_m48
gchhablani/bert-base-cased-finetuned-wnli sail/poolformer_s36
ishan/bert-base-uncased-mnli shi-labs/dinat-base-in1k-224
jb2k/bert-base-multilingual-cased-language-detection shi-labs/dinat-large-in22k-in1k-224
Jeevesh8/512seq_len_6ep_bert_ft_cola-91 shi-labs/dinat-large-in22k-in1k-384
Jeevesh8/6ep_bert_ft_cola-47 Visual-Attention-Network/van-base
Jeevesh8/bert_ft_cola-88 Visual-Attention-Network/van-large
Jeevesh8/bert_ft_qqp-40 oschamp/vit-artworkclassifier
Jeevesh8/bert_ft_qqp-68 nateraw/vit-age-classifier
Jeevesh8/bert_ft_qqp-9 -
Jeevesh8/feather_berts_46 -
Jeevesh8/init_bert_ft_qqp-24 -
Jeevesh8/init_bert_ft_qqp-33 -
manueltonneau/bert-twitter-en-is-hired -
roberta-base -
socialmediaie/TRAC2020_IBEN_B_bert-base-multilingual-uncased -
Splend1dchan/bert-base-uncased-slue-goldtrascription-e3-lr1e-4 -
XSY/albert-base-v2-imdb-calssification -
Guscode/DKbert-hatespeech-detection -

-C Dataset Details

NLP datasets and CV datasets are listed in Table IX. Some datasets contain multiple subsets. All datasets are available using "https://huggingface.co/" as the URL prefix. GLUE and SuperGLUE are the most common NLP benchmark datasets, and CIFAR-10 and MNIST are the most common CV benchmark datasets. The other NLP datasets are described below:

  • LysandreJik/glue-mnli-train This dataset contains a labelled version of MNLI. The original MNLI dataset in GLUE does not have labels, which are necessary for our experiment. The task is to predict the relation between a premise and a hypothesis: entailment, contradiction, or neutral. The labels of this dataset are balanced.

  • SetFit/qnli This dataset contains a labelled version of QNLI. The original QNLI dataset in GLUE does not have labels, which are necessary for our experiment. The task is to predict whether a paragraph contains the answer to a question. The labels of this dataset are balanced.

  • xnli This dataset contains part of the MNLI dataset translated into different languages. The labels of this dataset are balanced.

  • stsb_multi_mt The task is to score the similarity between two sentences on a scale of 0 to 5. The labels of this dataset are not balanced.

  • anli The task is the same as for MNLI; however, the dataset is collected through an adversarial procedure. The labels of this dataset are not balanced.

  • tweet_eval This is a sentiment analysis task. The dataset is collected from Twitter. The labels of this dataset are not balanced.

  • paws This is a paraphrase identification task. The labels of this dataset are not balanced.

  • financial_phrasebank This is a sentiment analysis task in the finance domain. The dataset is collected from financial news. The labels of this dataset are not balanced.

  • yahoo_answers_topics This is a topic classification task. The dataset is collected from Yahoo Answers. The labels of this dataset are balanced.

Other CV datasets are described below:

  • food101 This dataset contains 101 kinds of food to be classified. The image sizes vary. The labels of this dataset are balanced.

  • nelorth/oxford-flowers This dataset contains 102 kinds of flowers to be classified. The image sizes vary. The labels of this dataset are not balanced.

  • Matthijs/snacks This dataset contains 20 kinds of snacks to be classified. The image sizes vary. The labels of this dataset are slightly unbalanced.

  • beans This dataset contains 3 kinds of leaves to be classified. The image sizes are the same. The labels of this dataset are balanced.

  • cats_vs_dogs This dataset contains images of cats and dogs and is a subset of the Asirra dataset. The image sizes vary. The labels of this dataset are balanced.

  • trpakov/chest-xray-classification This dataset contains chest X-ray images. The image sizes are the same. The labels of this dataset are not balanced.

  • alkzar90/CC6204-Hackaton-Cub-Dataset This dataset contains images of birds. The image sizes vary. The labels of this dataset are not balanced.

  • albertvillanova/medmnist-v2 This dataset contains biomedical images. The image sizes are the same. The labels of this dataset are not balanced.

TABLE IX: NLP and CV Datasets
NLP dataset name CV dataset name
glue food101
super_glue nelorth/oxford-flowers
LysandreJik/glue-mnli-train Matthijs/snacks
SetFit/qnli beans
xnli cats_vs_dogs
stsb_multi_mt trpakov/chest-xray-classification
anli cifar10
tweet_eval MNIST
paws alkzar90/CC6204-Hackaton-Cub-Dataset
financial_phrasebank albertvillanova/medmnist-v2
yahoo_answers_topics -

-D Experiment on the Number of Dimensions for Max Average Error

As discussed in Eq. 1 and Section V-B, we use the top-k maximum average error to measure model similarity, and the parameter k may influence the performance of the model selection algorithm. We therefore test different values of k while keeping everything else fixed. Given the number of benchmark datasets, we choose k = 5, 10, 15 for the NLP clustering evaluation and k = 3, 4, 5 for the CV clustering evaluation. The results are shown in Table X. The influence of k is limited, since the silhouette coefficient fluctuates within an acceptable range. Considering that k in Eq. 1 should filter noise while retaining valid information, we choose k = 5 for both tasks.
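The sweep over k can be sketched as follows, assuming the top-k maximum average error of Eq. 1 as the pairwise distance and hierarchical clustering with an illustrative distance threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score


def silhouette_for_k(perf, k, threshold=0.05):
    """Cluster the performance matrix with the top-k maximum average error
    distance (our reading of Eq. 1) and report the silhouette coefficient;
    the distance threshold and linkage are illustrative."""
    n = perf.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            gaps = np.sort(np.abs(perf[i] - perf[j]))
            dist[i, j] = dist[j, i] = float(gaps[-k:].mean())
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=threshold, criterion="distance")
    return silhouette_score(dist, labels, metric="precomputed")


# Sweep the settings reported in Table X:
# for k in (5, 10, 15):   # NLP; use (3, 4, 5) for CV
#     print(k, silhouette_for_k(perf_matrix, k))
```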

TABLE X: Parameter k Selection
                        NLP                    CV
k                       5      10     15       3      4      5
Silhouette Coefficient  0.543  0.503  0.535    0.850  0.828  0.821

-E Model Cards

A model card is shown in Fig. 9. A model card contains a general description of the model, such as its structure and training information.

Figure 9: Model card of bert-base-uncased-mnli. Each model on HuggingFace has a model card to describe the model.

-F K-means Clustering Results

The result of K-means clustering is shown in Table XI; it corresponds to Table II in Section V-B (Model Clustering), where we explain the hierarchical clustering result in detail and conclude that it is effective, since in-cluster models share the same model structure or training dataset and the silhouette coefficient is high. Here we present the K-means result to further support this conclusion. For both NLP and CV, the K-means clusters show weaker connections among in-cluster models. In the NLP part, the two largest clusters, C_2 and C_8, mix models with different structures and training datasets. In the CV part, C_6 and C_7 are cross-mixed, and the largest cluster, C_4, is consistent in neither model structure nor training dataset. We therefore adopt hierarchical clustering as the main method in this paper.
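For reference, a minimal sketch of this K-means baseline is shown below, assuming K-means is run directly on the rows of the benchmark performance matrix with Euclidean distance; the number of clusters and the random seed are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_model_clusters(perf, n_clusters=10, seed=0):
    """K-means baseline of Table XI: cluster models by their benchmark
    performance vectors with Euclidean distance, in contrast to the
    hierarchical clustering with the Eq. 1 distance used in the main text."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(np.asarray(perf, dtype=float))
```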

TABLE XI: Model Clustering Results Using K-means
Model Clusters of Natural Language Processing
Cluster Size Pre-trained Models
C_1   2  gchhablani–bert-base-cased-finetuned-rte, anirudh21–bert-base-uncased-finetuned-qnli
C_2   5  Jeevesh8–bert_ft_cola-88, DoyyingFace–bert-asian-hate-tweets-asian-unclean-freeze-4, bert-base-uncased, aditeyabaral–finetuned-sail2017-xlm-roberta-base, Jeevesh8–512seq_len_6ep_bert_ft_cola-91
C_3   2  manueltonneau–bert-twitter-en-is-hired, aychang–bert-base-cased-trec-coarse
C_4   2  XSY–albert-base-v2-imdb-calssification, distilbert-base-uncased
C_4   4  ishan–bert-base-uncased-mnli, Alireza1044–albert-base-v2-qnli, albert-base-v2, Jeevesh8–feather_berts_46
C_5   2  CAMeL-Lab–bert-base-arabic-camelbert-mix-did-nadi, aliosm–sha3bor-metre-detector-arabertv2-base
C_6   3  socialmediaie–TRAC2020_IBEN_B_bert-base-multilingual-uncased, jb2k–bert-base-multilingual-cased-language-detection, emrecan–bert-base-multilingual-cased-snli_tr
C_7   2  dhimskyy–wiki-bert, bondi–bert-semaphore-prediction-w4
C_8   5  Jeevesh8–init_bert_ft_qqp-33, Jeevesh8–bert_ft_qqp-68, Jeevesh8–bert_ft_qqp-40, connectivity–bert_ft_qqp-1, Jeevesh8–bert_ft_qqp-9
C_9   4  connectivity–bert_ft_qqp-96, connectivity–bert_ft_qqp-7, connectivity–bert_ft_qqp-17, Jeevesh8–init_bert_ft_qqp-24
C_10  2  Splend1dchan–bert-base-uncased-slue-goldtrascription-e3-lr1e-4, Jeevesh8–6ep_bert_ft_cola-47
Model Clusters of Computer Vision
Cluster Size Pre-trained Models
C_1  6  shi-labs/dinat-large-in22k-in1k-224, shi-labs/dinat-large-in22k-in1k-384, microsoft/beit-base-patch16-224-pt22k-ft22k, lixiqi/beit-base-patch16-224-pt22k-ft22k-finetuned-FER2013-7e-05, lixiqi/beit-base-patch16-224-pt22k-ft22k-finetuned-FER2013-6e-05, lixiqi/beit-base-patch16-224-pt22k-ft22k-finetuned-FER-5e-05-3
C_2  2  nateraw/vit-age-classifier, facebook/dino-vitb16
C_3  3  sail/poolformer_m48, sail/poolformer_m36, sail/poolformer_s36
C_4  7  facebook/vit-msn-small, facebook/vit-msn-base, facebook/deit-base-patch16-384, google/vit-base-patch32-224-in21k, Visual-Attention-Network/van-large, facebook/deit-base-patch16-224, facebook/dino-vits16
C_5  4  Visual-Attention-Network/van-base, microsoft/beit-large-patch16-224-pt22k, facebook/deit-small-patch16-224, shi-labs/dinat-base-in1k-224
C_6  2  microsoft/beit-base-patch16-384, google/vit-base-patch16-384
C_7  2  microsoft/beit-base-patch16-224, google/vit-base-patch16-224