ADs: Active Data-sharing for Data Quality Assurance in Advanced Manufacturing Systems

Yue Zhao, Yuxuan Li, Chenang Liu, Yinan Wang

Yue Zhao and Yinan Wang are with the Department of Industrial and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY, 12180. E-mail: {zhaoy23, wangy88}@rpi.edu

Yuxuan Li and Chenang Liu are with the School of Industrial Engineering and Management, Oklahoma State University, Stillwater, OK, 74078. E-mail: {yuxuan.li, chenang.liu}@okstate.edu

Manuscript received XXX; revised YYY. (Corresponding Author: Yinan Wang)
Abstract

Machine learning (ML) methods are widely used in manufacturing applications and usually require a large amount of training data. However, data collection demands extensive cost and time investment in manufacturing systems, and data scarcity is common. With the development of the industrial internet of things (IIoT), data-sharing is widely enabled among multiple machines with similar functionality to augment the dataset for building ML models. Although the machines are designed similarly, distribution mismatch inevitably exists in their data due to different working conditions, process parameters, measurement noise, etc. However, the effective application of ML methods is built upon the assumption that the training and testing data are sampled from the same distribution. Thus, an intelligent data-sharing framework is needed to ensure the quality of the shared data such that only beneficial information is shared to improve the performance of ML methods. In this work, we propose an Active Data-sharing (ADs) framework to ensure the quality of the shared data among multiple machines. It is designed as a self-supervised learning framework integrating the architectures of contrastive learning (CL) and active learning (AL). A novel acquisition function is developed for active learning by integrating an information measure that benefits the downstream task and a similarity score for data quality assurance. To validate the effectiveness of the proposed ADs framework, we collected real-world in-situ monitoring data from three 3D printers, two of which share identical specifications while the third is different. The results demonstrate that our ADs framework intelligently shares monitoring data between the identical machines while excluding data points from the different machine when training ML methods. With the high-quality augmented dataset generated by our proposed framework, the ML method achieves an accuracy of 95.78% while utilizing only 26% labeled data, an improvement of 1.41% over benchmark methods that used 100% labeled data.

Note to Practitioners

This paper is motivated by the need to share data across machines or processes for building machine learning models and by the widespread issue of low-quality data. Low-quality data here refers to data samples collected from machines/processes different from the target one. When building a machine learning model for a target machine/system in a manufacturing setting, such low-quality data will degrade the model performance if they are selected and shared for training. This hinders the direct application of active learning to data sharing, as the classic setting of active learning does not consider the impact of data quality on model performance. The objective of this paper is to develop an intelligent data-sharing framework that simultaneously selects the most informative data points for the downstream tasks and mitigates the impact of low-quality data. We collected real-world in-situ monitoring data of the same additive manufacturing process from three different machines, two of which are more similar to each other than to the third. The proposed method is applied to train an anomaly detection model for those two similar machines, with the entire data pool from all three machines available for selection and annotation. The results demonstrate that our proposed method outperforms the benchmark methods while requiring only 26% of the labeled training samples. In addition, all selected data samples come from the machines with similar conditions, while the data from the dissimilar machine are prevented from misleading the training.

Index Terms:
Multi-objective Optimization, Active Learning, Contrastive learning, Data-sharing, Data Quality Control

I Introduction

Supervised learning has been extensively applied to a wide spectrum of downstream tasks in advanced manufacturing systems [1]. For example, a machine learning (ML) model might be used to detect manufacturing faults or identify anomalous sensor readings [2]. The training process of supervised learning requires large quantities of annotated data to achieve good performance on the required task. Unfortunately, in manufacturing systems, it is often impossible to collect a large amount of annotated data due to the high costs of labor, time, and investment, which leads to a shortage of both overall data samples and annotated samples for supervised learning. Therefore, data-sharing is proposed as a solution that shares data among multiple manufacturing processes with similar functionality, which naturally augments the amount of data available for the ML model [3].

Despite the benefits of data-sharing, there are two gaps when applying it in practice. First, although the shared data increase the size of the data pool, they do not necessarily include the most informative subset of data that benefits the model performance on the downstream task, given a limited budget for data annotation. Second, an important assumption for ensuring the effectiveness of ML methods is that the training and testing data come from the same distribution [4]. However, distribution mismatch naturally exists among the data collected from different manufacturing processes, or even from the same process running on slightly different machines. Although they serve similar functionalities, different processes exhibit both systematic and stochastic differences in working conditions, process parameters, sensors, machine specifications, etc. Therefore, the objective of data-sharing in advanced manufacturing can be summarized as selecting a subset of data samples across multiple manufacturing processes or machines sharing an identical underlying distribution to benefit the performance of downstream tasks under a given annotation budget. In this work, anomaly detection is selected as the downstream task, which is defined as identifying normal or abnormal working conditions from multivariate time-series in-situ sensor data collected from different manufacturing processes. In the following description, we use the term target distribution to refer to the data distribution that the ML methods are designed to model.

The concept of active learning (AL) emerged as a plausible solution to the first gap by prioritizing important data to cut down on annotation costs. Generally, AL identifies a much smaller portion of the data pool for annotation (referred to as the query set) that maximizes the information gain of the ML model. The idea is that training the model on the annotated subset of data achieves a performance comparable to that of a model trained on the full dataset. This is usually conducted by quantifying the information measure of the unlabeled data for the ML model. AL techniques have generally achieved great success in these limited-budget tasks and, in some cases, even improved the performance of models with a much smaller amount of training data [5]. Despite their effectiveness, most existing AL methods also assume that all the data points are sampled from the same underlying distribution, which does not always hold in practice, especially in manufacturing systems. This is problematic because, although the most informative samples would still be selected by the AL method, there is no guarantee that these samples are representative of the target distribution, thus hindering the performance of a supervised model trained on the selected samples. Therefore, it is essential to consider the issue of distribution mismatch when applying AL methods to applications in manufacturing systems.

The distribution mismatch between training and testing data has also been observed in various machine learning applications, including classification, regression [7], etc. Du et al. [6] note several typical ML scenarios with this issue, e.g., medical diagnosis involving unseen lesions and house annotation in remote sensing images containing numerous natural sceneries, where the distribution mismatch degrades model performance. Applying ML in these situations is referred to as learning under class distribution mismatch, which is naturally a challenging task [8, 9, 10]. Therefore, various methods have been developed to resolve the distribution mismatch issue from different perspectives. One popular line of research attacks this issue by correctly detecting the data samples not belonging to the target distribution, which is formally formulated as out-of-distribution (OOD) detection [11, 12, 13]. The idea is to passively eliminate the impact of OOD data by filtering them out in the testing phase. Current OOD detection methods are usually built upon two assumptions: (1) the distribution mismatch mainly exists in the label space, which usually refers to a novel class of object (not existing in the training dataset) appearing in the testing phase of a classification task; this phenomenon is demonstrated in Figure 1(a), in which novel classes in the cream background might appear in the testing phase; and (2) a "clean" dataset following the target distribution is available when training the ML model. Therefore, in the classification task, the problem of OOD detection is usually formulated as identifying the data samples that do not belong to any class seen in training. However, these two assumptions do not hold when sharing data among manufacturing processes because (1) the distribution mismatch is more likely to exist in the input space due to the systematic and stochastic differences among manufacturing processes, which is demonstrated by the gap between the input distributions shown as blue and orange curves in Figure 1(b); and (2) a "clean" dataset following the target distribution is usually unavailable for training due to the data scarcity of each single manufacturing process. Consequently, a new paradigm is needed to resolve the challenge of distribution mismatch in data-sharing during model training.

Figure 1: (a) An instance of class distribution mismatch in the output space [6]. (b) An instance of distribution mismatch in the input space.

With emerging attention to mitigating the issue of distribution mismatch during the training process, researchers have started to tackle this challenge within the cycle of the AL method. The idea is to actively select the subset of data samples from a general data pool to ensure they both follow the same target distribution and benefit the downstream task. Failure-averse AL was first proposed to incorporate physical principles into preparing the data samples for a regression task in the manufacturing system [14]. Chattopadhyay et al. proposed a novel criterion that achieves good generalization performance of a classifier by selecting a set of query samples to minimize the difference in distribution between the labeled and the unlabeled data [15]. An integrated information measure is proposed to score and rank the unlabeled data points such that the top candidates both benefit the downstream regression task and follow the same physical principle. It provides a way to intuitively incorporate the measure of target distribution into the information measure in AL. However, it still cannot be directly applied in a general data-sharing scenario, as the target population usually has no closed-form explicit expression that can be directly exploited. With the recent development of self-supervised learning, contrastive learning (CL) offers a solution for evaluating the similarity of features extracted from different input data [6]. Dissimilar features naturally correspond to data samples from different input distributions. As a self-supervised learning method, CL does not require label information; it is trained by forming data samples into positive pairs (anchor-positive, i.e., following similar distributions) and negative pairs (anchor-negative, i.e., following dissimilar distributions). The intuitive idea of CL is to train the feature extractor by encouraging it to pull positive pairs close to each other in the feature space while pushing negative pairs away from each other. In that sense, a model trained within the CL framework is capable of embedding data samples with similar distributions close to each other in the feature space, while those with dissimilar distributions are embedded farther away. Thus, it models the distributional differences present in the data. It is worth noting that CL still requires some data samples from the target distribution to initiate the training but is much less data-demanding compared with supervised learning methods due to the advantage of the self-supervised scheme [16].

Du et al. [6] first proposed incorporating CL into the cycle of AL to mitigate the distribution mismatch issue in the output space shown in Figure 1(a), which is referred to as class distribution mismatch in the classification problem. However, their approach is not directly applicable to our scenario. As discussed above, the distribution mismatch in data-sharing mainly exists in the input space. Using anomaly detection as an example, the output space across different manufacturing processes uniformly contains two classes, normal and abnormal, while the monitoring data representing the same working condition (i.e., either normal or abnormal) might follow different distributions across different manufacturing processes due to distribution mismatch in the input space.

To fill in the research gap, the proposed framework, termed Active Data-sharing (ADs), views the problem as a multi-objective AL where the first objective is to select highly uncertain samples for the anomaly detection task and the second is to ensure that the selected data matches the target distribution. The approach is based primarily on the observation that AL and CL can achieve these objectives independently. Therefore, combining them might result in a feasible joint solution. That is, in the context of multi-objective optimization (MOO), the resulting solution would be Pareto optimal. In ADs, each objective is optimized by a separate model, wherein the first involves uncertainty sampling in the form of entropy of classifier predictions, and the second utilizes a CL network that learns similarity features within the data to distinguish data from similar and dissimilar machines. These models are trained on a small set of annotated data as initialization and then applied to the unlabeled data to form two sets of scores corresponding to each objective. A joint query strategy is then used to integrate the two scores into a joint set of scores, which is used to select the best samples for the human annotator. Thus, the proposed framework allows high-quality data-sharing that satisfies both objectives.

The major contributions of the work are: (1) A novel Active Data-sharing (ADs) framework is proposed to ensure the quality of industrial data-sharing when subject to data scarcity, distribution mismatch, and low annotation budget. (2) A novel acquisition function is developed for AL under distribution mismatch in the input space by integrating the informativeness and distribution similarity scores. (3) The effectiveness of the framework is evaluated on real-world in-situ monitoring data from additive manufacturing (AM) processes.

The paper is structured in the following manner. Section II reviews the literature related to AL, CL, and AL under distribution mismatch, which refers to applications where AL in isolation fails to produce desirable results. Section III elaborates on the proposed ADs framework for effective annotation and data-sharing under distribution mismatch. Section III-E presents a formal theoretical background and explanation for the mathematical validity of ADs. Section IV evaluates the proposed method in various settings with a comprehensive case study involving a real-world industrial additive manufacturing process by using in-situ monitoring data recorded from three AM processes.

II Literature Review

This section summarizes the literature and recent work on the key domains of AL, CL, and constrained AL, all of which were critical to developing ADs.

II-A Active Learning (AL)

AL offers various strategies for reducing the annotation budget by actively selecting the most informative or valuable data points and feeding them to the annotator for label acquisition [17]. These strategies fall into three categories: query synthesis (conventional [18, 19, 20], generative [21, 22, 23]), sequential sampling [24, 25], and pool-based sampling [26, 27].

The problem presented in this work belongs to pool-based sampling wherein the unlabeled data pool is the collected in-situ monitoring data from three AM processes, and the goal is to evaluate the information measure over the entire unlabeled data pool prior to querying the candidate samples for sharing. Almost all pool-based methods score the samples based on their informativeness. The key idea of these approaches is to select only the most informative subset of samples so as to maximize the information gain for the model. Examples include uncertainty sampling approaches [28, 29, 30, 31, 32, 33, 34], variance reduction [26, 27], query-by-committee [35, 36, 37], etc. A sample’s uncertainty may be quantified in several ways, e.g., marginal uncertainty [38] or, most popularly, entropy [39], which utilizes the posterior probability of the model’s prediction on all samples to select the best one. This is particularly intuitive if the data can be represented as a probabilistic distribution. Recently, Sinha et al. [33] employed variational autoencoders to determine uncertainty by comparing the distribution of the annotated and unlabeled data. A task-agnostic approach by Yoo et al. [34] proposes to use a parametric loss predictor module to predict the loss of the unlabeled sample and, therefore, measure its uncertainty.

Another class of pool-based AL methods is based on representative measures, such as diversity [40, 41, 42], density [43, 44, 45] or a combination of the two [46]. Diversity methods favor exploration and prefer the selection of dissimilar instances, whereas density-based approaches assume that either dense or sparse groups of data points contain the most information. Therefore, they prefer selecting instances that are either similar or dissimilar to several other instances [47]. Hierarchical clustering [43] and density estimation methods [44, 45] are commonly used for density-based approaches. Amongst diversity-based approaches, the classical core-set approach [48] is the most popular – it aims to identify a diverse set of samples, i.e., the core or cover set, by minimizing the distance between each sampled point and the remaining points. The expected result is that a model trained on the core-set is at least equivalent to a model trained on the complete data in terms of performance. The core-set was first adapted to batched inputs for convolutional neural networks (CNNs) by Sener et al. [41]. They proposed a robust k-center algorithm operating on Euclidean distances of the last layer’s feature vectors for a batch of input images. Kim et al. [42] propose a density-aware core-set approach (DACS) to select diverse samples from locally sparse regions, which is useful if the sparse regions contain informative samples compared to densely grouped instances.

An inherent flaw of the representative approaches is that all samples are treated equally in terms of informativeness, which is not necessarily the case. It is also possible, however, that a sample’s representativeness is related to its importance, in which case the representative measures might be more effective. Recently, it has been verified that a combination of both informative and representative approaches results in better performance [44, 49, 50, 51, 52, 53].

However, the underlying assumption of active-learning-based methods, as well as most other methods focusing on maximizing information gain, is that the distribution of the data is the same whether the samples are annotated or not. Upon violation of this assumption, the performance of these methods deteriorates sharply [6]. Recall that mitigating class distribution mismatch in data-sharing tasks is a central focus of this work. Therefore, even though pool-based uncertainty sampling seems to be a viable solution for at least picking informative data points, it is not advisable to apply it unless the mismatched data is somehow identified and excluded.

II-B Contrastive Learning (CL)

CL is a self-supervised, task-agnostic technique for learning effective high-dimensional feature representations of data, usually to the benefit of a downstream task. Practically, CL improves the performance of a model by exposing it to pairs of annotated negative and positive samples in addition to actual samples (anchors) to produce high-dimensional embeddings of the data, with the loss function [54, 55] designed to minimize anchor-positive distances and maximize anchor-negative distances. The labels positive and negative are termed the similarity labels in this context. In practice, CL has been successfully applied to various tasks related to vision, natural language processing, etc. It greatly improves model performance under distribution mismatch or when the inputs share similar features but belong to distinct classes [56].

Owing to its widespread success in self-supervised learning and its task-independent nature, the technique has been applied to several domains and continues to receive significant attention and development. Transformations of the input data to generate positive augmentations of anchors are crucial, and the type of transformation(s) used should be based on the format of the data. For example, SimCLR [57] composes several different types of augmentations via local and global image transformations such as crops, rotations, cutouts, blurs, color distortions, etc., and additionally provides an account of augmentations that, when considered positive, seem to worsen performance. Building on these findings, CSI [58] proposed to term these problematic augmentations distribution-shifting transformations and instead feed them as negative samples to the CL framework. This led to further study on learning shift-invariant feature representations by integrating CL and clustering approaches [59, 60]. The fully unsupervised Winner-Take-All (WTA) autoencoder architecture [61] enforces sparsity in addition to shift-invariance of the learned representations, which can be used to reliably generate in-distribution augmentations. Furthermore, the WTA architecture allows joint back-propagation of gradients through both the encoder and decoder paths, which results in quicker and more stable training.

It is worth noting that, in the absence of similarity labels, CL becomes susceptible to sampling bias, which leads to the inclusion of false negatives because, in the simplest case, negative samples for an anchor are randomly picked from the dataset. While similarity labels are assumed available for a subset of the data in ADs, Chuang et al. [62] and several others [63, 64] propose debiased CL frameworks for specific downstream tasks that focus on minimizing sampling bias in fully unsupervised CL.

II-C Constrained Active Learning

AL approaches generally suffer from the drawback that they are inherently unaware of the underlying distribution of the data and, therefore, cannot be used when the data suffer from class distribution mismatch. At the same time, applying CL approaches to the described problem is not a plausible solution either because (1) being task-agnostic, they do not extract features specifically for improving the performance on the downstream task, and (2) human annotation can resolve the issue of lacking labels but cannot identify distribution mismatch, whereas CL can.

Several papers in recent years have proposed and developed constrained AL techniques in order to leverage the benefits of AL within certain feasible regions. Constrained AL can mitigate the decay in downstream task performance that arises when directly applying AL [6, 65, 14]. Du et al. [6] propose CCAL, an AL framework with a joint query strategy wherein each sample is scored on the basis of two separate scores that are then combined into a joint score used to select the best samples. They also prove a tight upper bound on the consequent error function, termed the CCAL error, and compare the results for the task of image classification under distribution mismatch with two other semi-supervised learning methods (DS3L [10] and UASD [9]). It should be noted that semi-supervised learning differs from AL in that it directly utilizes the unlabeled data during training, and no extra annotation effort is applied to the unlabeled data.

Lee et al. [14] explored incorporating physical constraints into AL to accommodate a more practical context, such as an engineering system, where physical constraints are present and their violation could result in fatal system failure. The authors suggest the existence of a safe region in the sample space and attempt to make AL focus on exploring the safe region. At the same time, it is desirable to explore the safe region as thoroughly as possible with limited data samples. The authors propose the PhyCAL framework, utilizing safe variance reduction and safe region expansion to trade off between information maximization and safe exploration of the design space.

The ADs framework proposed in this paper expands on these methods. More specifically, (1) instead of AL for image classification under distribution mismatch in the output space, ADs focuses on resolving the distribution mismatch in the input space when sharing data over multiple manufacturing processes, and (2) a new joint query strategy is proposed to select samples that simultaneously follow the target distribution and benefit the downstream task. In addition, the convergence of the joint function to the optimal point is proven using methods from convex analysis and Pareto optimality.

III Methodology

Figure 2: Architecture of active data sharing (ADs).

This section comprehensively explores the proposed ADs framework to solve the problem of data sharing among multiple manufacturing processes to jointly select the most informative samples for the downstream task and mitigate distribution mismatch over the input space. First, the general notations and nomenclature are introduced. Then, the calculation of the individual scores and their integration into the joint score is explained. Finally, a theoretical investigation of the Pareto optimality of the proposed method is conducted through convex analysis of the loss functions with the proposed individual and joint acquisition functions. For brevity, the data pre-processing stage and data augmentation model are left for Section IV, where data management is more relevant.

This study involves data-sharing of in-situ monitoring data among multiple manufacturing processes assigned to produce identical objects. Here, we assume that both small and large machines are used and that this difference in machines is the main source of the differences among the manufacturing processes. Monitoring data collected from the small machines, which share the same manufacturer and model number, follow similar distributions, while data from the large machine, which has a different manufacturer and model number, are recognized as following a dissimilar distribution. Consequently, a distribution mismatch is observed in the input space. The downstream task of the study is anomaly detection, which distinguishes between normal and abnormal working conditions of the manufacturing process.

Consider that machines $S_1$ and $S_2$ are similar, providing data that follows a similar distribution. Suppose the data from the dissimilar machine $L_1$, with a mismatched distribution relative to $S_1$ and $S_2$, is additionally introduced to form a combined dataset $\mathcal{D}=(S_1,S_2,L_1)$. Let $\mathcal{L}=\{\text{normal},\,\text{abnormal}\}$ be the labels or classes of each sample associated with the downstream task, regardless of which machine they were sourced from. Let $\mathbf{X}\in\mathbb{R}^{D\times 3}$ be the matrix representation storing all samples in $\mathcal{D}$. Then, $\mathbf{X}^{S_1}$, $\mathbf{X}^{S_2}$, and $\mathbf{X}^{L_1}$ are all disjoint sub-matrices representing the data from each machine. As they are disjoint, the concatenation and disjoint union (or logical XOR) of any combination of these subsets are equivalent, i.e., $\mathbf{X}^{S_1}\frown\mathbf{X}^{S_2}=\mathbf{X}^{S_1}\oplus\mathbf{X}^{S_2}$, etc. Finally, suppose that a human annotator labels a small portion of these samples as normal or abnormal. Thus, there is a small annotated pool of samples in the form of a matrix $\mathbf{D}_{\text{label}}$ and a large pool of unlabeled data $\mathbf{D}_{\text{unlabel}}=\mathbf{X}\setminus\mathbf{D}_{\text{label}}$, such that $|\mathbf{D}_{\text{unlabel}}|\gg|\mathbf{D}_{\text{label}}|$, i.e., the unlabeled portion is much larger than the annotated portion.

The purpose of ADs is to extend AL to form a query set $\mathbf{D}^{Q}$ from the unlabeled data pool $\mathbf{D}_{\text{unlabel}}$, such that each queried sample $\mathbf{x}_i^{Q}$ is not only among the most informative samples in $\mathbf{D}_{\text{unlabel}}$ but also has a high likelihood of belonging to the similar-machine data:

$$\forall\,\mathbf{x}_i^{Q}\in\mathbf{D}^{Q}:\ \underbrace{\text{entropy}(\mathbf{x}_i^{Q})\to 1}_{\text{high uncertainty}}\;\wedge\;\underbrace{\mathbf{x}_i^{Q}\in\left(\mathbf{X}^{S_1}\oplus\mathbf{X}^{S_2}\right)}_{\text{high similarity}}$$

In order to satisfy both objectives, the joint query strategy employed in ADs consists of integrating two separately calculated scores, termed the similarity and uncertainty scores, to form a joint acquisition function $\mathbf{J}$, which scores the unlabeled data based on how well they satisfy both objectives simultaneously. The individual scores are computed based on the feature representations obtained by two separate models. Specifically, the similarity score is obtained via the feature vector extracted from the CL model, and the uncertainty score is obtained via the current anomaly detection model.

For the subsequent subsections, let $Z_s(\cdot)$ and $Z_u(\cdot)\in[0,1]$ denote forward passes through the CL and uncertainty sampling networks that generate feature vectors, and let $d_l$ and $d_u$ be the number of samples in the labeled and unlabeled data pools, respectively.
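To make the notation concrete, the following minimal sketch (in Python/NumPy) assembles a toy version of the combined pool and the labeled/unlabeled split described above; the array sizes, random placeholder signals, and the 60-sample initial annotation budget are hypothetical choices for illustration only.

```python
import numpy as np

# Minimal sketch of the notation above, with hypothetical sample counts.
# Each row of X is one in-situ monitoring sample (three channels here).
X_S1 = np.random.randn(400, 3)    # small machine S1
X_S2 = np.random.randn(400, 3)    # small machine S2
X_L1 = np.random.randn(600, 3)    # dissimilar large machine L1

X = np.concatenate([X_S1, X_S2, X_L1], axis=0)    # combined pool D

# A small initial portion is annotated; the rest stays unlabeled.
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(X), size=60, replace=False)
unlabeled_idx = np.setdiff1d(np.arange(len(X)), labeled_idx)

D_label, D_unlabel = X[labeled_idx], X[unlabeled_idx]
d_l, d_u = len(D_label), len(D_unlabel)           # d_u >> d_l
```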

III-A Learning Similarity Features

The CL model is trained to quantitatively evaluate the similarity between data from the target distribution ($S_1\cup S_2$) and data from a different distribution ($L_1$). In this sense, it can be further exploited to facilitate the objective that queried samples $\mathbf{D}^{Q}$ from the unlabeled pool $\mathbf{D}_{\text{unlabel}}$ closely match the target distribution of data $\mathbf{X}^{S_1}\oplus\mathbf{X}^{S_2}$. In our application, it indicates that for each cycle of AL, all the selected samples would be from similar machines. This naturally leverages the fact that the distribution of in-situ monitoring data varies slightly between the similar agents $(S_1,S_2)$, but considerably in the case of $(S_1,L_1)$ or $(S_2,L_1)$.

To prepare positive and negative pairs of samples for training purposes, the initially sampled and annotated data from $S_1\cup S_2$ are fed into a data augmentation model (trained as part of the preliminary processing detailed in Section IV-B) to generate two positive augmentations based on the augmentation Equations (11), (12). These augmentations are used as the positive samples, while the original data sample is kept as the anchor. Similarly, the initially annotated samples from $L_1$ are randomly chosen as negative samples.

With the prepared positive and negative pairs of samples, the triplet loss [55] is used to train the CL model to better differentiate samples belonging to different distributions. Its input consists of positive and negative samples for the same anchor, where the anchor is a data sample from the target distribution. Additionally, a distance function must be defined to quantify the similarity between positive (anchor versus positive sample) and negative (anchor versus negative sample) pairs. Regardless of the choice of distance function, the triplet loss is optimized toward maximizing the similarity of the positive pair and minimizing the similarity of the negative pair. The expression of the triplet loss is given as follows:

$$l_{\text{triplet}}(\mathbf{x}_a,\mathbf{x}_p,\mathbf{x}_n)=\max\left(f_d(\mathbf{x}_a,\mathbf{x}_p)-f_d(\mathbf{x}_a,\mathbf{x}_n)+\epsilon,\,0\right),\qquad(1)$$

where $\mathbf{x}_p$ is a positive sample, $\mathbf{x}_a$ is the anchor, and $\mathbf{x}_n$ is a negative sample; $f_d$ is a general distance function that can be applied to each pair; and $\epsilon$ is a small positive margin. In this work, cosine similarity is used as the distance function. The triplet loss in Equation (1) for samples $\mathbf{x}_j\in\mathbf{D}_{\text{label}}$, where $j=\{a,p,n\}$, is then:

$$l_{\text{triplet}}(\mathbf{x}_a,\mathbf{x}_p,\mathbf{x}_n)=\max\left(\cos(\mathbf{x}_a,\mathbf{x}_n)-\cos(\mathbf{x}_a,\mathbf{x}_p)+\epsilon,\,0\right)\qquad(2)$$

Assuming a batch of data samples $\mathcal{B}=\{\mathbf{x}_i\}_{i=1}^{B}$, where $B$ is the number of samples in the batch, the loss function in Equation (2) is updated as the average loss over all the samples in the batch:

$$l(\mathcal{B})=\frac{1}{B}\sum_{i=1}^{B}l_{\text{triplet}}(\mathbf{x}_{ai},\mathbf{x}_{pi},\mathbf{x}_{ni})\qquad(3)$$
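As a concrete illustration, the sketch below (Python/NumPy) evaluates the cosine-similarity triplet loss of Equations (2)-(3) on precomputed embedding vectors (e.g., outputs of $Z_s$); the margin value eps is a hypothetical choice, and in actual training the loss would be back-propagated through the CL network by a deep-learning framework.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def triplet_loss(x_a, x_p, x_n, eps=0.2):
    # Equation (2): hinge on the negative-pair vs. positive-pair similarity gap.
    return max(cosine(x_a, x_n) - cosine(x_a, x_p) + eps, 0.0)

def batch_triplet_loss(anchors, positives, negatives, eps=0.2):
    # Equation (3): average the triplet loss over the B triplets in a batch.
    return float(np.mean([triplet_loss(a, p, n, eps)
                          for a, p, n in zip(anchors, positives, negatives)]))
```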

The contrastive model needs to be trained on a small initial subset of annotated data to be capable of evaluating the similarity among incoming sample pairs. For each unlabeled sample $\mathbf{x}_i^{U}\in\mathbf{D}_{\text{unlabel}}$, where $i=\{1,\ldots,d_u\}$, the corresponding similarity score $s_i$ is obtained as follows: (1) all annotated samples $\mathbf{D}_{\text{label}}\in\mathbb{R}^{d_l\times 3}$ are passed through the trained contrastive model, generating an array of feature vectors $\mathbf{F}^{L}=Z_s(\mathbf{D}_{\text{label}})\in\mathbb{R}^{d_l\times d}$, where row $j$ denotes a feature vector $\mathbf{f}_j^{L}\in\mathbb{R}^{d}$; (2) the unlabeled sample $\mathbf{x}_i^{U}$ is passed through the trained contrastive model, generating a $d$-dimensional feature vector $\mathbf{f}_i^{U}=Z_s(\mathbf{x}_i^{U})$; (3) the pairwise cosine similarity between $\mathbf{f}_i^{U}$ and $\mathbf{F}^{L}$ is computed to obtain a vector of cosine similarities $\mathbf{c}$ with $d_l$ elements, where an arbitrary element is $\cos(\mathbf{f}_j^{L},\mathbf{f}_i^{U})\in[-1,1]$; (4) the maximum value in $\mathbf{c}$ is identified as the final similarity score $s_i=\max\mathbf{c}$ for the incoming sample $\mathbf{x}_i^{U}$, which is expressed as

$$s_i=\max_{j=\{1,\ldots,d_l\}}\cos\left(Z_s\left(\mathbf{x}_j^{L}\right),Z_s\left(\mathbf{x}_i^{U}\right)\right)\qquad(4)$$

A higher similarity score corresponds to an increased probability of the unlabeled data point belonging to the target distribution. Notice that this score involves both the annotated data pool and the unlabeled data in the forward pass. To get the complete vector, steps 2-4 are repeated for all remaining samples in the unlabeled data pool $\mathbf{D}_{\text{unlabel}}$, resulting in a vector of contrastive similarity scores $\mathbf{S}'$ through iterative concatenation $\mathbf{S}':=\mathbf{S}'\frown s_i,\;\forall i$.
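A vectorized sketch of steps (1)-(4) is given below; it assumes $Z_s$ is available as a callable that maps an array of samples to an array of $d$-dimensional feature vectors, and it produces the whole score vector $\mathbf{S}'$ at once rather than through the per-sample concatenation described above.

```python
import numpy as np

def similarity_scores(Z_s, D_label, D_unlabel):
    # Step (1): features of all annotated samples, shape (d_l, d).
    F_L = np.asarray(Z_s(D_label))
    # Step (2): features of all unlabeled samples, shape (d_u, d).
    F_U = np.asarray(Z_s(D_unlabel))
    # L2-normalize rows so that dot products equal cosine similarities.
    F_L = F_L / (np.linalg.norm(F_L, axis=1, keepdims=True) + 1e-12)
    F_U = F_U / (np.linalg.norm(F_U, axis=1, keepdims=True) + 1e-12)
    # Step (3): pairwise cosine similarities, shape (d_u, d_l).
    C = F_U @ F_L.T
    # Step (4) / Equation (4): s_i is the maximum similarity to any labeled sample.
    return C.max(axis=1)
```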

The CL model thus evaluates the similarity of semantic features among data points and assigns elevated similarity scores to data originating from the shared target distribution. This enables us to selectively choose data points within the similar distribution (the smaller machines). This selection is visually represented in Figure 3(a) by the mauve region.

III-B Uncertainty Sampling

The entropy of the model's prediction on an unlabeled sample can be used to compute the informativeness of that sample. Entropy sampling is a popular AL technique [17] under the category of uncertainty sampling approaches. Samples exhibiting high uncertainty are ideal candidates for annotation as they lie near the decision boundary and thus contain potentially valuable information for the model to learn. For a binary classification problem where $x$ is a sample and $\hat{y}$ is the predicted class, entropy is defined as:

$$H(x)=P_{\theta}(\hat{y}\mid x)\log\left(\frac{1}{P_{\theta}(\hat{y}\mid x)}\right)=-P_{\theta}(\hat{y}\mid x)\log\left(P_{\theta}(\hat{y}\mid x)\right)\qquad(5)$$

The computation of the vector of uncertainty scores $\mathbf{U}\in[0,1]^{d_u}$ is now discussed. Naturally, it involves the uncertainty model described above, trained on the same initial subset of annotated data as the CL model. Consider a sample from the unlabeled data pool $\mathbf{x}_i^{U}$, where $i=\{1,\ldots,d_u\}$. Let $c_i^{U}$ represent its predicted class label $\operatorname{argmax}_k Z_u(\mathbf{x}_i^{U})$ for $k\in\mathcal{L}$. In addition, suppose $p_{i,k}^{U}$ represents the normalized probability that the sample $\mathbf{x}_i^{U}$ belongs to class $k$, so that $p_{i,k}^{U}=p(g_i^{U}=k\mid\mathbf{x}_i^{U})$, where $g_i^{U}$ denotes the predicted class of sample $i$. The unlabeled sample $\mathbf{x}_i^{U}=(x_i,y_i,z_i)$ is passed through the trained classifier model to generate the features $Z_u$, and the Shannon entropy $H$ is calculated for $\mathbf{p}_i^{U}$, which gives the uncertainty score $u_i$:

$$u_i=H\left(Z_u\left(\mathbf{x}_i^{U}\right)\right)=H(\mathbf{p}_i^{U})=-\sum_{k}p_{i,k}^{U}\log_2 p_{i,k}^{U}\qquad(6)$$

The vector of uncertainty scores is also generated iteratively through $\mathbf{U}:=\mathbf{U}\frown u_i,\;\forall i$. Uncertainty sampling thus aims to select instances that improve model performance by focusing on the most informative or uncertain data points. The uncertainty score guides the selection of data points within the blue region depicted in Figure 3(b). This strategy is particularly useful in scenarios where labeling data is expensive or time-consuming, as it helps make the most of the limited labeled data available.
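A minimal sketch of Equation (6) is shown below; it assumes the anomaly detection classifier already outputs normalized class probabilities $p_{i,k}$ for every unlabeled sample (e.g., via a softmax layer), so only the entropy computation itself is illustrated.

```python
import numpy as np

def uncertainty_scores(P_unlabel):
    # P_unlabel: array of shape (d_u, K) whose rows are the normalized class
    # probabilities p_{i,k} predicted for each unlabeled sample.
    P = np.clip(P_unlabel, 1e-12, 1.0)        # guard against log2(0)
    # Equation (6): Shannon entropy (in bits) of each row.
    return -(P * np.log2(P)).sum(axis=1)
```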

III-C Joint Query Strategy

Figure 3: Visualizing divisions of the sample space when the ADs framework is applied to the unlabeled data pool. (a) The similarity score $\mathbf{S}$ based on the contrastive model forms a horizontal orange decision boundary in the sample space, addressing the distribution mismatch in the data by separating the mismatched data of the large machine (circles) from the desired data of the small machines (triangles, mauve region). (b) The uncertainty score $\mathbf{U}$ based on the entropy of the classifier scores (normal/abnormal) forms the blue region containing the desired subset of highly informative samples. The vertical blue decision boundary is generated by the binary classifier. (c) The best joint scores in $\mathbf{J}=\mathbf{S}\odot\mathbf{U}$ identify the samples that are both highly informative for the downstream task and follow the desired similar data distribution from the two small machines (green region).

In the framework of AL, the idea is to iteratively select and annotate a certain number of informative samples in each cycle to improve the performance of the downstream task. In our problem, the unlabeled samples are evaluated from two aspects: (1) the closeness to the target distribution, evaluated by the similarity score, and (2) the potential improvement to the anomaly detection task, evaluated by the uncertainty score. Guided by the goal of preventing the distribution-mismatch issue in data sharing, the joint query strategy is designed to select the samples with a high uncertainty score (benefiting the downstream task) conditioned on their being close to the target distribution (having a high similarity score).

Therefore, the similarity score $\mathbf{S}'$ is first binarized, setting the samples with the top $w\%$ similarity scores to 1 and the remaining samples to 0. The binarized similarity score is denoted as $\mathbf{S}$. The binarization is repeated in each cycle of the proposed ADs; therefore, the value of $w$ is determined to make sure that there are enough samples to be queried. This binarization of $\mathbf{S}'$ is also motivated by the fact that, in the case study, the number of negative samples $\ell(\mathbf{X}^{L_1})$ might be larger than the number of positive samples $\ell(\mathbf{X}^{S_1}\oplus\mathbf{X}^{S_2})$ in the given dataset. Given this situation, we tend to set the value of $w$ as small as possible to ensure that only samples close to the target distribution (with the highest similarity scores) are selected. With these two criteria, the value of $w$ is determined by ensuring there are just enough samples with a binarized similarity score of 1 to be queried. This work uses 25% of the unlabeled data as the choice of $w$, which consistently satisfied the criteria across the various test settings.

The binarization described above is formally defined as a function $\operatorname{locmax}_n$ that maps an $m$-dimensional vector $\mathbf{X}$ to another $m$-dimensional mask vector $\mathbf{Y}$, such that each element of $\mathbf{Y}\in\{0,1\}$ denotes whether the corresponding element of $\mathbf{X}$ is part of the top-$n$ maximum set (i.e., 1) or not (i.e., 0). For example, $y_1 = 1$ indicates that $x_1$ is among the top-$n$ maximum values of $\mathbf{X}$; otherwise $y_1 = 0$. Then, for $i = 1, 2, \ldots, m$:

\begin{align}
\operatorname{locmax}_n :\; \mathbf{X}\in\mathbb{R}^m \mapsto \mathbf{Y}\in\mathbb{R}^m \;\big|\; y_i = 1 \text{ iff } x_i \in \max\nolimits_n(\mathbf{X}) \text{ else } y_i = 0 \tag{7}
\end{align}

where $\max_n$ is simply the vector of the $n$ maximum values of a target vector. It can be formally defined as:

\begin{align}
\max\nolimits_n :\; & \mathbf{X}\in\mathbb{R}^m \mapsto \mathbf{Y}\in\mathbb{R}^n \;\big|\; n < m \tag{8}\\
& \mathbf{X}_0 = \emptyset \nonumber\\
& \mathbf{X}_{n+1} = \mathbf{X}_n \cup \{\max(\mathbf{X}\setminus\mathbf{X}_n)\} \text{ for } n \geq 0 \nonumber
\end{align}

The binarized similarity score is then given as $\mathbf{S} = \operatorname{locmax}_{\lfloor k\times d_u\rfloor}(\mathbf{S}')$. The joint score can now be computed. Observe that there are two $d_u$-dimensional vectors, $\mathbf{S}$ and $\mathbf{U}$, representing the similarity and uncertainty scores for the unlabeled data pool. The joint score vector $\mathbf{J}$ is computed as their Hadamard product:

\begin{align}
\mathbf{J} &= \mathbf{S}\odot\mathbf{U} \tag{9}\\
\left(J_1,\ldots,J_{d_u}\right) &= \left(s_1,\ldots,s_{d_u}\right)\odot\left(u_1,\ldots,u_{d_u}\right) \tag{10}
\end{align}

Given that $s_i\in\{0,1\}$ and $u_i\in[0,1]$, it follows that $J_i = s_i u_i\in[0,1]$. The higher the joint score $J_i$ of a sample, the more likely it is both similar to the dataset $X^{S_1}\oplus X^{S_2}$ and of high entropy. Thus, by selecting samples with high joint scores, we effectively query samples for annotation that (1) are highly likely to belong to the desired similar-machine distribution, and (2) benefit the downstream anomaly detection task, i.e., they lie close to the classification boundary. Our objective is to integrate the architectures of CL and AL, capitalizing on the unique strengths of each approach. In CL, we measure the similarity of semantic features among data points, assigning high similarity scores to data from the same underlying distribution. This similarity score is then employed to selectively pick data points from the small machines (the mauve region in Figure 3a) that share a similar distribution. In addition, AL utilizes entropy to score the least certain instances for data point selection (the blue region in Figure 3b). This uncertainty measure proves valuable for distinguishing between normal and abnormal data during classification and is particularly useful in scenarios where labeling data is expensive or time-consuming. Samples with both high similarity scores and high uncertainty scores are then selected for labeling (the green region in Figure 3c).
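To make the joint query strategy concrete, the sketch below assembles the top-$w\%$ binarization of the similarity scores, the Hadamard product of Equation (9), and the per-cycle query of the highest-scoring samples. The function names, the fraction `w`, and the batch size `t` are illustrative placeholders rather than the exact implementation.

```python
import numpy as np

def locmax(scores: np.ndarray, n: int) -> np.ndarray:
    """Binary mask marking the n largest entries of `scores` (cf. Eqs. (7)-(8))."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-n:]] = 1.0
    return mask

def joint_query(S_prime: np.ndarray, U: np.ndarray, w: float, t: int) -> np.ndarray:
    """Indices of the t samples to annotate in the current AL cycle.

    S_prime: raw similarity scores of the unlabeled pool (one per sample).
    U      : entropy-based uncertainty scores of the same pool.
    w      : fraction of the pool kept by the similarity filter (e.g., 0.25).
    t      : number of samples queried per cycle (should not exceed w * pool size).
    """
    S = locmax(S_prime, int(np.floor(w * len(S_prime))))  # binarized similarity score
    J = S * U                                             # Hadamard product, Eq. (9)
    return np.argsort(J)[-t:]                             # highest joint scores

# Toy usage: six unlabeled samples, query two of them.
queried = joint_query(np.array([0.9, 0.2, 0.8, 0.1, 0.95, 0.3]),
                      np.array([0.7, 0.9, 0.6, 0.99, 0.4, 0.5]), w=0.5, t=2)
```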

III-D Main Active Learning Cycle

Algorithm 1: Active Data-sharing
Input: Small-machine labeled data $D^S_{\text{label}}$: $\{S_1^X, S_1^Y, S_2^X, S_2^Y\}$; large-machine labeled data $D^L_{\text{label}}$: $\{L_1^X, L_1^Y\}$; unlabeled data pool $D_{\text{unlabel}}$
Output: Trained classifier $\theta_u$
1. Train the similarity model $\theta_s$ with $\{D^S_{\text{label}}, D^L_{\text{label}}\}$
2. for each iteration of the AL cycle do
3.   Train the classifier model $\theta_u$ with $D^S_{\text{label}}$
4.   Calculate the uncertainty feature of $D_{\text{unlabel}}$ using $\theta_u$: $z_u(D_{\text{unlabel}}) = \theta_u(D_{\text{unlabel}})$
5.   Calculate $S_{\text{uncertainty}}(D_{\text{unlabel}})$ using Eq. (6)
6.   Calculate the similarity features of $D^S_{\text{label}}$ and $D_{\text{unlabel}}$ using $\theta_s$: $z_s(D^S_{\text{label}}) = \theta_s(D^S_{\text{label}})$, $z_s(D_{\text{unlabel}}) = \theta_s(D_{\text{unlabel}})$
7.   Calculate $S_{\text{similarity}}(D_{\text{unlabel}})$ using Eq. (9)
8.   Calculate the joint score $J_{\text{unlabel}}$ using Eq. (10) with $S_{\text{uncertainty}}(D_{\text{unlabel}})$ and $S_{\text{similarity}}(D_{\text{unlabel}})$
9.   Query the samples with the maximum $J_{\text{unlabel}}$ into the query set $X_{\text{Queried}}$
10.  Request the oracle to annotate all labels in $Y_{\text{Queried}}$
11.  $D^S_{\text{labeled}} \leftarrow D^S_{\text{labeled}} \cup \{X_{\text{Queried}}, Y_{\text{Queried}}\}$
12. end for
13. return the trained classifier $\theta_u$

Algorithm 1 for our proposed ADs framework is presented in this subsection and also shown in Figure 2. Initiated with Latin Hypercube Sampling (LHS), we select the initial labeled dataset from the two small machines, $D^S_{\text{label}}$: $\{S_1^X, S_1^Y, S_2^X, S_2^Y\}$, and the large machine, $D^L_{\text{label}}$: $\{L_1^X, L_1^Y\}$. This ensures that the sampled subset represents the genuine variability of the initial dataset. Both pre-trained CL models are trained using this initial labeled data. The steps are summarized as follows:

  1. In the AL phase, the classifier is trained using the initial labeled data.

  2. Leveraging the existing classifier and the pre-trained CL model, we extract the similarity feature and the uncertainty feature for both labeled and unlabeled data.

  3. The features are used to calculate the similarity score with Equation (4) and the uncertainty score with Equation (6) for the unlabeled data $D_{\text{unlabel}}$, which are then combined using Equation (9).

  4. The data annotator (oracle) labels the queried data $\{X_{\text{Queried}}, Y_{\text{Queried}}\}$ with the highest joint scores.

  5. The labeled dataset is updated by incorporating the queried data; the annotated data pool grows while the unlabeled data pool shrinks by the same amount.

  6. Steps 1-5 are repeated cyclically. Notably, the classifier model evolves with each cycle, updating dynamically with the new labeled data from the oracle.
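A compact sketch of the resulting loop (Algorithm 1) is given below. The model objects and helper callables (`train_classifier`, `similarity_scores`, `uncertainty_scores`, `oracle`) are placeholders standing in for the CNN classifier, the pre-trained CL model, and the human annotator described above; they are assumptions for illustration only.

```python
import numpy as np

def active_data_sharing(D_label, D_unlabel, theta_s, train_classifier,
                        similarity_scores, uncertainty_scores, oracle,
                        n_cycles: int, t: int, w: float = 0.25):
    """One possible realization of the ADs loop in Algorithm 1.

    D_label   : dict with 'X' and 'Y' arrays of annotated small-machine data.
    D_unlabel : array of unlabeled samples shared by all three machines.
    theta_s   : pre-trained contrastive (similarity) model.
    oracle    : callable that returns labels for the queried samples.
    """
    theta_u = None
    for _ in range(n_cycles):
        theta_u = train_classifier(D_label['X'], D_label['Y'])        # retrain classifier
        U = uncertainty_scores(theta_u, D_unlabel)                     # entropy per sample
        S_prime = similarity_scores(theta_s, D_label['X'], D_unlabel)  # cosine similarities
        S = np.zeros_like(S_prime)                                     # binarize the top-w%
        S[np.argsort(S_prime)[-int(w * len(S_prime)):]] = 1.0
        J = S * U                                                      # joint score, Eq. (9)
        q = np.argsort(J)[-t:]                                         # samples to query
        X_q, Y_q = D_unlabel[q], oracle(D_unlabel[q])                  # oracle annotation
        D_label['X'] = np.concatenate([D_label['X'], X_q])
        D_label['Y'] = np.concatenate([D_label['Y'], Y_q])
        D_unlabel = np.delete(D_unlabel, q, axis=0)                    # shrink the pool
    return theta_u
```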

III-E Pareto Optimality of ADs

It is evident that the described problem can be modeled as a multi-objective optimization problem (MOO) due to the distinct nature of the two objectives involved. Such problems typically have multiple optimal solutions, and it becomes necessary to define the concept of Pareto optimality.

Definition 1 (Pareto Optimal [66]).

A point $x^*\in X$ is Pareto optimal iff there does not exist another point $x\in X$ such that $F(x)\leq F(x^*)$ and $F_i(x) < F_i(x^*)$ for at least one function $F_i$.

Definition 2 (Weakly Pareto Optimal [66]).

A point $x^*\in X$ is weakly Pareto optimal iff there does not exist another point $x\in X$ such that $F(x) < F(x^*)$.

For the first objective, i.e., differentiating data samples from different distributions, the acquisition function is a similarity score $\mathbf{S}$ based on the cosine similarity metric. It is important to note that this could be any similarity metric, and the approach would remain valid. For the second objective, i.e., deciding the informativeness of samples, uncertainty sampling [17] is leveraged to compute the uncertainty score $\mathbf{U}$. By harmonizing these distinct objective functions into an integrated criterion that represents both objectives, the queried data $\mathbf{D}^Q$ satisfying Pareto optimality can be guaranteed to meet both objectives. To that end, the joint score $\mathbf{J}$ in Equation (9) is proposed, forming a scalarized combination of the two individual acquisition functions via element-wise multiplication. In effect, $t$ target samples are selected for the annotator based on the calculated $\mathbf{J}$ at the end of each iteration of the active learning. In the remainder of this section, we prove that the Pareto optimum of the two objectives in our formulation exists and can be achieved by optimizing the proposed joint acquisition function $\mathbf{J}$.

Convex Analysis of Joint Acquisition Function: The joint acquisition function $\mathbf{J}$ is defined as the Hadamard product between the similarity score $\mathbf{S}$ and the uncertainty score $\mathbf{U}$. The definition of the similarity score (Equation (4)) indicates that for the $i$-th sample in the unlabeled data pool, its similarity score $s_i$ is the maximum of the cosine similarities to all annotated samples. Once $s_i$ has been computed for all samples to form the vector of cosine similarities $\mathbf{S}'$, it is further processed by setting the top $k\%$ of scores to one and the rest to zero, resulting in a sparse vector $\mathbf{S}$. Because of this, the similarity score vector is essentially an indicator vector that selects the subset of unlabeled samples close to the target population. In addition, the uncertainty score vector $\mathbf{U}$ (Equation (6)) is defined upon entropy, which is a concave function (the proof is detailed in Appendix A-A). Recalling that the similarity score vector $\mathbf{S}$ is a sparse indicator vector, the Hadamard product representing the joint score function $\mathbf{J} = \mathbf{S}\odot\mathbf{U}$ essentially reduces to the entropy values of the selected samples. In other words, the joint score $J_i = s_i u_i$ of a sample is the product of a scalar with a concave function. Therefore, the joint acquisition function is also concave, given the following Theorem 1.

Theorem 1 (Canonical Combinations of Convex Functions).

Consider a set of convex functions $f_1,\ldots,f_n$ mapping $\mathbf{x}\to\mathbb{R}$, and let $\alpha_1,\ldots,\alpha_n$ be a set of non-negative scalars; then

\[
g(\mathbf{x}) = \sum_{i=1}^{n}\alpha_i f_i = \alpha_1 f_1 + \alpha_2 f_2 + \ldots + \alpha_n f_n
\]

is also convex. Furthermore, if any $f_i$ is strictly convex, then $g(\mathbf{x})$ is strictly convex.

The concavity of the joint acquisition function indicates the existence of its global optimum. Next, we demonstrate the existence of a Pareto optimal solution in the MOO setting.

Theorem 2 (Sufficient Condition for Pareto Optimality [66]).

Let $F\in Z$, $x^*\in X$, and $F^* = F(x^*)$. Let a scalar global criterion $F_g(F): Z\rightarrow\mathbb{R}$ be differentiable with $\nabla_F F_g(F) > 0\;\forall F\in Z$. Assume $F_g(F^*) = \min\{F_g(F) \mid F\in Z\}$. Then $x^*$ is Pareto optimal.

Define $Z = \{\mathbf{S}, \mathbf{U}\}$ and $x^*\in X$, and let the joint acquisition function be the scalar global criterion $\mathbf{J}: Z\rightarrow\mathbb{R}$. We need to prove that $\mathbf{J}$ increases monotonically with respect to $\mathbf{S}$ and $\mathbf{U}$, i.e., $\nabla_F\mathbf{J} > 0$ for all $F\in Z$. Since $\partial\mathbf{J}/\partial\mathbf{S} = \mathbf{U}$ and the uncertainty score $\mathbf{U}$ is based on entropy, $\partial\mathbf{J}/\partial\mathbf{S} > 0$. Similarly, since $\partial\mathbf{J}/\partial\mathbf{U} = \mathbf{S}$ and the similarity score $\mathbf{S}$ is designed based on the triplet loss with cosine similarity as the distance function, $\partial\mathbf{J}/\partial\mathbf{U} > 0$ intuitively. Therefore, $\nabla_F\mathbf{J} > 0$ for all $F\in Z$ for our proposed joint acquisition function. Following Theorem 2, the optimal solution of the global function $\mathbf{J}$ is sufficient for achieving the Pareto optimality of the separate objectives, since $\mathbf{J}$ increases monotonically with respect to each objective function.

Given the existence of the optimum of the joint acquisition function and the Pareto optimality of the separate objectives, we further demonstrate that the design of AL converges to the optimal point as the number of queried samples increases. Referring to Equation (9), the similarity scores serve as a filter that keeps the uncertainty scores only for those unlabeled samples close to the target population. This essentially reduces the problem to a classical AL setting in which no distribution mismatch exists (data samples from different distributions are eliminated). Raj et al. [67] show that uncertainty sampling in classic AL converges to the optimal predictor for binary classification and provide a non-asymptotic convergence rate of order $O(1/n)$, where $n$ is the number of iterations of the AL.

IV Case Study

This section presents the motivation, core design choices, as well as a comprehensive overview of the experiments conducted as a case study to verify the effectiveness of the proposed ADs framework in industrial tasks.

IV-A Experimental Setup

ML applications in industrial tasks are generally hindered by challenges including data scarcity, limited annotation budgets, and lack of prior knowledge of the data distribution and informativeness, which can result in shared data that is misaligned with the target distribution and uninformative. Data sharing is commonly proposed as a solution, but it only addresses the problems of data scarcity and, to an extent, the annotation budget. The issues of distribution mismatch and uninformativeness of the shared data remain prevalent, thereby degrading the performance of the downstream task. It follows that any framework attempting to solve these issues must further ensure that (1) the distribution of the selected data matches the target distribution, i.e., the distribution mismatch of the shared data is minimized, and (2) the selected data contributes valuable information to the model handling the downstream task.

The problem scenario for the case study is carefully designed to simulate these issues as they appear in the real world. Data is collected for the same additive manufacturing process from three machines, with the ultimate goal of enabling data sharing to enhance the downstream task of anomaly detection in the manufactured object. The particular anomaly to detect is a manufacturing fault, like the one shown in Figure 5. Research has shown that the monitoring data can reflect this anomaly [68]. Of the three machines, two are smaller in size and of the same make and model, and the third is a larger model from a separate manufacturer, as evident from Figure 4. In this section, the two similar small machines are termed $S = \{S_1, S_2\}$, and the dissimilar large machine is termed $L$.

Figure 4: Experimental platform using Fused Filament Fabrication (FFF) machines: (a) two small printers of the same make and model, and (b) one large printer from a separate manufacturer [3].

The purpose of having two similar machines and one dissimilar machine is to explicitly introduce the distribution mismatch of a typical data-sharing problem. Specifically, the case study considers the scenario where all three machines are tasked with manufacturing the same object, a cube with dimensions $2\times2\times2\text{ cm}^3$. It is natural to treat the monitoring data obtained from the similar machines as following a similar distribution, in contrast to that from the dissimilar machine, owing to differences in build design, software, specifications, and other manufacturing factors. Therefore, distribution mismatch needs to be considered when sharing the monitoring data among these three machines. As discussed earlier, this is the issue that the contrastive model in ADs addresses. Furthermore, given that the data is scarce and annotations are costly in terms of manual time and effort, it becomes critical to have only the most informative samples annotated for the downstream task. The uncertainty sampling model of ADs is well-suited for this purpose. The model requires a classifier to identify normal and abnormal data, which, in this work, is based on a Convolutional Neural Network (CNN) architecture with cross-entropy loss and Adam optimization [69].
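Since the paper only states that the classifier is a CNN trained with cross-entropy loss and the Adam optimizer, the following PyTorch sketch shows one plausible 1D-CNN for the in-situ monitoring sequences; the layer sizes, the number of input channels, and the pooling choices are assumptions, not the architecture actually used.

```python
import torch
import torch.nn as nn

class AnomalyCNN(nn.Module):
    """Minimal 1D-CNN binary classifier (normal vs. abnormal) for monitoring sequences."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, 2)   # two logits: normal / abnormal

    def forward(self, x):              # x: (batch, channels, sequence_length)
        return self.head(self.features(x).squeeze(-1))

model = AnomalyCNN()
criterion = nn.CrossEntropyLoss()      # cross-entropy loss, as stated in the paper
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```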

IV-B Data Preprocessing and Augmentation

In the ADs framework, the initial phase involves preprocessing the data and training the augmentation model. This model is specifically designed to generate positive pairs for time-sequence data, serving as the training input for the CL model. The entire dataset is min-max normalized to ensure consistent scaling throughout. An initial subset of the complete data (e.g., 20%) is then sampled for manual annotation. Rather than using random sampling, ADs employs Latin Hypercube Sampling (LHS) [70, 71] to ensure that the sampled data is uniformly distributed within the parameter space and therefore representative of the real variability of the distribution. Following this, the Winner-Take-All autoencoder of [61] is trained in an unsupervised fashion to learn shift-invariant, sparse, high-dimensional latent embeddings of the normalized data. Once trained, transformed augmentations of the input data are generated by passing the input through the encoder to obtain its latent embedding, manipulating the embedding in the latent space, and finally reconstructing the input by decoding the modified latent embedding. The reconstruction serves as the transformed augmentation. ADs defines two augmentation transformations in this way:

Additive Gaussian Noise: The first augmentation $\mathbf{y}_{a_1}$ of the $n$-dimensional latent space output embedding $\mathbf{y}_e$ is obtained by adding a Gaussian noise vector $\mathbf{n}$ to $\mathbf{y}_e$:

\begin{align}
\mathbf{y}_{a_1} &= \mathbf{y}_e + \mathbf{n} \tag{11}\\
\mathbf{n} &= \frac{r\sigma_e}{5}\times\mathbf{M} \nonumber
\end{align}

where $r\sim\mathcal{N}(0,1)$, $\sigma_e$ indicates the standard deviation of the latent space output, and $\mathbf{M}$ is a 2D binary mask array based on $\mathbf{y}_e$:

\[
\mathbf{M} = \begin{cases} 1 & \text{if } y > \sigma_e\\ 0 & \text{if } y \leq \sigma_e \end{cases},\quad \forall\, y\in\mathbf{y}_e
\]

In short, $\mathbf{M}$ enforces that noise is only added to those components of the output embedding $\mathbf{y}_e$ that lie outside one standard deviation of the vector. This ensures that the generated augmentation exploits the variability of the embedding as much as possible by adding noise only to the components that deviate significantly from the mean. The generated augmentation thus retains the components most closely clustered around the mean, preserving the most characteristic features. This is desirable because the augmentations are meant to be used as positive examples while training the contrastive network; consequently, they should not deviate too much from the actual sample (i.e., the anchor), mitigating the risk of generating an out-of-distribution (OOD) augmentation.

Embedding Deviation Thresholding: The second augmentation $\mathbf{y}_{a_2}$ is generated by multiplying $\mathbf{y}_e$ with its absolute value $|\mathbf{y}_e|$ to produce a resultant vector $\mathbf{y}_e'$. Each component of $\mathbf{y}_e'$ is then compared with $\sigma_e$, such that all components greater than $\sigma_e$ are set to 1 and the rest to 0. Mathematically:

\begin{align}
\mathbf{y}_e' &= \mathbf{y}_e \odot |\mathbf{y}_e| \nonumber\\
\mathbf{y}_{a_2} &= \begin{cases} 1 & \text{if } y > \sigma_e\\ 0 & \text{if } y \leq \sigma_e \end{cases},\quad \forall\, y\in\mathbf{y}_e' \tag{12}
\end{align}

This augmentation is similar to the mask in the first one, with the main differences being that no noise is added, and the threshold is applied to the squared components of the output embedding, thereby exaggerating the distance, or lack thereof, of each component from the mean.
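A minimal numpy sketch of the two latent-space transformations is given below, assuming the latent embedding `y_e` has already been produced by the Winner-Take-All encoder and that the encoder/decoder calls in the comment are placeholders; computing $\sigma_e$ as the per-embedding standard deviation is also an assumption.

```python
import numpy as np

def augment_gaussian(y_e: np.ndarray) -> np.ndarray:
    """First augmentation (Eq. (11)): masked additive Gaussian noise in latent space."""
    sigma_e = y_e.std()                          # standard deviation of the embedding
    r = np.random.randn()                        # r ~ N(0, 1)
    mask = (y_e > sigma_e).astype(float)         # noise only where y > sigma_e
    return y_e + (r * sigma_e / 5.0) * mask

def augment_threshold(y_e: np.ndarray) -> np.ndarray:
    """Second augmentation (Eq. (12)): threshold the signed squared embedding."""
    sigma_e = y_e.std()
    y_signed_sq = y_e * np.abs(y_e)              # y_e elementwise-times |y_e|
    return (y_signed_sq > sigma_e).astype(float)

# The decoded reconstructions of the augmented embeddings act as positive pairs
# for the contrastive model, e.g. (encoder/decoder are placeholders):
#   positive_1 = decoder(augment_gaussian(encoder(x)))
#   positive_2 = decoder(augment_threshold(encoder(x)))
```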

In order to evaluate the effectiveness of ADs and to perform a comparative analysis against other settings and methods, data from the additive manufacturing machines were acquired as they printed a cube with a small square void as the anomaly on one of the faces, as shown in Figure 5. It is important to understand the data handling procedure to ensure the quality of the evaluation, so a few terms are now introduced. Source identification is defined as the task of identifying whether a sample belongs to $S$ or $L$, and classification as assigning it the normal or abnormal class. For ease of reference, the data may be considered anonymized, i.e., it is unknown which machine produced it. As the first step in data pre-processing for this case study, the actual identity and class labels of all position recordings are isolated and stored separately to compute evaluation metrics. As such, the incoming data is both anonymous and unclassified (unlabeled). The data is then combined into a single, large dataset that is hereafter referred to as the initial dataset. Additionally, to make the task more challenging, the ratio $L:S$ is kept large so that the majority of the combined dataset is populated by data from the dissimilar machine. Finally, since the application of a supervised ML approach requires some annotated data, a small initial subset of the data is sampled via LHS and annotated, so that almost all of the data remains unlabeled.

These initial annotations consist of both source identification (S𝑆Sitalic_S or L𝐿Litalic_L) as input for a CL model and data classification (normal or abnormal). Note that source identification is not related to the downstream task of anomaly detection – it is only included to allow training the CL model in the ADs architecture. All successive annotations in the AL cycle consist purely of normal/abnormal classification so that the classifier can be retrained on the high-entropy data.

Figure 5: The printed product, a $2\times2\times2\text{ cm}^3$ cube. The square void indicated in the red bounding box is the anomaly that the downstream model is required to detect.

IV-C Experiment Settings

TABLE I: Experiment settings using the shared data from the three additive manufacturing machines.

| Setting | $\mathcal{D}_{\text{initial}}$ | $\mathcal{D}_{\text{training}}$ | $\mathcal{D}_{\text{testing}}$ |
| --- | --- | --- | --- |
| Supervised | N/A | $\{S_1,S_2\}^{\text{100\% train}}$ | $S_1^{\text{test}}, S_2^{\text{test}}$ |
| Random Pick S | $20\%S_1, 20\%S_2, 20\%L$ | $\{S_1,S_2\}^{\text{random pick}}$ | $S_1^{\text{test}}, S_2^{\text{test}}$ |
| Random Pick S+L | $20\%S_1, 20\%S_2, 20\%L$ | $\{S_1,S_2,L\}^{\text{random pick}}$ | $S_1^{\text{test}}, S_2^{\text{test}}$ |
| ADs Without CL | $20\%S_1, 20\%S_2, 20\%L$ | $\{S_1,S_2,L\}^{\text{queried}}$ | $S_1^{\text{test}}, S_2^{\text{test}}$ |
| ADs | $20\%S_1, 20\%S_2, 20\%L$ | $\{S_1,S_2\}^{\text{queried}}$ | $S_1^{\text{test}}, S_2^{\text{test}}$ |

At this point, the data is ready for input to ADs. Comparative experiments among the proposed ADs and other benchmark methods for anomaly detection are conducted to obtain more conclusive and insightful results. To that end, anomaly detection is performed in five different settings, which are now discussed in order; the experiment settings are summarized in Table I. Note that, in all settings: (1) in-text variables should be considered specific to the setting they appear in; (2) the testing dataset is kept the same across settings for a fair comparison; and (3) following standard practice, the training, testing, and validation sets $\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{test}}, \mathcal{D}_{\text{val}}$ are pairwise disjoint.

Supervised: This is the usual setting for most deep-learning approaches dealing with classification. It assumes that labels are available for all samples, which is the opposite of our scenario as it implies that data collection and annotations are not a problem, so data-sharing is not needed. As such, its performance on the downstream task is considered the baseline.

Random Pick S: In this setting, only the data from the similar machines is made available to the model, so $\mathcal{D} = \{S_1, S_2\}$. By removing the dissimilar data from the dataset, this setting allows observing the effect of the uncertainty score $\mathbf{U}$ from ADs in isolation, thus evaluating its effectiveness.

Random Pick S+L: This is similar to the Random Pick S setting, with the main difference being that the data available to the model comes from all three machines, so $\mathcal{D} = \{S_1, S_2, L\}$. As such, this setting better allows studying the usefulness of both the similarity score $\mathbf{S}$ and the uncertainty score $\mathbf{U}$. 20% of the entire data, alongside 400 randomly picked samples from the remaining data, forms the subset $\mathcal{D}^{\text{sub}}\subset\mathcal{D}$. Evidently, only the training dataset changes.

ADs Without CL: This setting strips away the contrastive model from the ADs framework, so that instead of a joint score $\mathbf{J}$ comprising a similarity score $\mathbf{S}$ and an uncertainty score $\mathbf{U}$, each sample is only assigned $\mathbf{U}$ to generate the query set for the next cycle. Thus, it represents a purely AL-based approach to data sharing. In this setting, data is sourced from all three machines, so $\mathcal{D} = \{S_1, S_2, L\}$. As in the Random Pick S+L setting, a subset of the data is formed by sampling 20% of this data, but no random samples are added. Instead, the 400 new samples are selected via AL: ADs is set to terminate after 5 cycles with 80 samples per cycle, and after adding the queried samples to the sampled 20%, the final subset $\mathcal{D}^{\text{sub}}\subset\mathcal{D}$ for training and validation is ready. Evidently, the only difference is how the 400 samples are selected.

ADs: This setting is exactly the same as ADs Without CL, except that the contrastive model is kept in the ADs framework, so that each sample is selected based on the joint score $\mathbf{J}$, which is in turn computed by integrating the similarity score $\mathbf{S}$ and the uncertainty score $\mathbf{U}$. Consequently, this setting minimizes the amount of dissimilar data $L$ introduced into the anomaly detector's training while still prioritizing data informativeness, thereby boosting model accuracy. Evidently, ADs treats AL as a multi-objective optimization problem, resulting in selected samples that are not only high-entropy samples but also highly likely to originate from the similar machines $S$.

Random Pick: This designation denotes that the training data (20% of the full data) is still specifically extracted, but the additional samples are selected at random. This decision is logically informed: owing to the randomized nature of the data, the anomaly detector forms different decision boundaries when re-trained on a new data split. Therefore, by running multiple instances of a model trained in this way, the average performance of these models yields a more accurate estimate of the model's true performance.

IV-D Results

The corresponding results, including the anomaly detector's accuracy, F1 score, and the percentage of selected samples that are not from the target distribution (%L), are tabulated in Table II. Ideally, the classification accuracy would be 100% (i.e., all samples are correctly classified as normal or abnormal), and the percentage of negative samples among the selected unlabeled samples would be 0% (i.e., no negative samples are selected). Furthermore, the effect of %L on model accuracy is visualized in Figure 6 for different choices of queried data and different percentages of the complete data used to create the training subset (Q. Data and %Data in Table II). Note that %L is only applicable to the Random Pick S+L, ADs Without CL, and ADs settings, which involve the dissimilar data from the larger machine $L$ and therefore present a chance of incorrectly adding its samples to the training pipeline. This is unlike the Supervised and Random Pick S settings, which only deal with similar-machine data.

TABLE II: Results of the case study using the shared data from the three additive manufacturing machines.

| Setting | %Data | Q. Data | Accuracy | F1 Score | %L |
| --- | --- | --- | --- | --- | --- |
| Supervised | 100 | n/a | 0.9437 | 0.9449 | n/a |
| Random Pick S | 20 | 400 | 0.8957 | 0.8980 | n/a |
| Random Pick S+L | 20 | 400 | 0.8805 | 0.8805 | 45.25 |
| ADs Without CL | 20 | 400 | 0.9510 | 0.9507 | 12.5 |
| ADs | 20 | 400 | 0.9578 | 0.9563 | 0 |
Figure 6: The ADs Framework efficiently queried high-quality data and also did not require a full dataset to achieve promising performance. (a) The usage of similarity scores computed from the CL network corresponds to lower %L, which in turn provides higher accuracy as expected. (b) Without CL and similarity scores, we see much higher %L values, resulting in worse accuracy and greater variability of the performance.

As evident from the results in Table II, even with just 20% of the complete data and 400 additional samples selected using our novel query strategy, the semi-supervised ADs outperforms the supervised baseline, both with and without CL (i.e., whether or not the similarity score is involved in the joint score computation). This is evidenced specifically by the increase in accuracy (ADs: +1.41%, ADs Without CL: +0.73%) and F1 score (ADs: +1.14%, ADs Without CL: +0.58%). Additionally, the results in Figure 6 confirm that enabling CL to generate similarity scores effectively allows the framework to distinguish similar and dissimilar data, evidenced by the significantly lower %L in the setting with CL enabled. Lower %L values also consistently correspond to higher accuracy in the downstream anomaly detection task despite variations in the size of the training data. On the contrary, the setting with CL disabled picks large amounts of dissimilar data, which corresponds to generally worse performance as well as a greater degree of result variability.

IV-E Ablation Study

Figure 7: Results of the ablation experiments. (a) and (b) evaluate the performance with and without CL, (c) evaluates the performance with varying numbers of initially sampled data, and (d) evaluates the performance with varying total numbers of selected samples.

In order to further reinforce the results of the main experiments, ablation experiments were performed. Their goals are to (1) establish that the reduction in the selection of dissimilar samples (%L) is not driven by the number of selected samples but primarily by the usage of CL, (2) study the relationship between model performance and the number of samples selected per cycle of ADs for a fixed budget of total samples, and (3) study the effect of increasing and decreasing the amount of queried data in each cycle. To accomplish the first two objectives, the total number of queried samples was fixed at 800 in addition to the 20% initially sampled data, and the only two variables were whether CL is enabled and the number of samples selected per cycle of ADs. Sixteen experiments were conducted with a total query budget of 800; the samples were queried evenly across cycles, with the number of cycles chosen from the set {2, 4, 5, 8, 10, 16, 20, 25, 32, 40, 50, 80, 100, 160, 200, 400}.

The results show a consistently and significantly smaller %L for the model with CL compared to the one without CL. The fewer samples queried per cycle, the more likely the model without CL is to exhibit high %L values; with CL, the result remains fairly stable regardless of the number of samples queried per cycle. This confirms the hypothesis that CL is the primary factor in reducing the number of dissimilar samples in the final query set. The results of these ablation experiments are visualized in Figures 7(a) and 7(b). For the third goal, CL is kept enabled, and the total number of queried samples is set to 600 and 800, with 20% initial data in both cases; the model performance is compared over varying samples per cycle for each scenario. While the performance does improve for the larger budget, the difference is small considering the 200-sample gap between the two settings. The results of this experiment are visualized in Figures 7(a) and 7(c). We also experimented with different amounts of initial data while keeping CL enabled and the number of queried samples the same; the model performance with 15% and 20% initial data is similar. The results of this experiment are visualized in Figures 7(a) and 7(d).

V Conclusion

Data-sharing among machines has great potential to contribute to the wide application of ML methods in manufacturing systems, as it addresses the challenge of data scarcity. However, typical data-sharing approaches consider neither the quality of the shared samples nor the inherent distribution mismatch of the data thus acquired. In this paper, a semi-supervised Active Data-sharing (ADs) framework is proposed to address these problems. ADs selects high-quality data that are both informative to the downstream ML task and likely to follow the target distribution. ADs views the problem as a multi-objective optimization problem and employs a novel joint acquisition function to query the Pareto optimal point that satisfies both objectives simultaneously. The joint score itself is computed from a combination of two individual scores, namely the uncertainty and similarity scores, which are obtained separately via entropy and CL techniques.

Systematic experiments are conducted to evaluate the effectiveness of ADs. The experiments involve real-world in-situ monitoring data from the same additive manufacturing process shared among three machines, two of which are similar (the same model), while one large machine from a different manufacturer is considered dissimilar to purposely introduce distribution mismatch. Results show that with only a fraction of initially annotated data and a few cycles of ADs to extend the annotated dataset, the anomaly detection model outperforms the baseline anomaly detector trained on the fully annotated dataset. Furthermore, this superior performance is achieved in a distribution-aware manner, i.e., without querying any samples from the mismatched distribution for training, thereby effectively addressing the distribution mismatch problem. Further ablation studies confirm that the design philosophy of ADs, the combination of AL and CL, is indeed the factor driving the improved performance: the usage of the similarity score directly reduces the amount of mismatched data in the selection, and smaller amounts of mismatched data further improve the accuracy of the anomaly detection task.

In light of these extensive experiments and their results, it is concluded that ADs effectively solves the problems of data quality and distribution mismatch prevalent in existing data-sharing approaches, and that this work establishes a new baseline for further improvements to data sharing in the industrial domain.

Appendix A Appendix

A-A Concavity of Entropy

Assuming a $C$-way classification problem, where the model's output probabilities are defined in terms of the model parameters as $P_\theta$ and the predicted class is $\hat{y}_i$, the Shannon entropy is defined as:

\begin{align}
H(x) &= \sum_{i}^{C} P_\theta(\hat{y}_i \mid x)\log_b\left(\frac{1}{P_\theta(\hat{y}_i \mid x)}\right)\\
&= -\sum_{i}^{C} P_\theta(\hat{y}_i \mid x)\log_b\left(P_\theta(\hat{y}_i \mid x)\right)
\end{align}

where $i$ indexes the $i^{\text{th}}$ class and $C$ is the total number of classes. In binary classification problems (as in the case presented in this paper), it suffices to consider a single term of the sum, since the entropy is a sum of such terms and a sum of concave functions is concave:

\[
H(x) = -P_{\theta}(\hat{y} \mid x)\,\log_{b}\!\big(P_{\theta}(\hat{y} \mid x)\big)
\]
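As a quick numeric illustration of the definitions above (not taken from the paper's code), the snippet below evaluates the full two-class entropy and the single retained term for a hypothetical softmax output, using base b = 2:

```python
# Tiny numeric illustration of the entropy definitions (illustrative only);
# `probs` is a hypothetical 2-class softmax output for a single sample, base b = 2.
import math

probs = [0.7, 0.3]                                  # P_theta(y_i | x) for C = 2 classes
H_full = -sum(p * math.log2(p) for p in probs)      # full C-way Shannon entropy, ~0.881 bits
p_hat = max(probs)                                  # probability of the predicted class y_hat
H_term = -p_hat * math.log2(p_hat)                  # single retained term, ~0.360 bits
```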

There are several ways to prove the concavity of Shannon entropy. For readability, let $P_{\theta}(\hat{y} \mid x) = p_x$. For a function to be concave, its second derivative with respect to each of its variables must be non-positive over the entire domain of the function:

\begin{align}
H'(x) &= \frac{d}{dp_x}\left(-p_x \log_b p_x\right) \nonumber\\
      &= -\left(\log_b p_x + \frac{p_x}{p_x \ln b}\right) \nonumber\\
      &= -\left(\log_b p_x + \frac{1}{\ln b}\right) \nonumber\\
H''(x) &= \frac{d}{dp_x}\left(-\log_b p_x - \frac{1}{\ln b}\right) \nonumber\\
       &= -\frac{1}{p_x \ln b} \tag{13}
\end{align}

Here, the probability mass $p_x \in [0, 1]$ is non-negative by definition, while $\ln(b)$ is a positive constant for any logarithmic base $b > 1$. Therefore, the expression in Equation (13) is always non-positive, which proves that entropy is a concave function.
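The conclusion can also be checked numerically. The following sketch, assuming base b = 2 and a grid of $p_x$ values strictly inside (0, 1), compares the analytic second derivative in Equation (13) with a central finite difference of the single entropy term and confirms it is negative everywhere:

```python
# Numeric sanity check of Equation (13) (illustrative; assumes base b = 2 and a
# grid of p_x values strictly inside (0, 1)). A central finite difference of the
# single entropy term h(p) = -p * log_b(p) should match -1 / (p * ln b) and be negative.
import numpy as np

b = 2.0
h = lambda p: -p * np.log(p) / np.log(b)
p = np.linspace(0.05, 0.95, 19)
eps = 1e-4

second_fd = (h(p + eps) - 2.0 * h(p) + h(p - eps)) / eps**2
second_analytic = -1.0 / (p * np.log(b))

assert np.allclose(second_fd, second_analytic, atol=1e-3)
assert np.all(second_fd < 0)   # non-positive everywhere, so the entropy term is concave
```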
