ADs: Active Data-sharing for Data Quality Assurance in Advanced Manufacturing Systems

Yue Zhao, Yuxuan Li, Chenang Liu, Yinan Wang

Yue Zhao and Yinan Wang are with the Department of Industrial and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY, 12180. E-mail: {zhaoy23, wangy88}@rpi.edu

Yuxuan Li and Chenang Liu are with the School of Industrial Engineering and Management, Oklahoma State University, Stillwater, OK, 74078. E-mail: {yuxuan.li, chenang.liu}@okstate.edu

Manuscript received XXX; revised YYY. (Corresponding Author: Yinan Wang)
Abstract

Machine learning (ML) methods are widely used in manufacturing applications and usually require a large amount of training data. However, data collection demands extensive cost and time investment in manufacturing systems, and data scarcity is common. With the development of the industrial internet of things (IIoT), data-sharing is widely enabled among multiple machines with similar functionality to augment the dataset for building ML models. Although the machines are designed similarly, distribution mismatch inevitably exists in their data due to different working conditions, process parameters, measurement noise, etc. However, the effective application of ML methods is built upon the assumption that the training and testing data are sampled from the same distribution. Thus, an intelligent data-sharing framework is needed to ensure the quality of the shared data such that only beneficial information is shared to improve the performance of ML methods. In this work, we propose an Active Data-sharing (ADs) framework to ensure the quality of the shared data among multiple machines. It is designed as a self-supervised learning framework integrating the architectures of contrastive learning (CL) and active learning (AL). A novel acquisition function is developed for active learning by integrating an information measure that benefits the downstream task and a similarity score for data quality assurance. To validate the effectiveness of the proposed ADs framework, we collected real-world in-situ monitoring data from three 3D printers, two of which share identical specifications while the third is different. The results demonstrate that our ADs framework intelligently shares monitoring data between the identical machines while excluding data points from the different machine when training ML methods. With the high-quality augmented dataset generated by our proposed framework, the ML method achieves an accuracy of 95.78% while utilizing only 26% labeled data, an improvement of 1.41% over benchmark methods that used 100% labeled data.

Note to Practitioners

This paper is motivated by the need to share data across machines or processes for building machine learning models and by the widespread issue of low-quality data. Low-quality data here refers to data samples collected from machines/processes different from the target one. When building a machine learning model for a target machine/system in a manufacturing setting, such low-quality data will degrade the model performance if they are selected and shared for training. This hinders the direct application of active learning to data sharing, as the classic setting of active learning does not consider the impact of data quality on model performance. The objective of this paper is to develop an intelligent data-sharing framework that simultaneously selects the most informative data points for the downstream tasks and mitigates the impact of low-quality data. We collected real-world in-situ monitoring data of the same additive manufacturing process from three different machines, two of which are more similar to each other than to the third. The proposed method is applied to train an anomaly detection model for those two similar machines, with the entire data pool from all three machines available for selection and annotation. The results demonstrate that our proposed method outperforms the benchmark methods while requiring only 26% of the labeled training samples. In addition, all selected data samples come from the machines with similar conditions, while the data from the dissimilar machine are prevented from misleading the training.

Index Terms:
Multi-objective Optimization, Active Learning, Contrastive learning, Data-sharing, Data Quality Control

I Introduction

Supervised learning has been extensively applied to a wide spectrum of downstream tasks in advanced manufacturing systems [1]. For example, a machine learning (ML) model might be used to detect manufacturing faults or identify anomalous sensor readings [2]. The training process of supervised learning requires large quantities of annotated data to achieve good performance on the required task. Unfortunately, in manufacturing systems, it is often impossible to collect a large amount of annotated data due to the high costs of labor, time, and investment, which leads to a shortage of both overall data samples and annotated samples for supervised learning. Therefore, data-sharing is proposed as a solution that shares data among multiple manufacturing processes with similar functionality, which naturally augments the amount of data available for the ML model [3].

Despite the benefits of data-sharing, there are two gaps when applying it in practice. First, although the shared data increase the size of the data pool, they do not necessarily include the most informative subset of data that benefits the model performance on the downstream task, given a limited budget for data annotation. Second, an important assumption for ensuring the effectiveness of ML methods is that the training and testing data come from the same distribution [4]. However, distribution mismatch naturally exists among the data collected from different manufacturing processes, or even from the same process running on slightly different machines. Although they serve similar functionalities, different processes exhibit both systematic and stochastic differences in working conditions, process parameters, sensors, machine specifications, etc. Therefore, the objective of data-sharing in advanced manufacturing can be summarized as selecting a subset of data samples across multiple manufacturing processes or machines sharing an identical underlying distribution to benefit the performance of downstream tasks under a given annotation budget. In this work, anomaly detection is selected as the downstream task, which is defined as identifying normal or abnormal working conditions from multivariate time-series in-situ sensor data collected from different manufacturing processes. In the following description, we use the term target distribution to refer to the data distribution that the ML methods are designed to model.

The concept of active learning (AL) emerged as a plausible solution to the first gap by prioritizing important data to cut down on annotation costs. Generally, AL identifies a much smaller portion of the data pool for annotation (referred to as the query set) that maximizes the information gain of the ML model. The idea is that training the model on the annotated subset of data achieves a performance comparable to that of a model trained on the full dataset. This is usually conducted by quantifying the information measure of the unlabeled data for the ML model. AL techniques have generally achieved great success in these limited-budget tasks and, in some cases, even improved the performance of models with a much smaller amount of training data [5]. Despite their effectiveness, most existing AL methods also assume that all the data points are sampled from the same underlying distribution, which does not always hold in practice, especially in manufacturing systems. This is problematic because, although the most informative samples would still be selected by the AL method, there is no guarantee that these samples are representative of the target distribution, thus hindering the performance of a supervised model trained on the selected samples. Therefore, it is essential to consider the issue of distribution mismatch when applying AL methods to applications in manufacturing systems.

The distribution mismatch between training and testing data has also been observed in various machine learning applications, including classification, regression [7], etc. Du et al. [6] note several typical ML scenarios with this issue, e.g., medical diagnosis involving unseen lesions and house annotation in remote sensing images containing numerous natural sceneries, where the distribution mismatch degrades model performance. Applying ML in these situations is referred to as learning under class distribution mismatch, which is naturally a challenging task [8, 9, 10]. Therefore, various methods have been developed to resolve the distribution mismatch issue from different perspectives. One popular line of research attacks this issue by correctly detecting the data samples not belonging to the target distribution, which is formally formulated as out-of-distribution (OOD) detection [11, 12, 13]. The idea is to passively eliminate the impact of OOD data by filtering them out in the testing phase. Current OOD detection methods are usually built upon two assumptions: (1) the distribution mismatch mainly exists in the label space, which usually refers to a novel class of object (not existing in the training dataset) appearing in the testing phase of a classification task; this phenomenon is demonstrated in Figure 1(a), in which novel classes in the cream background might appear in the testing phase; and (2) a "clean" dataset following the target distribution is available when training the ML model. Therefore, in the classification task, the problem of OOD detection is usually formulated as identifying the data samples that do not belong to any class seen in training. However, these two assumptions do not hold when sharing data among manufacturing processes because (1) the distribution mismatch is more likely to exist in the input space due to the systematic and stochastic differences among manufacturing processes, which is demonstrated by the gap between the input distributions shown as blue and orange curves in Figure 1(b); and (2) a "clean" dataset following the target distribution is usually unavailable for training due to the data scarcity of each single manufacturing process. Consequently, a new paradigm is needed to resolve the challenge of distribution mismatch in data-sharing during model training.

Figure 1: (a) An instance of class distribution mismatch in the output space [6]. (b) An instance of distribution mismatch in the input space.

With emerging attention to mitigating the issue of distribution mismatch during the training process, researchers have started to tackle this challenge within the cycle of the AL method. The idea is to actively select the subset of data samples from a general data pool to ensure they both follow the same target distribution and benefit the downstream task. Failure-averse AL was first proposed to incorporate physical principles into preparing the data samples for a regression task in the manufacturing system [14]. Chattopadhyay et al. proposed a novel criterion that achieves good generalization performance of a classifier by selecting a set of query samples to minimize the difference in distribution between the labeled and the unlabeled data [15]. An integrated information measure is proposed to score and rank the unlabeled data points such that the top candidates both benefit the downstream regression task and follow the same physical principle. It provides a way to intuitively incorporate the measure of target distribution into the information measure in AL. However, it still cannot be directly applied in a general data-sharing scenario, as the target population usually has no closed-form explicit expression that can be directly exploited. With the recent development of self-supervised learning, contrastive learning (CL) offers a solution for evaluating the similarity of features extracted from different input data [6]. Dissimilar features naturally correspond to data samples from different input distributions. As a self-supervised learning method, CL does not require label information; it is trained by forming data samples into positive pairs (anchor-positive, i.e., following similar distributions) and negative pairs (anchor-negative, i.e., following dissimilar distributions). The intuitive idea of CL is to train the feature extractor by encouraging it to pull positive pairs close to each other in the feature space while pushing negative pairs away from each other. In that sense, a model trained within the CL framework is capable of embedding data samples with similar distributions close to each other in the feature space, while those with dissimilar distributions are embedded farther away. Thus, it models the distributional differences present in the data. It is worth noting that CL still requires some data samples from the target distribution to initiate the training but is much less data-demanding compared with supervised learning methods due to the advantage of the self-supervised scheme [16].

Du et al. [6] first proposed incorporating CL into the cycle of AL to mitigate the distribution mismatch issue in the output space shown in Figure 1(a), which is referred to as class distribution mismatch in the classification problem. However, their approach is not directly applicable to our scenario. As discussed above, the distribution mismatch in data-sharing mainly exists in the input space. Using anomaly detection as an example, the output space across different manufacturing processes uniformly contains two classes, normal and abnormal, while the monitoring data representing the same working condition (i.e., either normal or abnormal) might follow different distributions across different manufacturing processes due to distribution mismatch in the input space.

To fill in the research gap, the proposed framework, termed Active Data-sharing (ADs), views the problem as a multi-objective AL where the first objective is to select highly uncertain samples for the anomaly detection task and the second is to ensure that the selected data matches the target distribution. The approach is based primarily on the observation that AL and CL can achieve these objectives independently. Therefore, combining them might result in a feasible joint solution. That is, in the context of multi-objective optimization (MOO), the resulting solution would be Pareto optimal. In ADs, each objective is optimized by a separate model, wherein the first involves uncertainty sampling in the form of entropy of classifier predictions, and the second utilizes a CL network that learns similarity features within the data to distinguish data from similar and dissimilar machines. These models are trained on a small set of annotated data as initialization and then applied to the unlabeled data to form two sets of scores corresponding to each objective. A joint query strategy is then used to integrate the two scores into a joint set of scores, which is used to select the best samples for the human annotator. Thus, the proposed framework allows high-quality data-sharing that satisfies both objectives.

The major contributions of the work are: (1) A novel Active Data-sharing (ADs) framework is proposed to ensure the quality of industrial data-sharing when subject to data scarcity, distribution mismatch, and low annotation budget. (2) A novel acquisition function is developed for AL under distribution mismatch in the input space by integrating the informativeness and distribution similarity scores. (3) The effectiveness of the framework is evaluated on real-world in-situ monitoring data from additive manufacturing (AM) processes.

The paper is structured in the following manner. Section II reviews the literature related to AL, CL, and AL under distribution mismatch, which refers to applications where AL in isolation fails to produce desirable results. Section III elaborates on the proposed ADs framework for effective annotation and data-sharing under distribution mismatch. Section III-E presents a formal theoretical background and explanation for the mathematical validity of ADs. Section IV evaluates the proposed method in various settings with a comprehensive case study involving a real-world industrial additive manufacturing process by using in-situ monitoring data recorded from three AM processes.

II Literature Review

This section summarizes the literature and recent work on the key domains of AL, CL, and constrained AL, all of which were critical to developing ADs.

II-A Active Learning (AL)

AL offers various strategies for reducing the annotation budget by actively selecting the most informative or valuable data points and feeding them to the annotator for label acquisition [17]. These strategies fall into three categories: query synthesis (conventional [18, 19, 20], generative [21, 22, 23]), sequential sampling [24, 25], and pool-based sampling [26, 27].

The problem presented in this work belongs to pool-based sampling wherein the unlabeled data pool is the collected in-situ monitoring data from three AM processes, and the goal is to evaluate the information measure over the entire unlabeled data pool prior to querying the candidate samples for sharing. Almost all pool-based methods score the samples based on their informativeness. The key idea of these approaches is to select only the most informative subset of samples so as to maximize the information gain for the model. Examples include uncertainty sampling approaches [28, 29, 30, 31, 32, 33, 34], variance reduction [26, 27], query-by-committee [35, 36, 37], etc. A sample’s uncertainty may be quantified in several ways, e.g., marginal uncertainty [38] or, most popularly, entropy [39], which utilizes the posterior probability of the model’s prediction on all samples to select the best one. This is particularly intuitive if the data can be represented as a probabilistic distribution. Recently, Sinha et al. [33] employed variational autoencoders to determine uncertainty by comparing the distribution of the annotated and unlabeled data. A task-agnostic approach by Yoo et al. [34] proposes to use a parametric loss predictor module to predict the loss of the unlabeled sample and, therefore, measure its uncertainty.

Another class of pool-based AL methods is based on representative measures, such as diversity [40, 41, 42], density [43, 44, 45] or a combination of the two [46]. Diversity methods favor exploration and prefer the selection of dissimilar instances, whereas density-based approaches assume that either dense or sparse groups of data points contain the most information. Therefore, they prefer selecting instances that are either similar or dissimilar to several other instances [47]. Hierarchical clustering [43] and density estimation methods [44, 45] are commonly used for density-based approaches. Amongst diversity-based approaches, the classical core-set approach [48] is the most popular – it aims to identify a diverse set of samples, i.e., the core or cover set, by minimizing the distance between each sampled point and the remaining points. The expected result is that a model trained on the core-set is at least equivalent to a model trained on the complete data in terms of performance. The core-set was first adapted to batched inputs for convolutional neural networks (CNNs) by Sener et al. [41]. They proposed a robust k-center algorithm operating on Euclidean distances of the last layer’s feature vectors for a batch of input images. Kim et al. [42] propose a density-aware core-set approach (DACS) to select diverse samples from locally sparse regions, which is useful if the sparse regions contain informative samples compared to densely grouped instances.

An inherent flaw of the representative approaches is that all samples are treated equally in terms of informativeness, which is not necessarily the case. It is also possible, however, that a sample’s representativeness is related to its importance, in which case the representative measures might be more effective. Recently, it has been verified that a combination of both informative and representative approaches results in better performance [44, 49, 50, 51, 52, 53].

However, the underlying assumption of active-learning-based methods, as well as most other methods focusing on maximizing information gain, is that the distribution of the data is the same whether the samples are annotated or not. Upon violation of this assumption, the performance of these methods deteriorates sharply [6]. Recall that mitigating class distribution mismatch in data-sharing tasks is a central focus of this work. Therefore, even though pool-based uncertainty sampling seems to be a viable solution for at least picking informative data points, it is not advisable to apply it unless the mismatched data is somehow identified and excluded.

II-B Contrastive Learning (CL)

CL is a self-supervised, task-agnostic technique for learning effective high-dimensional feature representations of data, usually to the benefit of a downstream task. Practically, CL improves the performance of a model by exposing it to pairs of annotated negative and positive samples in addition to actual samples (anchors) to produce high-dimensional embeddings of the data, with the loss function [54, 55] designed to minimize anchor-positive distances and maximize anchor-negative distances. The labels positive and negative are termed the similarity labels in this context. In practice, CL has been successfully applied to various tasks related to vision, natural language processing, etc. It greatly improves model performance under distribution mismatch or when the inputs share similar features but belong to distinct classes [56].

Owing to its widespread success in self-supervised learning and its task-independent nature, the technique has been applied to several domains and continues to receive significant attention and development. Transformations of the input data to generate positive augmentations of anchors are crucial, and the type of transformation(s) used should be based on the format of the data. For example, SimCLR [57] composes several different types of augmentations via local and global image transformations such as crops, rotations, cutouts, blurs, color distortions, etc., and additionally provides an account of augmentations that, when considered positive, seem to worsen performance. Building on these findings, CSI [58] proposed to term these problematic augmentations distribution-shifting transformations and instead feed them as negative samples to the CL framework. This led to further study on learning shift-invariant feature representations by integrating CL and clustering approaches [59, 60]. The fully unsupervised Winner-Take-All (WTA) autoencoder architecture [61] enforces sparsity in addition to shift-invariance of the learned representations, which can be used to reliably generate in-distribution augmentations. Furthermore, the WTA architecture allows joint back-propagation of gradients through both the encoder and decoder paths, which results in quicker and more stable training.

It is worth noting that, in the absence of similarity labels, CL becomes susceptible to sampling bias, which leads to the inclusion of false negatives because, in the simplest case, negative samples for an anchor are randomly picked from the dataset. While similarity labels are assumed available for a subset of the data in ADs, Chuang et al. [62] and several others [63, 64] propose debiased CL frameworks for specific downstream tasks that focus on minimizing sampling bias in fully unsupervised CL.

II-C Constrained Active Learning

AL approaches generally suffer from the drawback that they are inherently unaware of the underlying distribution of the data and, therefore, cannot be used when the data suffer from class distribution mismatch. At the same time, applying CL approaches to the described problem is not a plausible solution either because (1) being task-agnostic, they do not extract features specifically for improving the performance on the downstream task, and (2) human annotation can resolve the issue of lacking labels but cannot identify distribution mismatch, whereas CL can.

Several papers in recent years have proposed and developed constrained AL techniques in order to leverage the benefits of AL within certain feasible regions. Constrained AL can mitigate the decay in downstream task performance that arises when directly applying AL [6, 65, 14]. Du et al. [6] propose CCAL, an AL framework with a joint query strategy wherein each sample is scored on the basis of two separate scores that are then combined into a joint score used to select the best samples. They also prove a tight upper bound on the consequent error function, termed the CCAL error, and compare the results for the task of image classification under distribution mismatch with two other semi-supervised learning methods (DS3L [10] and UASD [9]). It should be noted that semi-supervised learning differs from AL in that it directly utilizes the unlabeled data during training, and no extra annotation effort is applied to the unlabeled data.

Lee et al. [14] explored incorporating physical constraints into AL to accommodate a more practical context, such as an engineering system, where physical constraints are present and their violation could result in fatal system failure. The authors suggest the existence of a safe region in the sample space and attempt to make AL focus on exploring the safe region. At the same time, it is desirable to explore the safe region as thoroughly as possible with limited data samples. The authors propose the PhyCAL framework, utilizing safe variance reduction and safe region expansion to trade off between information maximization and safe exploration of the design space.

The ADs framework proposed in this paper expands on these methods. More specifically, (1) instead of AL for image classification under distribution mismatch in the output space, ADs focuses on resolving the distribution mismatch in the input space when sharing data over multiple manufacturing processes, and (2) a new joint query strategy is proposed to select samples that simultaneously follow the target distribution and benefit the downstream task. In addition, the convergence of the joint function to the optimal point is proven using methods from convex analysis and Pareto optimality.

III Methodology

Figure 2: Architecture of active data sharing (ADs).

This section comprehensively explores the proposed ADs framework to solve the problem of data sharing among multiple manufacturing processes to jointly select the most informative samples for the downstream task and mitigate distribution mismatch over the input space. First, the general notations and nomenclature are introduced. Then, the calculation of the individual scores and their integration into the joint score is explained. Finally, a theoretical investigation of the Pareto optimality of the proposed method is conducted through convex analysis of the loss functions with the proposed individual and joint acquisition functions. For brevity, the data pre-processing stage and data augmentation model are left for Section IV, where data management is more relevant.

This study involves data-sharing of in-situ monitoring data among multiple manufacturing processes assigned to produce identical objects. Here, we assume that both small and large machines are used and that this difference in machines is the main source of the differences among the manufacturing processes. Monitoring data collected from the small machines, which share the same manufacturer and model number, follow similar distributions, while data from the large machine, which has a different manufacturer and model number, are recognized as following a dissimilar distribution. Consequently, a distribution mismatch is observed in the input space. The downstream task of the study is anomaly detection, which distinguishes between normal and abnormal working conditions of the manufacturing process.

Consider that machines $S_1$ and $S_2$ are similar, providing data that follows a similar distribution. Suppose the data from the dissimilar machine $L_1$, with a mismatched distribution relative to $S_1$ and $S_2$, is additionally introduced to form a combined dataset $\mathcal{D}=(S_1,S_2,L_1)$. Let $\mathcal{L}=\{\text{normal},\,\text{abnormal}\}$ be the labels or classes of each sample associated with the downstream task, regardless of which machine they were sourced from. Let $\mathbf{X}\in\mathbb{R}^{D\times 3}$ be the matrix representation storing all samples in $\mathcal{D}$. Then, $\mathbf{X}^{S_1}$, $\mathbf{X}^{S_2}$, and $\mathbf{X}^{L_1}$ are all disjoint sub-matrices representing the data from each machine. As they are disjoint, the concatenation and disjoint union (or logical XOR) of any combination of these subsets are equivalent, i.e., $\mathbf{X}^{S_1}\frown\mathbf{X}^{S_2}=\mathbf{X}^{S_1}\oplus\mathbf{X}^{S_2}$, etc. Finally, suppose that a human annotator labels a small portion of these samples as normal or abnormal. Thus, there is a small annotated pool of samples in the form of a matrix $\mathbf{D}_{\text{label}}$ and a large pool of unlabeled data $\mathbf{D}_{\text{unlabel}}=\mathbf{X}\setminus\mathbf{D}_{\text{label}}$, such that $|\mathbf{D}_{\text{unlabel}}|\gg|\mathbf{D}_{\text{label}}|$, i.e., the unlabeled portion is much larger than the annotated portion.

The purpose of ADs is to extend AL to form a query set $\mathbf{D}^{Q}$ from the unlabeled data pool $\mathbf{D}_{\text{unlabel}}$, such that each queried sample $\mathbf{x}_i^{Q}$ is not only among the most informative samples in $\mathbf{D}_{\text{unlabel}}$ but also has a high likelihood of belonging to the similar-machine data:

$$\forall\,\mathbf{x}_i^{Q}\in\mathbf{D}^{Q}:\ \underbrace{\text{entropy}(\mathbf{x}_i^{Q})\to 1}_{\text{high uncertainty}}\;\wedge\;\underbrace{\mathbf{x}_i^{Q}\in\left(\mathbf{X}^{S_1}\oplus\mathbf{X}^{S_2}\right)}_{\text{high similarity}}$$

In order to satisfy both objectives, the joint query strategy employed in ADs consists of integrating two separately calculated scores, termed the similarity and uncertainty scores, to form a joint acquisition function $\mathbf{J}$, which scores the unlabeled data based on how well they satisfy both objectives simultaneously. The individual scores are computed based on the feature representations obtained by two separate models. Specifically, the similarity score is obtained via the feature vector extracted from the CL model, and the uncertainty score is obtained via the current anomaly detection model.

For the subsequent subsections, let $Z_s(\cdot)$ and $Z_u(\cdot)\in[0,1]$ denote forward passes through the CL and uncertainty sampling networks that generate feature vectors, and let $d_l$ and $d_u$ be the number of samples in the labeled and unlabeled data pools, respectively.
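To make the notation concrete, the following minimal sketch (in Python/NumPy) assembles a toy version of the combined pool and the labeled/unlabeled split described above; the array sizes, random placeholder signals, and the 60-sample initial annotation budget are hypothetical choices for illustration only.

```python
import numpy as np

# Minimal sketch of the notation above, with hypothetical sample counts.
# Each row of X is one in-situ monitoring sample (three channels here).
X_S1 = np.random.randn(400, 3)    # small machine S1
X_S2 = np.random.randn(400, 3)    # small machine S2
X_L1 = np.random.randn(600, 3)    # dissimilar large machine L1

X = np.concatenate([X_S1, X_S2, X_L1], axis=0)    # combined pool D

# A small initial portion is annotated; the rest stays unlabeled.
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(X), size=60, replace=False)
unlabeled_idx = np.setdiff1d(np.arange(len(X)), labeled_idx)

D_label, D_unlabel = X[labeled_idx], X[unlabeled_idx]
d_l, d_u = len(D_label), len(D_unlabel)           # d_u >> d_l
```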

III-A Learning Similarity Features

The CL model is trained to quantitatively evaluate the similarity between data from the target distribution ($S_1\cup S_2$) and data from a different distribution ($L_1$). In this sense, it can be further exploited to facilitate the objective that queried samples $\mathbf{D}^{Q}$ from the unlabeled pool $\mathbf{D}_{\text{unlabel}}$ closely match the target distribution of data $\mathbf{X}^{S_1}\oplus\mathbf{X}^{S_2}$. In our application, it indicates that for each cycle of AL, all the selected samples would be from similar machines. This naturally leverages the fact that the distribution of in-situ monitoring data varies slightly between the similar agents $(S_1,S_2)$, but considerably in the case of $(S_1,L_1)$ or $(S_2,L_1)$.

To prepare positive and negative pairs of samples for training purposes, the initially sampled and annotated data from $S_1\cup S_2$ are fed into a data augmentation model (trained as part of the preliminary processing detailed in Section IV-B) to generate two positive augmentations based on the augmentation Equations (11), (12). These augmentations are used as the positive samples, while the original data sample is kept as the anchor. Similarly, the initially annotated samples from $L_1$ are randomly chosen as negative samples.

With the prepared positive and negative pairs of samples, the triplet loss [55] is used to train the CL model to better differentiate samples belonging to different distributions. Its input consists of positive and negative samples for the same anchor, where the anchor is a data sample from the target distribution. Additionally, a distance function must be defined to quantify the similarity between positive (anchor versus positive sample) and negative (anchor versus negative sample) pairs. Regardless of the choice of distance function, the triplet loss is optimized toward maximizing the similarity of the positive pair and minimizing the similarity of the negative pair. The expression of the triplet loss is given as follows:

$$l_{\text{triplet}}(\mathbf{x}_a,\mathbf{x}_p,\mathbf{x}_n)=\max\left(f_d(\mathbf{x}_a,\mathbf{x}_p)-f_d(\mathbf{x}_a,\mathbf{x}_n)+\epsilon,\,0\right),\qquad(1)$$

where $\mathbf{x}_p$ is a positive sample, $\mathbf{x}_a$ is the anchor, and $\mathbf{x}_n$ is a negative sample; $f_d$ is a general distance function that can be applied to each pair; and $\epsilon$ is a small positive margin. In this work, cosine similarity is used as the distance function. The triplet loss in Equation (1) for samples $\mathbf{x}_j\in\mathbf{D}_{\text{label}}$, where $j=\{a,p,n\}$, is then:

$$l_{\text{triplet}}(\mathbf{x}_a,\mathbf{x}_p,\mathbf{x}_n)=\max\left(\cos(\mathbf{x}_a,\mathbf{x}_n)-\cos(\mathbf{x}_a,\mathbf{x}_p)+\epsilon,\,0\right)\qquad(2)$$

Assuming a batch of data samples $\mathcal{B}=\{\mathbf{x}_i\}_{i=1}^{B}$, where $B$ is the number of samples in the batch, the loss function in Equation (2) is updated as the average loss over all the samples in the batch:

$$l(\mathcal{B})=\frac{1}{B}\sum_{i=1}^{B}l_{\text{triplet}}(\mathbf{x}_{ai},\mathbf{x}_{pi},\mathbf{x}_{ni})\qquad(3)$$
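As a concrete illustration, the sketch below (Python/NumPy) evaluates the cosine-similarity triplet loss of Equations (2)-(3) on precomputed embedding vectors (e.g., outputs of $Z_s$); the margin value eps is a hypothetical choice, and in actual training the loss would be back-propagated through the CL network by a deep-learning framework.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def triplet_loss(x_a, x_p, x_n, eps=0.2):
    # Equation (2): hinge on the negative-pair vs. positive-pair similarity gap.
    return max(cosine(x_a, x_n) - cosine(x_a, x_p) + eps, 0.0)

def batch_triplet_loss(anchors, positives, negatives, eps=0.2):
    # Equation (3): average the triplet loss over the B triplets in a batch.
    return float(np.mean([triplet_loss(a, p, n, eps)
                          for a, p, n in zip(anchors, positives, negatives)]))
```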

The contrastive model needs to be trained on a small initial subset of annotated data to be capable of evaluating the similarity among incoming sample pairs. For each unlabeled sample $\mathbf{x}_i^{U}\in\mathbf{D}_{\text{unlabel}}$, where $i=\{1,\ldots,d_u\}$, the corresponding similarity score $s_i$ is obtained as follows: (1) all annotated samples $\mathbf{D}_{\text{label}}\in\mathbb{R}^{d_l\times 3}$ are passed through the trained contrastive model, generating an array of feature vectors $\mathbf{F}^{L}=Z_s(\mathbf{D}_{\text{label}})\in\mathbb{R}^{d_l\times d}$, where row $j$ denotes a feature vector $\mathbf{f}_j^{L}\in\mathbb{R}^{d}$; (2) the unlabeled sample $\mathbf{x}_i^{U}$ is passed through the trained contrastive model, generating a $d$-dimensional feature vector $\mathbf{f}_i^{U}=Z_s(\mathbf{x}_i^{U})$; (3) the pairwise cosine similarity between $\mathbf{f}_i^{U}$ and $\mathbf{F}^{L}$ is computed to obtain a vector of cosine similarities $\mathbf{c}$ with $d_l$ elements, where an arbitrary element is $\cos(\mathbf{f}_j^{L},\mathbf{f}_i^{U})\in[-1,1]$; (4) the maximum value in $\mathbf{c}$ is identified as the final similarity score $s_i=\max\mathbf{c}$ for the incoming sample $\mathbf{x}_i^{U}$, which is expressed as

$$s_i=\max_{j=\{1,\ldots,d_l\}}\cos\left(Z_s\left(\mathbf{x}_j^{L}\right),Z_s\left(\mathbf{x}_i^{U}\right)\right)\qquad(4)$$

A higher similarity score corresponds to an increased probability of the unlabeled data point belonging to the target distribution. Notice that this score involves both the annotated data pool and the unlabeled data in the forward pass. To get the complete vector, steps 2-4 are repeated for all remaining samples in the unlabeled data pool $\mathbf{D}_{\text{unlabel}}$, resulting in a vector of contrastive similarity scores $\mathbf{S}'$ through iterative concatenation $\mathbf{S}':=\mathbf{S}'\frown s_i,\;\forall i$.
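A vectorized sketch of steps (1)-(4) is given below; it assumes $Z_s$ is available as a callable that maps an array of samples to an array of $d$-dimensional feature vectors, and it produces the whole score vector $\mathbf{S}'$ at once rather than through the per-sample concatenation described above.

```python
import numpy as np

def similarity_scores(Z_s, D_label, D_unlabel):
    # Step (1): features of all annotated samples, shape (d_l, d).
    F_L = np.asarray(Z_s(D_label))
    # Step (2): features of all unlabeled samples, shape (d_u, d).
    F_U = np.asarray(Z_s(D_unlabel))
    # L2-normalize rows so that dot products equal cosine similarities.
    F_L = F_L / (np.linalg.norm(F_L, axis=1, keepdims=True) + 1e-12)
    F_U = F_U / (np.linalg.norm(F_U, axis=1, keepdims=True) + 1e-12)
    # Step (3): pairwise cosine similarities, shape (d_u, d_l).
    C = F_U @ F_L.T
    # Step (4) / Equation (4): s_i is the maximum similarity to any labeled sample.
    return C.max(axis=1)
```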

The CL model thus evaluates the similarity of semantic features among data points and assigns elevated similarity scores to data originating from the shared target distribution. This enables us to selectively choose data points within the similar distribution (the smaller machines). This selection is visually represented in Figure 3(a) by the mauve region.

III-B Uncertainty Sampling

The entropy of the model's prediction on an unlabeled sample can be used to compute the informativeness of that sample. Entropy sampling is a popular AL technique [17] under the category of uncertainty sampling approaches. Samples exhibiting high uncertainty are ideal candidates for annotation as they lie near the decision boundary and thus contain potentially valuable information for the model to learn. For a binary classification problem where $x$ is a sample and $\hat{y}$ is the predicted class, entropy is defined as:

$$H(x)=P_{\theta}(\hat{y}\mid x)\log\left(\frac{1}{P_{\theta}(\hat{y}\mid x)}\right)=-P_{\theta}(\hat{y}\mid x)\log\left(P_{\theta}(\hat{y}\mid x)\right)\qquad(5)$$

The computation of the vector of uncertainty scores $\mathbf{U}\in[0,1]^{d_u}$ is now discussed. Naturally, it involves the uncertainty model described above, trained on the same initial subset of annotated data as the CL model. Consider a sample from the unlabeled data pool $\mathbf{x}_i^{U}$, where $i=\{1,\ldots,d_u\}$. Let $c_i^{U}$ represent its predicted class label $\operatorname{argmax}_k Z_u(\mathbf{x}_i^{U})$ for $k\in\mathcal{L}$. In addition, suppose $p_{i,k}^{U}$ represents the normalized probability that the sample $\mathbf{x}_i^{U}$ belongs to class $k$, so that $p_{i,k}^{U}=p(g_i^{U}=k\mid\mathbf{x}_i^{U})$, where $g_i^{U}$ denotes the predicted class of sample $i$. The unlabeled sample $\mathbf{x}_i^{U}=(x_i,y_i,z_i)$ is passed through the trained classifier model to generate the features $Z_u$, and the Shannon entropy $H$ is calculated for $\mathbf{p}_i^{U}$, which gives the uncertainty score $u_i$:

$$u_i=H\left(Z_u\left(\mathbf{x}_i^{U}\right)\right)=H(\mathbf{p}_i^{U})=-\sum_{k}p_{i,k}^{U}\log_2 p_{i,k}^{U}\qquad(6)$$

The vector of uncertainty scores is also generated iteratively through $\mathbf{U}:=\mathbf{U}\frown u_i,\;\forall i$. Uncertainty sampling thus aims to select instances that improve model performance by focusing on the most informative or uncertain data points. The uncertainty score guides the selection of data points within the blue region depicted in Figure 3(b). This strategy is particularly useful in scenarios where labeling data is expensive or time-consuming, as it helps make the most of the limited labeled data available.
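A minimal sketch of Equation (6) is shown below; it assumes the anomaly detection classifier already outputs normalized class probabilities $p_{i,k}$ for every unlabeled sample (e.g., via a softmax layer), so only the entropy computation itself is illustrated.

```python
import numpy as np

def uncertainty_scores(P_unlabel):
    # P_unlabel: array of shape (d_u, K) whose rows are the normalized class
    # probabilities p_{i,k} predicted for each unlabeled sample.
    P = np.clip(P_unlabel, 1e-12, 1.0)        # guard against log2(0)
    # Equation (6): Shannon entropy (in bits) of each row.
    return -(P * np.log2(P)).sum(axis=1)
```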

III-C Joint Query Strategy

Figure 3: Visualizing divisions of the sample space when the ADs framework is applied to the unlabeled data pool. (a) The similarity score $\mathbf{S}$ based on the contrastive model forms a horizontal orange decision boundary in the sample space, addressing the distribution mismatch in the data by separating the mismatched data of the large machine (circles) from the desired data of the small machines (triangles, mauve region). (b) The uncertainty score $\mathbf{U}$ based on the entropy of the classifier scores (normal/abnormal) forms the blue region containing the desired subset of highly informative samples. The vertical blue decision boundary is generated by the binary classifier. (c) The best joint scores in $\mathbf{J}=\mathbf{S}\odot\mathbf{U}$ identify the samples that are both highly informative for the downstream task and follow the desired similar data distribution from the two small machines (green region).

In the framework of AL, the idea is to iteratively select and annotate a certain number of informative samples in each cycle to improve the performance of the downstream task. In our problem, the unlabeled samples are evaluated from two aspects: (1) the closeness to the target distribution, evaluated by the similarity score, and (2) the potential improvement to the anomaly detection task, evaluated by the uncertainty score. Guided by the goal of preventing the distribution-mismatch issue in data sharing, the joint query strategy is designed to select the samples with a high uncertainty score (benefiting the downstream task) conditioned on their being close to the target distribution (having a high similarity score).

Therefore, the similarity score $\mathbf{S}'$ is first binarized, setting the samples with the top $w\%$ similarity scores to 1 and the remaining samples to 0. The binarized similarity score is denoted as $\mathbf{S}$. The binarization is repeated in each cycle of the proposed ADs; therefore, the value of $w$ is determined to make sure that there are enough samples to be queried. This binarization of $\mathbf{S}'$ is also motivated by the fact that, in the case study, the number of negative samples $\ell(\mathbf{X}^{L_1})$ might be larger than the number of positive samples $\ell(\mathbf{X}^{S_1}\oplus\mathbf{X}^{S_2})$ in the given dataset. Given this situation, we tend to set the value of $w$ as small as possible to ensure that only samples close to the target distribution (with the highest similarity scores) are selected. With these two criteria, the value of $w$ is determined by ensuring there are just enough samples with a binarized similarity score of 1 to be queried. This work uses 25% of the unlabeled data as the choice of $w$, which consistently satisfied the criteria across the various test settings.

The binarization described above is formally defined as a function $\operatorname{locmax}_n$ that maps an $m$-dimensional vector $\mathbf{X}$ to another $m$-dimensional mask vector $\mathbf{Y}$, such that each element of $\mathbf{Y}\in\{0,1\}$ denotes whether the corresponding element of $\mathbf{X}$ is part of the top-$n$ maximum set (i.e., 1) or not (i.e., 0). For example, $y_1 = 1$ indicates that $x_1$ is among the top-$n$ maximum values of $\mathbf{X}$; otherwise $y_1 = 0$. Then, for $i = 1, 2, \ldots, m$:

\begin{align}
\operatorname{locmax}_n :\; \mathbf{X}\in\mathbb{R}^m \mapsto \mathbf{Y}\in\mathbb{R}^m \;\big|\; y_i = 1 \text{ iff } x_i \in \max\nolimits_n(\mathbf{X}) \text{ else } y_i = 0 \tag{7}
\end{align}

where $\max_n$ is simply the vector of the $n$ maximum values of a target vector. It can be formally defined as:

\begin{align}
\max\nolimits_n :\; & \mathbf{X}\in\mathbb{R}^m \mapsto \mathbf{Y}\in\mathbb{R}^n \;\big|\; n < m \tag{8}\\
& \mathbf{X}_0 = \emptyset \nonumber\\
& \mathbf{X}_{n+1} = \mathbf{X}_n \cup \{\max(\mathbf{X}\setminus\mathbf{X}_n)\} \text{ for } n \geq 0 \nonumber
\end{align}

The binarized similarity score is then given as $\mathbf{S} = \operatorname{locmax}_{\lfloor k\times d_u\rfloor}(\mathbf{S}')$. The joint score can now be computed. Observe that there are two $d_u$-dimensional vectors, $\mathbf{S}$ and $\mathbf{U}$, representing the similarity and uncertainty scores for the unlabeled data pool. The joint score vector $\mathbf{J}$ is computed as their Hadamard product:

\begin{align}
\mathbf{J} &= \mathbf{S}\odot\mathbf{U} \tag{9}\\
\left(J_1,\ldots,J_{d_u}\right) &= \left(s_1,\ldots,s_{d_u}\right)\odot\left(u_1,\ldots,u_{d_u}\right) \tag{10}
\end{align}

Given that $s_i\in\{0,1\}$ and $u_i\in[0,1]$, it follows that $J_i = s_i u_i\in[0,1]$. The higher the joint score $J_i$ of a sample, the more likely it is both similar to the dataset $X^{S_1}\oplus X^{S_2}$ and of high entropy. Thus, by selecting samples with high joint scores, we effectively query samples for annotation that (1) are highly likely to belong to the desired similar-machine distribution, and (2) benefit the downstream anomaly detection task, i.e., they lie close to the classification boundary. Our objective is to integrate the architectures of CL and AL, capitalizing on the unique strengths of each approach. In CL, we measure the similarity of semantic features among data points, assigning high similarity scores to data from the same underlying distribution. This similarity score is then employed to selectively pick data points from the small machines (the mauve region in Figure 3a) that share a similar distribution. In addition, AL utilizes entropy to score the least certain instances for data point selection (the blue region in Figure 3b). This uncertainty measure proves valuable for distinguishing between normal and abnormal data during classification and is particularly useful in scenarios where labeling data is expensive or time-consuming. Samples with both high similarity scores and high uncertainty scores are then selected for labeling (the green region in Figure 3c).
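To make the joint query strategy concrete, the sketch below assembles the top-$w\%$ binarization of the similarity scores, the Hadamard product of Equation (9), and the per-cycle query of the highest-scoring samples. The function names, the fraction `w`, and the batch size `t` are illustrative placeholders rather than the exact implementation.

```python
import numpy as np

def locmax(scores: np.ndarray, n: int) -> np.ndarray:
    """Binary mask marking the n largest entries of `scores` (cf. Eqs. (7)-(8))."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-n:]] = 1.0
    return mask

def joint_query(S_prime: np.ndarray, U: np.ndarray, w: float, t: int) -> np.ndarray:
    """Indices of the t samples to annotate in the current AL cycle.

    S_prime: raw similarity scores of the unlabeled pool (one per sample).
    U      : entropy-based uncertainty scores of the same pool.
    w      : fraction of the pool kept by the similarity filter (e.g., 0.25).
    t      : number of samples queried per cycle (should not exceed w * pool size).
    """
    S = locmax(S_prime, int(np.floor(w * len(S_prime))))  # binarized similarity score
    J = S * U                                             # Hadamard product, Eq. (9)
    return np.argsort(J)[-t:]                             # highest joint scores

# Toy usage: six unlabeled samples, query two of them.
queried = joint_query(np.array([0.9, 0.2, 0.8, 0.1, 0.95, 0.3]),
                      np.array([0.7, 0.9, 0.6, 0.99, 0.4, 0.5]), w=0.5, t=2)
```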

III-D Main Active Learning Cycle

Algorithm 1: Active Data-sharing
Input: Small-machine labeled data $D^S_{\text{label}}$: $\{S_1^X, S_1^Y, S_2^X, S_2^Y\}$; large-machine labeled data $D^L_{\text{label}}$: $\{L_1^X, L_1^Y\}$; unlabeled data pool $D_{\text{unlabel}}$
Output: Trained classifier $\theta_u$
1. Train the similarity model $\theta_s$ with $\{D^S_{\text{label}}, D^L_{\text{label}}\}$
2. for each iteration of the AL cycle do
3.   Train the classifier model $\theta_u$ with $D^S_{\text{label}}$
4.   Calculate the uncertainty feature of $D_{\text{unlabel}}$ using $\theta_u$: $z_u(D_{\text{unlabel}}) = \theta_u(D_{\text{unlabel}})$
5.   Calculate $S_{\text{uncertainty}}(D_{\text{unlabel}})$ using Eq. (6)
6.   Calculate the similarity features of $D^S_{\text{label}}$ and $D_{\text{unlabel}}$ using $\theta_s$: $z_s(D^S_{\text{label}}) = \theta_s(D^S_{\text{label}})$, $z_s(D_{\text{unlabel}}) = \theta_s(D_{\text{unlabel}})$
7.   Calculate $S_{\text{similarity}}(D_{\text{unlabel}})$ using Eq. (9)
8.   Calculate the joint score $J_{\text{unlabel}}$ using Eq. (10) with $S_{\text{uncertainty}}(D_{\text{unlabel}})$ and $S_{\text{similarity}}(D_{\text{unlabel}})$
9.   Query the samples with the maximum $J_{\text{unlabel}}$ into the query set $X_{\text{Queried}}$
10.  Request the oracle to annotate all labels in $Y_{\text{Queried}}$
11.  $D^S_{\text{labeled}} \leftarrow D^S_{\text{labeled}} \cup \{X_{\text{Queried}}, Y_{\text{Queried}}\}$
12. end for
13. return the trained classifier $\theta_u$

Algorithm 1 for our proposed ADs framework is presented in this subsection and also shown in Figure 2. Initiated with Latin Hypercube Sampling (LHS), we select the initial labeled dataset from the two small machines, $D^S_{\text{label}}$: $\{S_1^X, S_1^Y, S_2^X, S_2^Y\}$, and the large machine, $D^L_{\text{label}}$: $\{L_1^X, L_1^Y\}$. This ensures that the sampled subset represents the genuine variability of the initial dataset. Both pre-trained CL models are trained using this initial labeled data. The steps are summarized as follows:

  1. In the AL phase, the classifier is trained using the initial labeled data.

  2. Leveraging the existing classifier and the pre-trained CL model, we extract the similarity feature and the uncertainty feature for both labeled and unlabeled data.

  3. The features are used to calculate the similarity score with Equation (4) and the uncertainty score with Equation (6) for the unlabeled data $D_{\text{unlabel}}$, which are then combined using Equation (9).

  4. The data annotator (oracle) labels the queried data $\{X_{\text{Queried}}, Y_{\text{Queried}}\}$ with the highest joint scores.

  5. The labeled dataset is updated by incorporating the queried data; the annotated data pool grows while the unlabeled data pool shrinks by the same amount.

  6. Steps 1-5 are repeated cyclically. Notably, the classifier model evolves with each cycle, updating dynamically with the new labeled data from the oracle.
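A compact sketch of the resulting loop (Algorithm 1) is given below. The model objects and helper callables (`train_classifier`, `similarity_scores`, `uncertainty_scores`, `oracle`) are placeholders standing in for the CNN classifier, the pre-trained CL model, and the human annotator described above; they are assumptions for illustration only.

```python
import numpy as np

def active_data_sharing(D_label, D_unlabel, theta_s, train_classifier,
                        similarity_scores, uncertainty_scores, oracle,
                        n_cycles: int, t: int, w: float = 0.25):
    """One possible realization of the ADs loop in Algorithm 1.

    D_label   : dict with 'X' and 'Y' arrays of annotated small-machine data.
    D_unlabel : array of unlabeled samples shared by all three machines.
    theta_s   : pre-trained contrastive (similarity) model.
    oracle    : callable that returns labels for the queried samples.
    """
    theta_u = None
    for _ in range(n_cycles):
        theta_u = train_classifier(D_label['X'], D_label['Y'])        # retrain classifier
        U = uncertainty_scores(theta_u, D_unlabel)                     # entropy per sample
        S_prime = similarity_scores(theta_s, D_label['X'], D_unlabel)  # cosine similarities
        S = np.zeros_like(S_prime)                                     # binarize the top-w%
        S[np.argsort(S_prime)[-int(w * len(S_prime)):]] = 1.0
        J = S * U                                                      # joint score, Eq. (9)
        q = np.argsort(J)[-t:]                                         # samples to query
        X_q, Y_q = D_unlabel[q], oracle(D_unlabel[q])                  # oracle annotation
        D_label['X'] = np.concatenate([D_label['X'], X_q])
        D_label['Y'] = np.concatenate([D_label['Y'], Y_q])
        D_unlabel = np.delete(D_unlabel, q, axis=0)                    # shrink the pool
    return theta_u
```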

III-E Pareto Optimality of ADs

It is evident that the described problem can be modeled as a multi-objective optimization problem (MOO) due to the distinct nature of the two objectives involved. Such problems typically have multiple optimal solutions, and it becomes necessary to define the concept of Pareto optimality.

Definition 1 (Pareto Optimal [66]).

A point $x^*\in X$ is Pareto optimal iff there does not exist another point $x\in X$ such that $F(x)\leq F(x^*)$ and $F_i(x) < F_i(x^*)$ for at least one function $F_i$.

Definition 2 (Weakly Pareto Optimal [66]).

A point $x^*\in X$ is weakly Pareto optimal iff there does not exist another point $x\in X$ such that $F(x) < F(x^*)$.

For the first objective, i.e., differentiating data samples from different distributions, the acquisition function is a similarity score $\mathbf{S}$ based on the cosine similarity metric. It is important to note that this could be any similarity metric, and the approach would remain valid. For the second objective, i.e., deciding the informativeness of samples, uncertainty sampling [17] is leveraged to compute the uncertainty score $\mathbf{U}$. By harmonizing these distinct objective functions into an integrated criterion that represents both objectives, the queried data $\mathbf{D}^Q$ satisfying Pareto optimality can be guaranteed to meet both objectives. To that end, the joint score $\mathbf{J}$ in Equation (9) is proposed, forming a scalarized combination of the two individual acquisition functions via element-wise multiplication. In effect, $t$ target samples are selected for the annotator based on the calculated $\mathbf{J}$ at the end of each iteration of the active learning. In the remainder of this section, we prove that the Pareto optimum of the two objectives in our formulation exists and can be achieved by optimizing the proposed joint acquisition function $\mathbf{J}$.

Convex Analysis of Joint Acquisition Function: The joint acquisition function $\mathbf{J}$ is defined as the Hadamard product between the similarity score $\mathbf{S}$ and the uncertainty score $\mathbf{U}$. The definition of the similarity score (Equation (4)) indicates that for the $i$-th sample in the unlabeled data pool, its similarity score $s_i$ is the maximum of the cosine similarities to all annotated samples. Once $s_i$ has been computed for all samples to form the vector of cosine similarities $\mathbf{S}'$, it is further processed by setting the top $k\%$ of scores to one and the rest to zero, resulting in a sparse vector $\mathbf{S}$. Because of this, the similarity score vector is essentially an indicator vector that selects the subset of unlabeled samples close to the target population. In addition, the uncertainty score vector $\mathbf{U}$ (Equation (6)) is defined upon entropy, which is a concave function (the proof is detailed in Appendix A-A). Recalling that the similarity score vector $\mathbf{S}$ is a sparse indicator vector, the Hadamard product representing the joint score function $\mathbf{J} = \mathbf{S}\odot\mathbf{U}$ essentially reduces to the entropy values of the selected samples. In other words, the joint score $J_i = s_i u_i$ of a sample is the product of a scalar with a concave function. Therefore, the joint acquisition function is also concave, given the following Theorem 1.

Theorem 1 (Canonical Combinations of Convex Functions).

Consider a set of convex functions $f_1,\ldots,f_n$ mapping $\mathbf{x}\to\mathbb{R}$, and let $\alpha_1,\ldots,\alpha_n$ be a set of non-negative scalars; then

\[
g(\mathbf{x}) = \sum_{i=1}^{n}\alpha_i f_i = \alpha_1 f_1 + \alpha_2 f_2 + \ldots + \alpha_n f_n
\]

is also convex. Furthermore, if any $f_i$ is strictly convex, then $g(\mathbf{x})$ is strictly convex.

The concavity of the joint acquisition function indicates the existence of its global optimum. Next, we demonstrate the existence of a Pareto optimal solution in the MOO setting.

Theorem 2 (Sufficient Condition for Pareto Optimality [66]).

Let $F\in Z$, $x^*\in X$, and $F^* = F(x^*)$. Let a scalar global criterion $F_g(F): Z\rightarrow\mathbb{R}$ be differentiable with $\nabla_F F_g(F) > 0\;\forall F\in Z$. Assume $F_g(F^*) = \min\{F_g(F) \mid F\in Z\}$. Then $x^*$ is Pareto optimal.

Define $Z = \{\mathbf{S}, \mathbf{U}\}$ and $x^*\in X$, and let the joint acquisition function be the scalar global criterion $\mathbf{J}: Z\rightarrow\mathbb{R}$. We need to prove that $\mathbf{J}$ increases monotonically with respect to $\mathbf{S}$ and $\mathbf{U}$, i.e., $\nabla_F\mathbf{J} > 0$ for all $F\in Z$. Since $\partial\mathbf{J}/\partial\mathbf{S} = \mathbf{U}$ and the uncertainty score $\mathbf{U}$ is based on entropy, $\partial\mathbf{J}/\partial\mathbf{S} > 0$. Similarly, since $\partial\mathbf{J}/\partial\mathbf{U} = \mathbf{S}$ and the similarity score $\mathbf{S}$ is designed based on the triplet loss with cosine similarity as the distance function, $\partial\mathbf{J}/\partial\mathbf{U} > 0$ intuitively. Therefore, $\nabla_F\mathbf{J} > 0$ for all $F\in Z$ for our proposed joint acquisition function. Following Theorem 2, the optimal solution of the global function $\mathbf{J}$ is sufficient for achieving the Pareto optimality of the separate objectives, since $\mathbf{J}$ increases monotonically with respect to each objective function.

Given the existence of the optimum of the joint acquisition function and the Pareto optimality of the separate objectives, we further demonstrate that the design of AL converges to the optimal point as the number of queried samples increases. Referring to Equation (9), the similarity scores serve as a filter that keeps the uncertainty scores only for those unlabeled samples close to the target population. This essentially reduces the problem to a classical AL setting in which no distribution mismatch exists (data samples from different distributions are eliminated). Raj et al. [67] show that uncertainty sampling in classic AL converges to the optimal predictor for binary classification and provide a non-asymptotic convergence rate of order $O(1/n)$, where $n$ is the number of iterations of the AL.

IV Case Study

This section presents the motivation, core design choices, as well as a comprehensive overview of the experiments conducted as a case study to verify the effectiveness of the proposed ADs framework in industrial tasks.

IV-A Experimental Setup

ML applications in industrial tasks are generally hindered by challenges including data scarcity, limited annotation budgets, and lack of prior knowledge of the data distribution and informativeness, which can result in shared data that is misaligned with the target distribution and uninformative. Data sharing is commonly proposed as a solution, but it only addresses the problems of data scarcity and, to an extent, the annotation budget. The issues of distribution mismatch and uninformativeness of the shared data remain prevalent, thereby degrading the performance of the downstream task. It follows that any framework attempting to solve these issues must further ensure that (1) the distribution of the selected data matches the target distribution, i.e., the distribution mismatch of the shared data is minimized, and (2) the selected data contributes valuable information to the model handling the downstream task.

The problem scenario for the case study is carefully designed to simulate these issues as they appear in the real world. Data is collected for the same additive manufacturing process from three machines, with the ultimate goal of enabling data sharing to enhance the downstream task of anomaly detection in the manufactured object. The particular anomaly to detect is a manufacturing fault, like the one shown in Figure 5. Research has shown that the monitoring data can reflect this anomaly [68]. Of the three machines, two are smaller in size and of the same make and model, and the third is a larger model from a separate manufacturer, as evident from Figure 4. In this section, the two similar small machines are termed $S = \{S_1, S_2\}$, and the dissimilar large machine is termed $L$.

Figure 4: Experimental platform using Fused Filament Fabrication (FFF) machines: (a) two small printers of the same make and model, and (b) one large printer from a separate manufacturer [3].

The purpose of having two similar machines and one dissimilar machine is to explicitly introduce the distribution mismatch of a typical data-sharing problem. Specifically, the case study considers the scenario where all three machines are tasked with manufacturing the same object, a cube with dimensions $2\times2\times2\text{ cm}^3$. It is natural to treat the monitoring data obtained from the similar machines as following a similar distribution, in contrast to that from the dissimilar machine, owing to differences in build design, software, specifications, and other manufacturing factors. Therefore, distribution mismatch needs to be considered when sharing the monitoring data among these three machines. As discussed earlier, this is the issue that the contrastive model in ADs addresses. Furthermore, given that the data is scarce and annotations are costly in terms of manual time and effort, it becomes critical to have only the most informative samples annotated for the downstream task. The uncertainty sampling model of ADs is well-suited for this purpose. The model requires a classifier to identify normal and abnormal data, which, in this work, is based on a Convolutional Neural Network (CNN) architecture with cross-entropy loss and Adam optimization [69].
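Since the paper only states that the classifier is a CNN trained with cross-entropy loss and the Adam optimizer, the following PyTorch sketch shows one plausible 1D-CNN for the in-situ monitoring sequences; the layer sizes, the number of input channels, and the pooling choices are assumptions, not the architecture actually used.

```python
import torch
import torch.nn as nn

class AnomalyCNN(nn.Module):
    """Minimal 1D-CNN binary classifier (normal vs. abnormal) for monitoring sequences."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, 2)   # two logits: normal / abnormal

    def forward(self, x):              # x: (batch, channels, sequence_length)
        return self.head(self.features(x).squeeze(-1))

model = AnomalyCNN()
criterion = nn.CrossEntropyLoss()      # cross-entropy loss, as stated in the paper
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```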

IV-B Data Preprocessing and Augmentation

In the ADs framework, the initial phase involves preprocessing the data and training the augmentation model. This model is specifically designed to generate positive pairs for time-sequence data, serving as the training input for the CL model. The entire dataset is min-max normalized to ensure consistent scaling throughout. An initial subset of the complete data (e.g., 20%) is then sampled for manual annotation. Rather than using random sampling, ADs employs Latin Hypercube Sampling (LHS) [70, 71] to ensure that the sampled data is uniformly distributed within the parameter space and therefore representative of the real variability of the distribution. Following this, the Winner-Take-All autoencoder of [61] is trained in an unsupervised fashion to learn shift-invariant, sparse, high-dimensional latent embeddings of the normalized data. Once trained, transformed augmentations of the input data are generated by passing the input through the encoder to obtain its latent embedding, manipulating the embedding in the latent space, and finally reconstructing the input by decoding the modified latent embedding. The reconstruction serves as the transformed augmentation. ADs defines two augmentation transformations in this way:

Additive Gaussian Noise: The first augmentation $\mathbf{y}_{a_1}$ of the $n$-dimensional latent space output embedding $\mathbf{y}_e$ is obtained by adding a Gaussian noise vector $\mathbf{n}$ to $\mathbf{y}_e$:

\begin{align}
\mathbf{y}_{a_1} &= \mathbf{y}_e + \mathbf{n} \tag{11}\\
\mathbf{n} &= \frac{r\sigma_e}{5}\times\mathbf{M} \nonumber
\end{align}

where $r\sim\mathcal{N}(0,1)$, $\sigma_e$ indicates the standard deviation of the latent space output, and $\mathbf{M}$ is a 2D binary mask array based on $\mathbf{y}_e$:

\[
\mathbf{M} = \begin{cases} 1 & \text{if } y > \sigma_e\\ 0 & \text{if } y \leq \sigma_e \end{cases},\quad \forall\, y\in\mathbf{y}_e
\]

In short, $\mathbf{M}$ enforces that noise is only added to those components of the output embedding $\mathbf{y}_e$ that lie outside one standard deviation of the vector. This ensures that the generated augmentation exploits the variability of the embedding as much as possible by adding noise only to the components that deviate significantly from the mean. The generated augmentation thus retains the components most closely clustered around the mean, preserving the most characteristic features. This is desirable because the augmentations are meant to be used as positive examples while training the contrastive network; consequently, they should not deviate too much from the actual sample (i.e., the anchor), mitigating the risk of generating an out-of-distribution (OOD) augmentation.

Embedding Deviation Thresholding: The second augmentation $\mathbf{y}_{a_2}$ is generated by multiplying $\mathbf{y}_e$ with its absolute value $|\mathbf{y}_e|$ to produce a resultant vector $\mathbf{y}_e'$. Each component of $\mathbf{y}_e'$ is then compared with $\sigma_e$, such that all components greater than $\sigma_e$ are set to 1 and the rest to 0. Mathematically:

\begin{align}
\mathbf{y}_e' &= \mathbf{y}_e \odot |\mathbf{y}_e| \nonumber\\
\mathbf{y}_{a_2} &= \begin{cases} 1 & \text{if } y > \sigma_e\\ 0 & \text{if } y \leq \sigma_e \end{cases},\quad \forall\, y\in\mathbf{y}_e' \tag{12}
\end{align}

This augmentation is similar to the mask in the first one, with the main differences being that no noise is added, and the threshold is applied to the squared components of the output embedding, thereby exaggerating the distance, or lack thereof, of each component from the mean.
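A minimal numpy sketch of the two latent-space transformations is given below, assuming the latent embedding `y_e` has already been produced by the Winner-Take-All encoder and that the encoder/decoder calls in the comment are placeholders; computing $\sigma_e$ as the per-embedding standard deviation is also an assumption.

```python
import numpy as np

def augment_gaussian(y_e: np.ndarray) -> np.ndarray:
    """First augmentation (Eq. (11)): masked additive Gaussian noise in latent space."""
    sigma_e = y_e.std()                          # standard deviation of the embedding
    r = np.random.randn()                        # r ~ N(0, 1)
    mask = (y_e > sigma_e).astype(float)         # noise only where y > sigma_e
    return y_e + (r * sigma_e / 5.0) * mask

def augment_threshold(y_e: np.ndarray) -> np.ndarray:
    """Second augmentation (Eq. (12)): threshold the signed squared embedding."""
    sigma_e = y_e.std()
    y_signed_sq = y_e * np.abs(y_e)              # y_e elementwise-times |y_e|
    return (y_signed_sq > sigma_e).astype(float)

# The decoded reconstructions of the augmented embeddings act as positive pairs
# for the contrastive model, e.g. (encoder/decoder are placeholders):
#   positive_1 = decoder(augment_gaussian(encoder(x)))
#   positive_2 = decoder(augment_threshold(encoder(x)))
```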

In order to evaluate the effectiveness of ADs and to perform a comparative analysis against other settings and methods, data from the additive manufacturing machines were acquired as they printed a cube with a small square void as the anomaly on one of the faces, as shown in Figure 5. It is important to understand the data handling procedure to ensure the quality of the evaluation, so a few terms are now introduced. Source identification is defined as the task of identifying whether a sample belongs to $S$ or $L$, and classification as assigning it the normal or abnormal class. For ease of reference, the data may be considered anonymized, i.e., it is unknown which machine produced it. As the first step in data pre-processing for this case study, the actual identity and class labels of all position recordings are isolated and stored separately to compute evaluation metrics. As such, the incoming data is both anonymous and unclassified (unlabeled). The data is then combined into a single, large dataset that is hereafter referred to as the initial dataset. Additionally, to make the task more challenging, the ratio $L:S$ is kept large so that the majority of the combined dataset is populated by data from the dissimilar machine. Finally, since the application of a supervised ML approach requires some annotated data, a small initial subset of the data is sampled via LHS and annotated, so that almost all of the data remains unlabeled.

These initial annotations consist of both source identification (S𝑆Sitalic_S or L𝐿Litalic_L) as input for a CL model and data classification (normal or abnormal). Note that source identification is not related to the downstream task of anomaly detection – it is only included to allow training the CL model in the ADs architecture. All successive annotations in the AL cycle consist purely of normal/abnormal classification so that the classifier can be retrained on the high-entropy data.

Figure 5: The printed product, a $2\times2\times2\text{ cm}^3$ cube. The square void indicated in the red bounding box is the anomaly that the downstream model is required to detect.

IV-C Experiment Settings

TABLE I: Experiment settings using the shared data from the three additive manufacturing machines.

| Setting | $\mathcal{D}_{\text{initial}}$ | $\mathcal{D}_{\text{training}}$ | $\mathcal{D}_{\text{testing}}$ |
| --- | --- | --- | --- |
| Supervised | N/A | $\{S_1,S_2\}^{\text{100\% train}}$ | $S_1^{\text{test}}, S_2^{\text{test}}$ |
| Random Pick S | $20\%S_1, 20\%S_2, 20\%L$ | $\{S_1,S_2\}^{\text{random pick}}$ | $S_1^{\text{test}}, S_2^{\text{test}}$ |
| Random Pick S+L | $20\%S_1, 20\%S_2, 20\%L$ | $\{S_1,S_2,L\}^{\text{random pick}}$ | $S_1^{\text{test}}, S_2^{\text{test}}$ |
| ADs Without CL | $20\%S_1, 20\%S_2, 20\%L$ | $\{S_1,S_2,L\}^{\text{queried}}$ | $S_1^{\text{test}}, S_2^{\text{test}}$ |
| ADs | $20\%S_1, 20\%S_2, 20\%L$ | $\{S_1,S_2\}^{\text{queried}}$ | $S_1^{\text{test}}, S_2^{\text{test}}$ |

At this point, the data is ready for input to ADs. Comparative experiments among the proposed ADs and other benchmark methods for anomaly detection are conducted to obtain more conclusive and insightful results. To that end, anomaly detection is performed in five different settings, which are now discussed in order; the experiment settings are summarized in Table I. Note that, in all settings: (1) in-text variables should be considered specific to the setting they appear in; (2) the testing dataset is kept the same across settings for a fair comparison; and (3) following standard practice, the training, testing, and validation sets $\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{test}}, \mathcal{D}_{\text{val}}$ are pairwise disjoint.

Supervised: This is the usual setting for most deep-learning approaches dealing with classification. It assumes that labels are available for all samples, which is the opposite of our scenario as it implies that data collection and annotations are not a problem, so data-sharing is not needed. As such, its performance on the downstream task is considered the baseline.

Random Pick S: In this setting, only the data from the similar machines is made available to the model, so $\mathcal{D} = \{S_1, S_2\}$. By removing the dissimilar data from the dataset, this setting allows observing the effect of the uncertainty score $\mathbf{U}$ from ADs in isolation, thus evaluating its effectiveness.

Random Pick S+L: This is similar to the Random Pick S setting, with the main difference being that the data available to the model comes from all three machines, so $\mathcal{D} = \{S_1, S_2, L\}$. As such, this setting better allows studying the usefulness of both the similarity score $\mathbf{S}$ and the uncertainty score $\mathbf{U}$. 20% of the entire data, alongside 400 randomly picked samples from the remaining data, forms the subset $\mathcal{D}^{\text{sub}}\subset\mathcal{D}$. Evidently, only the training dataset changes.

ADs Without CL: This setting strips away the contrastive model from the ADs framework, so that instead of a joint score $\mathbf{J}$ comprising a similarity score $\mathbf{S}$ and an uncertainty score $\mathbf{U}$, each sample is only assigned $\mathbf{U}$ to generate the query set for the next cycle. Thus, it represents a purely AL-based approach to data sharing. In this setting, data is sourced from all three machines, so $\mathcal{D} = \{S_1, S_2, L\}$. As in the Random Pick S+L setting, a subset of the data is formed by sampling 20% of this data, but no random samples are added. Instead, the 400 new samples are selected via AL: ADs is set to terminate after 5 cycles with 80 samples per cycle, and after adding the queried samples to the sampled 20%, the final subset $\mathcal{D}^{\text{sub}}\subset\mathcal{D}$ for training and validation is ready. Evidently, the only difference is how the 400 samples are selected.

ADs: This setting is exactly the same as ADs Without CL, except that the contrastive model is kept in the ADs framework, so that each sample is selected based on the joint score $\mathbf{J}$, which is in turn computed by integrating the similarity score $\mathbf{S}$ and the uncertainty score $\mathbf{U}$. Consequently, this setting minimizes the amount of dissimilar data $L$ introduced into the anomaly detector's training while still prioritizing data informativeness, thereby boosting model accuracy. Evidently, ADs treats AL as a multi-objective optimization problem, resulting in selected samples that are not only high-entropy samples but also highly likely to originate from the similar machines $S$.

Random Pick: This designation denotes that the training data (20% of the full data) is still specifically extracted, but the additional samples are selected at random. This decision is logically informed: owing to the randomized nature of the data, the anomaly detector forms different decision boundaries when re-trained on a new data split. Therefore, by running multiple instances of a model trained in this way, the average performance of these models yields a more accurate estimate of the model's true performance.

IV-D Results

The corresponding results, including the anomaly detector's accuracy, F1 score, and the percentage of selected samples that are not from the target distribution (%L), are tabulated in Table II. Ideally, the classification accuracy would be 100% (i.e., all samples are correctly classified as normal or abnormal), and the percentage of negative samples among the selected unlabeled samples would be 0% (i.e., no negative samples are selected). Furthermore, the effect of %L on model accuracy is visualized in Figure 6 for different choices of queried data and different percentages of the complete data used to create the training subset (Q. Data and %Data in Table II). Note that %L is only applicable to the Random Pick S+L, ADs Without CL, and ADs settings, which involve the dissimilar data from the larger machine $L$ and therefore present a chance of incorrectly adding its samples to the training pipeline. This is unlike the Supervised and Random Pick S settings, which only deal with similar-machine data.

TABLE II: Results of the case study using the shared data from the three additive manufacturing machines.

| Setting | %Data | Q. Data | Accuracy | F1 Score | %L |
| --- | --- | --- | --- | --- | --- |
| Supervised | 100 | n/a | 0.9437 | 0.9449 | n/a |
| Random Pick S | 20 | 400 | 0.8957 | 0.8980 | n/a |
| Random Pick S+L | 20 | 400 | 0.8805 | 0.8805 | 45.25 |
| ADs Without CL | 20 | 400 | 0.9510 | 0.9507 | 12.5 |
| ADs | 20 | 400 | 0.9578 | 0.9563 | 0 |
Figure 6: The ADs Framework efficiently queried high-quality data and also did not require a full dataset to achieve promising performance. (a) The usage of similarity scores computed from the CL network corresponds to lower %L, which in turn provides higher accuracy as expected. (b) Without CL and similarity scores, we see much higher %L values, resulting in worse accuracy and greater variability of the performance.

As evident from the results in Table II, even with just 20% of the complete data and 400 additional samples selected using our novel query strategy, the semi-supervised ADs outperforms the supervised baseline, both with and without CL (i.e., whether or not the similarity score is involved in the joint score computation). This is evidenced specifically by the increase in accuracy (ADs: +1.41%, ADs Without CL: +0.73%) and F1 score (ADs: +1.14%, ADs Without CL: +0.58%). Additionally, the results in Figure 6 confirm that enabling CL to generate similarity scores effectively allows the framework to distinguish similar and dissimilar data, evidenced by the significantly lower %L in the setting with CL enabled. Lower %L values also consistently correspond to higher accuracy in the downstream anomaly detection task despite variations in the size of the training data. On the contrary, the setting with CL disabled picks large amounts of dissimilar data, which corresponds to generally worse performance as well as a greater degree of result variability.

IV-E Ablation Study

Figure 7: Results of the ablation experiments. (a) and (b) evaluate the performance with and without CL, (c) evaluates the performance with varying numbers of initially sampled data, and (d) evaluates the performance with varying total numbers of selected samples.

In order to further reinforce the results of the main experiments, ablation experiments were performed. Their goals are to (1) establish that the reduction in the selection of dissimilar samples (%L) is not driven by the number of selected samples but primarily by the usage of CL, (2) study the relationship between model performance and the number of samples selected per cycle of ADs for a fixed budget of total samples, and (3) study the effect of increasing and decreasing the amount of queried data in each cycle. To accomplish the first two objectives, the total number of queried samples was fixed at 800 in addition to the 20% initially sampled data, and the only two variables were whether CL is enabled and the number of samples selected per cycle of ADs. Sixteen experiments were conducted with a total query budget of 800; the samples were queried evenly across cycles, with the number of cycles chosen from the set {2, 4, 5, 8, 10, 16, 20, 25, 32, 40, 50, 80, 100, 160, 200, 400}.

The results show a consistently and significantly smaller %L for the model with CL compared to the one without CL. The fewer samples queried per cycle, the more likely the model without CL is to exhibit high %L values; with CL, the result remains fairly stable regardless of the number of samples queried per cycle. This confirms the hypothesis that CL is the primary factor in reducing the number of dissimilar samples in the final query set. The results of these ablation experiments are visualized in Figures 7(a) and 7(b). For the third goal, CL is kept enabled, and the total number of queried samples is set to 600 and 800, with 20% initial data in both cases; the model performance is compared over varying samples per cycle for each scenario. While the performance does improve for the larger budget, the difference is small considering the 200-sample gap between the two settings. The results of this experiment are visualized in Figures 7(a) and 7(c). We also experimented with different amounts of initial data while keeping CL enabled and the number of queried samples the same; the model performance with 15% and 20% initial data is similar. The results of this experiment are visualized in Figures 7(a) and 7(d).

V Conclusion

Data-sharing among machines has great potential to contribute to the wide application of ML methods in manufacturing systems, as it addresses the challenge of data scarcity. However, typical data-sharing approaches consider neither the quality of the shared samples nor the inherent distribution mismatch of the data thus acquired. In this paper, a semi-supervised Active Data-sharing (ADs) framework is proposed to address these problems. ADs selects high-quality data that are both informative to the downstream ML task and likely to follow the target distribution. ADs views the problem as a multi-objective optimization problem and employs a novel joint acquisition function to query the Pareto optimal point that satisfies both objectives simultaneously. The joint score itself is computed from a combination of two individual scores, namely the uncertainty and similarity scores, which are obtained separately via entropy and CL techniques.

Systematic experiments are conducted to evaluate the effectiveness of ADs. The experiments involve real-world in-situ monitoring data from the same additive manufacturing process shared among three machines, two of which are similar (the same model), while one large machine from a different manufacturer is considered dissimilar to purposely introduce distribution mismatch. Results show that with only a fraction of initially annotated data and a few cycles of ADs to extend the annotated dataset, the anomaly detection model outperforms the baseline anomaly detector trained on the fully annotated dataset. Furthermore, this superior performance is achieved in a distribution-aware manner, i.e., without querying any samples from the mismatched distribution for training, thereby effectively addressing the distribution mismatch problem. Further ablation studies confirm that the design philosophy of ADs, the combination of AL and CL, is indeed the factor driving the improved performance: the usage of the similarity score directly reduces the amount of mismatched data in the selection, and smaller amounts of mismatched data further improve the accuracy of the anomaly detection task.

In light of these extensive experiments and their results, it is concluded that ADs effectively solves the problems of data quality and distribution mismatch prevalent in existing data-sharing approaches, and that this work establishes a new baseline for further improvements to data sharing in the industrial domain.

Appendix A Appendix

A-A Concavity of Entropy

Assuming a $C$-way classification problem, where the model's output probabilities are defined in terms of the model parameters as $P_\theta$ and the predicted class is $\hat{y}_i$, the Shannon entropy is defined as:

\begin{align}
H(x) &= \sum_{i}^{C} P_\theta(\hat{y}_i \mid x)\log_b\left(\frac{1}{P_\theta(\hat{y}_i \mid x)}\right)\\
&= -\sum_{i}^{C} P_\theta(\hat{y}_i \mid x)\log_b\left(P_\theta(\hat{y}_i \mid x)\right)
\end{align}

where $i$ indexes the $i^{\text{th}}$ class and $C$ is the total number of classes. In binary classification problems (as in the case presented in this paper), it suffices to consider a single term of the sum, since the entropy is a sum of such terms and a sum of concave functions is concave:

\[
H(x) = -P_{\theta}(\hat{y} \mid x)\,\log_{b}\!\big(P_{\theta}(\hat{y} \mid x)\big)
\]
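As a quick numeric illustration of the definitions above (not taken from the paper's code), the snippet below evaluates the full two-class entropy and the single retained term for a hypothetical softmax output, using base b = 2:

```python
# Tiny numeric illustration of the entropy definitions (illustrative only);
# `probs` is a hypothetical 2-class softmax output for a single sample, base b = 2.
import math

probs = [0.7, 0.3]                                  # P_theta(y_i | x) for C = 2 classes
H_full = -sum(p * math.log2(p) for p in probs)      # full C-way Shannon entropy, ~0.881 bits
p_hat = max(probs)                                  # probability of the predicted class y_hat
H_term = -p_hat * math.log2(p_hat)                  # single retained term, ~0.360 bits
```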

There are several ways to prove the concavity of Shannon entropy. For readability, let $P_{\theta}(\hat{y} \mid x) = p_x$. For a function to be concave, its second derivative with respect to each of its variables must be non-positive over the entire domain of the function:

\begin{align}
H'(x) &= \frac{d}{dp_x}\left(-p_x \log_b p_x\right) \nonumber\\
      &= -\left(\log_b p_x + \frac{p_x}{p_x \ln b}\right) \nonumber\\
      &= -\left(\log_b p_x + \frac{1}{\ln b}\right) \nonumber\\
H''(x) &= \frac{d}{dp_x}\left(-\log_b p_x - \frac{1}{\ln b}\right) \nonumber\\
       &= -\frac{1}{p_x \ln b} \tag{13}
\end{align}

Here, the probability mass $p_x \in [0, 1]$ is non-negative by definition, while $\ln(b)$ is a positive constant for any logarithmic base $b > 1$. Therefore, the expression in Equation (13) is always non-positive, which proves that entropy is a concave function.
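The conclusion can also be checked numerically. The following sketch, assuming base b = 2 and a grid of $p_x$ values strictly inside (0, 1), compares the analytic second derivative in Equation (13) with a central finite difference of the single entropy term and confirms it is negative everywhere:

```python
# Numeric sanity check of Equation (13) (illustrative; assumes base b = 2 and a
# grid of p_x values strictly inside (0, 1)). A central finite difference of the
# single entropy term h(p) = -p * log_b(p) should match -1 / (p * ln b) and be negative.
import numpy as np

b = 2.0
h = lambda p: -p * np.log(p) / np.log(b)
p = np.linspace(0.05, 0.95, 19)
eps = 1e-4

second_fd = (h(p + eps) - 2.0 * h(p) + h(p - eps)) / eps**2
second_analytic = -1.0 / (p * np.log(b))

assert np.allclose(second_fd, second_analytic, atol=1e-3)
assert np.all(second_fd < 0)   # non-positive everywhere, so the entropy term is concave
```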
