Machine Learning
Showing new listings for Thursday, 17 April 2025
- [1] arXiv:2504.11516 [pdf, html, other]
Title: FEAT: Free energy Estimators with Adaptive Transport
Authors: Jiajun He, Yuanqi Du, Francisco Vargas, Yuanqing Wang, Carla P. Gomes, José Miguel Hernández-Lobato, Eric Vanden-Eijnden
Comments: 29 pages, 2 tables, 3 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
We present Free energy Estimators with Adaptive Transport (FEAT), a novel framework for free energy estimation -- a critical challenge across scientific domains. FEAT leverages learned transports implemented via stochastic interpolants and provides consistent, minimum-variance estimators based on escorted Jarzynski equality and controlled Crooks theorem, alongside variational upper and lower bounds on free energy differences. Unifying equilibrium and non-equilibrium methods under a single theoretical framework, FEAT establishes a principled foundation for neural free energy calculations. Experimental validation on toy examples, molecular simulations, and quantum field theory demonstrates improvements over existing learning-based methods.
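For orientation, the following is a minimal sketch of the plain (unescorted) Jarzynski free-energy estimator that FEAT's escorted variants build on, not the paper's method; the work samples, inverse temperature, and all variable names are illustrative assumptions.

```python
import numpy as np

def jarzynski_free_energy(work, beta=1.0):
    """Plain Jarzynski estimator: Delta F = -(1/beta) * log E[exp(-beta * W)].

    `work` holds non-equilibrium work values, one per switching trajectory.
    A log-sum-exp keeps the exponential average numerically stable.
    """
    w = -beta * np.asarray(work)
    log_mean_exp = np.logaddexp.reduce(w) - np.log(len(w))
    return -log_mean_exp / beta

# Toy check: for Gaussian work with mean mu and variance sigma^2,
# the estimate should approach mu - beta * sigma**2 / 2.
rng = np.random.default_rng(0)
work = rng.normal(loc=2.0, scale=0.5, size=100_000)
print(jarzynski_free_energy(work, beta=1.0))  # ~ 2.0 - 0.125
```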
- [2] arXiv:2504.11554 [pdf, other]
Title: Normalizing Flow Regression for Bayesian Inference with Offline Likelihood Evaluations
Comments: Accepted at the Proceedings track of the 7th Symposium on Advances in Approximate Bayesian Inference (AABI 2025). 40 pages, 10 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian inference with computationally expensive likelihood evaluations remains a significant challenge in many scientific domains. We propose normalizing flow regression (NFR), a novel offline inference method for approximating posterior distributions. Unlike traditional surrogate approaches that require additional sampling or inference steps, NFR directly yields a tractable posterior approximation through regression on existing log-density evaluations. We introduce training techniques specifically for flow regression, such as tailored priors and likelihood functions, to achieve robust posterior and model evidence estimation. We demonstrate NFR's effectiveness on synthetic benchmarks and real-world applications from neuroscience and biology, showing superior or comparable performance to existing methods. NFR represents a promising approach for Bayesian inference when standard methods are computationally prohibitive or existing model evaluations can be recycled.
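As a rough illustration of the regression-on-log-densities idea (not the paper's normalizing-flow surrogate or its tailored priors), the sketch below fits a simple Gaussian log-density surrogate to precomputed evaluations by least squares; the toy target and all names are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Offline evaluations: parameter points and their (unnormalized) log densities.
rng = np.random.default_rng(1)
theta = rng.normal(size=(200, 2))
log_p = -0.5 * np.sum((theta - 1.0) ** 2 / 0.3**2, axis=1)  # toy target: N(1, 0.3^2 I)

def surrogate_log_density(params, x):
    """Gaussian surrogate with diagonal covariance plus a free log-normalizer."""
    mu, log_sigma, log_z = params[:2], params[2:4], params[4]
    return -0.5 * np.sum((x - mu) ** 2 / np.exp(2 * log_sigma), axis=1) + log_z

def regression_loss(params):
    return np.mean((surrogate_log_density(params, theta) - log_p) ** 2)

fit = minimize(regression_loss, x0=np.zeros(5), method="L-BFGS-B")
mu_hat, sigma_hat = fit.x[:2], np.exp(fit.x[2:4])
print(mu_hat, sigma_hat)  # should land near [1, 1] and [0.3, 0.3]
```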
- [3] arXiv:2504.11609 [pdf, html, other]
Title: Towards Interpretable Deep Generative Models via Causal Representation Learning
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Recent developments in generative artificial intelligence (AI) rely on machine learning techniques such as deep learning and generative modeling to achieve state-of-the-art performance across wide-ranging domains. These methods' surprising performance is due in part to their ability to learn implicit "representations" of complex, multi-modal data. Unfortunately, deep neural networks are notoriously black boxes that obscure these representations, making them difficult to interpret or analyze. To resolve these difficulties, one approach is to build new interpretable neural network models from the ground up. This is the goal of the emerging field of causal representation learning (CRL) that uses causality as a vector for building flexible, interpretable, and transferable generative AI. CRL can be seen as a culmination of three intrinsically statistical problems: (i) latent variable models such as factor analysis; (ii) causal graphical models with latent variables; and (iii) nonparametric statistics and deep learning. This paper reviews recent progress in CRL from a statistical perspective, focusing on connections to classical models and statistical and causal identifiability results. This review also highlights key application areas, implementation strategies, and open statistical questions in CRL.
- [4] arXiv:2504.11610 [pdf, html, other]
Title: Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Background: The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when dealing with partial observations with missing data. Results: We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. The model demonstrates robustness to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset. Conclusion: GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. To make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at this https URL.
- [5] arXiv:2504.11775 [pdf, html, other]
Title: Discrimination-free Insurance Pricing with Privatized Sensitive Attributes
Subjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG); Risk Management (q-fin.RM)
Fairness has emerged as a critical consideration in the landscape of machine learning algorithms, particularly as AI continues to transform decision-making across societal domains. To ensure that these algorithms are free from bias and do not discriminate against individuals based on sensitive attributes such as gender and race, the field of algorithmic bias has introduced various fairness concepts, along with methodologies to achieve these notions in different contexts. Despite the rapid advancement, not all sectors have embraced these fairness principles to the same extent. One specific sector that merits attention in this regard is insurance. Within the realm of insurance pricing, fairness is defined through a distinct and specialized framework. Consequently, achieving fairness according to established notions does not automatically ensure fair pricing in insurance. In particular, regulators are increasingly emphasizing transparency in pricing algorithms and imposing constraints on insurance companies regarding the collection and utilization of sensitive consumer attributes. These factors present additional challenges in the implementation of fairness in pricing algorithms. To address these complexities and comply with regulatory demands, we propose an efficient method for constructing fair models that are tailored to the insurance domain, using only privatized sensitive attributes. Notably, our approach ensures statistical guarantees, does not require direct access to sensitive attributes, and adapts to varying transparency requirements, addressing regulatory demands while ensuring fairness in insurance pricing.
- [6] arXiv:2504.12175 [pdf, other]
Title: Approximation Bounds for Transformer Networks with Application to Regression
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We explore the approximation capabilities of Transformer networks for Hölder and Sobolev functions, and apply these results to address nonparametric regression estimation with dependent observations. First, we establish novel upper bounds for standard Transformer networks approximating sequence-to-sequence mappings whose component functions are Hölder continuous with smoothness index $\gamma \in (0,1]$. To achieve an approximation error $\varepsilon$ under the $L^p$-norm for $p \in [1, \infty]$, it suffices to use a fixed-depth Transformer network whose total number of parameters scales as $\varepsilon^{-d_x n / \gamma}$. This result not only extends existing findings to include the case $p = \infty$, but also matches the best known upper bounds on number of parameters previously obtained for fixed-depth FNNs and RNNs. Similar bounds are also derived for Sobolev functions. Second, we derive explicit convergence rates for the nonparametric regression problem under various $\beta$-mixing data assumptions, which allow the dependence between observations to weaken over time. Our bounds on the sample complexity impose no constraints on weight magnitudes. Lastly, we propose a novel proof strategy to establish approximation bounds, inspired by the Kolmogorov-Arnold representation theorem. We show that if the self-attention layer in a Transformer can perform column averaging, the network can approximate sequence-to-sequence Hölder functions, offering new insights into the interpretability of self-attention mechanisms.
- [7] arXiv:2504.12189 [pdf, html, other]
Title: Leave-One-Out Stable Conformal Prediction
Comments: Accepted at ICLR 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Conformal prediction (CP) is an important tool for distribution-free predictive uncertainty quantification. Yet, a major challenge is to balance computational efficiency and prediction accuracy, particularly for multiple predictions. We propose Leave-One-Out Stable Conformal Prediction (LOO-StabCP), a novel method to speed up full conformal prediction using algorithmic stability without sample splitting. By leveraging leave-one-out stability, our method is much faster at handling a large number of prediction requests than the existing method RO-StabCP, which is based on replace-one stability. We derive stability bounds for several popular machine learning tools: regularized loss minimization (RLM) and stochastic gradient descent (SGD), as well as kernel methods, neural networks, and bagging. Our method is theoretically justified and demonstrates superior numerical performance on synthetic and real-world data. We apply our method to a screening problem, where its effective exploitation of training data leads to improved test power compared to a state-of-the-art method based on split conformal prediction.
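For context, here is the standard split-conformal baseline that stability-based methods aim to improve on computationally; this is not LOO-StabCP itself, and the model, data, and miscoverage level are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)

# Split conformal: fit on one half, calibrate a residual quantile on the other.
X_fit, y_fit = X[:250], y[:250]
X_cal, y_cal = X[250:], y[250:]
model = Ridge(alpha=1.0).fit(X_fit, y_fit)

alpha = 0.1  # target miscoverage
scores = np.abs(y_cal - model.predict(X_cal))
q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

x_new = rng.normal(size=(1, 5))
pred = model.predict(x_new)[0]
print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")
```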
New submissions (showing 7 of 7 entries)
- [8] arXiv:2504.11555 (cross-list from math.OC) [pdf, html, other]
Title: Sub-optimality of the Separation Principle for Quadratic Control from Bilinear Observations
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
We consider the problem of controlling a linear dynamical system from bilinear observations with minimal quadratic cost. Despite the similarity of this problem to standard linear quadratic Gaussian (LQG) control, we show that when the observation model is bilinear, neither does the Separation Principle hold, nor is the optimal controller affine in the estimated state. Moreover, the cost-to-go is non-convex in the control input. Hence, finding an analytical expression for the optimal feedback controller is difficult in general. Under certain settings, we show that the standard LQG controller locally maximizes the cost instead of minimizing it. Furthermore, the optimal controllers (derived analytically) are not unique and are nonlinear in the estimated state. We also introduce a notion of input-dependent observability and derive conditions under which the Kalman filter covariance remains bounded. We illustrate our theoretical results through numerical experiments in multiple synthetic settings.
- [9] arXiv:2504.11831 (cross-list from cs.LG) [pdf, html, other]
Title: Support is All You Need for Certified VAE Training
Comments: 21 pages, 3 figures, ICLR '25
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Variational Autoencoders (VAEs) have become increasingly popular and deployed in safety-critical applications. In such applications, we want to give certified probabilistic guarantees on performance under adversarial attacks. We propose a novel method, CIVET, for certified training of VAEs. CIVET depends on the key insight that we can bound worst-case VAE error by bounding the error on carefully chosen support sets at the latent layer. We show this point mathematically and present a novel training algorithm utilizing this insight. We show in an extensive evaluation across different datasets (in both the wireless and vision application areas), architectures, and perturbation magnitudes that our method outperforms SOTA methods, achieving good standard performance with strong robustness guarantees.
- [10] arXiv:2504.11848 (cross-list from stat.ME) [pdf, html, other]
Title: Proximal Inference on Population Intervention Indirect Effect
Comments: 60 pages, 3 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
The population intervention indirect effect (PIIE) is a novel mediation effect representing the indirect component of the population intervention effect. Unlike traditional mediation measures, such as the natural indirect effect, the PIIE holds particular relevance in observational studies involving unethical exposures, when hypothetical interventions that impose harmful exposures are inappropriate. Although prior research has identified PIIE under unmeasured confounders between exposure and outcome, it has not fully addressed the confounding that affects the mediator. This study extends the PIIE identification to settings where unmeasured confounders influence exposure-outcome, exposure-mediator, and mediator-outcome relationships. Specifically, we leverage observed covariates as proxy variables for unmeasured confounders, constructing three proximal identification frameworks. Additionally, we characterize the semiparametric efficiency bound and develop multiply robust and locally efficient estimators. To handle high-dimensional nuisance parameters, we propose a debiased machine learning approach that achieves $\sqrt{n}$-consistency and asymptotic normality to estimate the true PIIE values, even when the machine learning estimators for the nuisance functions do not converge at $\sqrt{n}$-rate. In simulations, our estimators demonstrate higher confidence interval coverage rates than conventional methods across various model misspecifications. In a real data application, our approaches reveal an indirect effect of alcohol consumption on depression risk mediated by depersonalization symptoms.
- [11] arXiv:2504.12156 (cross-list from cs.LG) [pdf, html, other]
Title: Predictive Multiplicity in Survival Models: A Method for Quantifying Model Uncertainty in Predictive Maintenance Applications
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In many applications, especially those involving prediction, models may yield near-optimal performance yet significantly disagree on individual-level outcomes. This phenomenon, known as predictive multiplicity, has been formally defined in binary, probabilistic, and multi-target classification, and undermines the reliability of predictive systems. However, its implications remain unexplored in the context of survival analysis, which involves estimating the time until a failure or similar event while properly handling censored data. We frame predictive multiplicity as a critical concern in survival-based models and introduce formal measures -- ambiguity, discrepancy, and obscurity -- to quantify it. This is particularly relevant for downstream tasks such as maintenance scheduling, where precise individual risk estimates are essential. Understanding and reporting predictive multiplicity helps build trust in models deployed in high-stakes environments. We apply our methodology to benchmark datasets from predictive maintenance, extending the notion of multiplicity to survival models. Our findings show that ambiguity steadily increases, reaching up to 40-45% of observations; discrepancy is lower but exhibits a similar trend; and obscurity remains mild and concentrated in a few models. These results demonstrate that multiple accurate survival models may yield conflicting estimations of failure risk and degradation progression for the same equipment. This highlights the need to explicitly measure and communicate predictive multiplicity to ensure reliable decision-making in process health management.
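A minimal sketch of one such measure, assuming ambiguity is read as the fraction of cases on which at least one near-optimal competing model disagrees with a baseline beyond a tolerance; the paper's survival-specific definitions are not reproduced here, and all data are toy values.

```python
import numpy as np

def ambiguity(baseline_pred, competing_preds, tol=0.0):
    """Fraction of cases where at least one near-optimal model deviates
    from the baseline model's prediction by more than `tol`."""
    competing_preds = np.asarray(competing_preds)          # (n_models, n_cases)
    disagree = np.abs(competing_preds - baseline_pred) > tol
    return disagree.any(axis=0).mean()

# Toy example: three near-optimal risk models scoring five pieces of equipment.
baseline = np.array([0.2, 0.5, 0.7, 0.1, 0.9])
rivals = np.array([[0.2, 0.6, 0.7, 0.1, 0.9],
                   [0.2, 0.5, 0.4, 0.1, 0.9],
                   [0.2, 0.5, 0.7, 0.1, 0.9]])
print(ambiguity(baseline, rivals, tol=0.05))  # 0.4: cases 2 and 3 are ambiguous
```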
- [12] arXiv:2504.12287 (cross-list from stat.ME) [pdf, other]
Title: Trend Filtered Mixture of Experts for Automated Gating of High-Frequency Flow Cytometry Data
Comments: 23 pages (including supplement), 9 figures (including supplement)
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)
Ocean microbes are critical to both ocean ecosystems and the global climate. Flow cytometry, which measures cell optical properties in fluid samples, is routinely used in oceanographic research. Despite decades of accumulated data, identifying key microbial populations (a process known as "gating") remains a significant analytical challenge. To address this, we focus on gating multidimensional, high-frequency flow cytometry data collected continuously on board oceanographic research vessels, capturing time- and space-wise variations in the dynamic ocean. Our paper proposes a novel mixture-of-experts model in which both the gating function and the experts are given by trend filtering. The model leverages two key assumptions: (1) Each snapshot of flow cytometry data is a mixture of multivariate Gaussians and (2) the parameters of these Gaussians vary smoothly over time. Our method uses regularization and a constraint to ensure smoothness and that cluster means match biologically distinct microbe types. We demonstrate, using flow cytometry data from the North Pacific Ocean, that our proposed model accurately matches human-annotated gating and corrects significant errors.
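As a naive point of comparison (not the paper's trend-filtered mixture of experts), the sketch below fits an independent two-component Gaussian mixture to each simulated snapshot; the drifting populations and all names are made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Fake flow-cytometry snapshots: two cell populations whose means drift over time.
snapshots = []
for t in range(5):
    pop1 = rng.normal(loc=[1 + 0.1 * t, 2], scale=0.2, size=(300, 2))
    pop2 = rng.normal(loc=[4, 5 - 0.1 * t], scale=0.3, size=(200, 2))
    snapshots.append(np.vstack([pop1, pop2]))

# Naive baseline: fit an independent 2-component mixture to each snapshot.
# (The paper instead ties the Gaussian parameters across time via trend filtering.)
for t, X in enumerate(snapshots):
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    means = gmm.means_[np.argsort(gmm.means_[:, 0])]  # order components for readability
    print(f"t={t}: component means ~ {np.round(means, 2).tolist()}")
```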
Cross submissions (showing 5 of 5 entries)
- [13] arXiv:2309.14073 (replaced) [pdf, other]
Title: Neural Network Parameter-optimization of Gaussian pmDAGs
Comments: 48 pages
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR)
Finding the parameters of a latent variable causal model is central to causal inference and causal identification. In this article, we show that existing graphical structures used in causal inference are not stable under marginalization of Gaussian Bayesian networks, and we present a graphical structure that faithfully represents margins of Gaussian Bayesian networks. We present the first duality between parameter optimization of a latent variable model and training a feed-forward neural network in the parameter space of the assumed family of distributions. Based on this observation, we develop an algorithm for parameter optimization of these graphical structures based on a given observational distribution. We then provide conditions for causal effect identifiability in the Gaussian setting and propose a meta-algorithm that checks whether a causal effect is identifiable or not. Moreover, we lay the groundwork for generalizing the duality between a neural network and a causal model from the Gaussian setting to other distributions.
- [14] arXiv:2402.12668 (replaced) [pdf, html, other]
Title: Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We study the often overlooked phenomenon, first noted in \cite{breiman2001random}, that random forests appear to reduce bias compared to bagging. Motivated by an interesting paper by \cite{mentch2020randomization}, where the authors explain the success of random forests in low signal-to-noise ratio (SNR) settings through regularization, we explore how random forests can capture patterns in the data that bagging ensembles fail to capture. We empirically demonstrate that in the presence of such patterns, random forests reduce bias along with variance and can increasingly outperform bagging ensembles when SNR is high. Our observations offer insights into the real-world success of random forests across a range of SNRs and enhance our understanding of the difference between random forests and bagging ensembles. Our investigations also yield practical insights into the importance of tuning $mtry$ in random forests.
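A minimal empirical comparison in the spirit of the paper, assuming scikit-learn: a random forest with restricted mtry (max_features) versus a bagging ensemble that considers all features at every split, on a synthetic high-SNR signal; all settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 10))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=2000)  # high SNR

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging = a random forest with all features considered at every split.
bagging = RandomForestRegressor(n_estimators=200, max_features=1.0, random_state=0)
# Random forest with mtry restricted to one third of the features.
forest = RandomForestRegressor(n_estimators=200, max_features=0.33, random_state=0)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    model.fit(X_tr, y_tr)
    print(name, mean_squared_error(y_te, model.predict(X_te)))
```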
- [15] arXiv:2405.13950 (replaced) [pdf, html, other]
Title: Learning to sample fibers for goodness-of-fit testing
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We consider the problem of constructing exact goodness-of-fit tests for discrete exponential family models. This classical problem remains practically unsolved for many types of structured or sparse data, as it rests on a computationally difficult core task: to produce a reliable sample from lattice points in a high-dimensional polytope. We translate the problem into a Markov decision process and demonstrate a reinforcement learning approach for learning `good moves' for sampling. We illustrate the approach on data sets and models for which traditional MCMC samplers converge too slowly due to problem size, sparsity structure, and the requirement to use prohibitive non-linear algebra computations in the process. The differentiating factor is the use of scalable tools from \emph{linear} algebra in the context of theoretical guarantees provided by \emph{non-linear} algebra. Our algorithm is based on an actor-critic sampling scheme, with provable convergence.
The discovered moves can be used to efficiently obtain an exchangeable sample, significantly cutting computational times with regard to statistical testing.
- [16] arXiv:2406.08307 (replaced) [pdf, html, other]
Title: Measuring training variability from stochastic optimization using robust nonparametric testing
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Deep neural network training often involves stochastic optimization, meaning each run will produce a different model. This implies that hyperparameters of the training process, such as the random seed itself, can potentially have significant influence on the variability in the trained models. Measuring model quality by summary statistics, such as test accuracy, can obscure this dependence. We propose a robust hypothesis testing framework and a novel summary statistic, the $\alpha$-trimming level, to measure model similarity. Applying hypothesis testing directly with the $\alpha$-trimming level is challenging because we cannot accurately describe the distribution under the null hypothesis. Our framework addresses this issue by determining how closely an approximate distribution resembles the expected distribution of a group of individually trained models and using this approximation as our reference. We then use the $\alpha$-trimming level to suggest how many training runs should be sampled to ensure that an ensemble is a reliable representative of the true model performance. We also show how to use the $\alpha$-trimming level to measure model variability and demonstrate experimentally that it is more expressive than performance metrics like validation accuracy, churn, or expected calibration error when taken alone. An application of fine-tuning over random seed in transfer learning illustrates the advantage of our new metric.
- [17] arXiv:2410.03098 (replaced) [pdf, html, other]
Title: Forest Proximities for Time Series
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
RF-GAP has recently been introduced as an improved random forest proximity measure. In this paper, we present PF-GAP, an extension of RF-GAP proximities to proximity forests, an accurate and efficient time series classification model. We use the forest proximities in connection with Multi-Dimensional Scaling to obtain vector embeddings of univariate time series, comparing the embeddings to those obtained using various time series distance measures. We also use the forest proximities alongside Local Outlier Factors to investigate the connection between misclassified points and outliers, comparing with nearest neighbor classifiers which use time series distance measures. We show that the forest proximities seem to exhibit a stronger connection between misclassified points and outliers than nearest neighbor classifiers.
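A hedged sketch of the embedding step only: given a (placeholder) forest proximity matrix, convert proximities to dissimilarities and embed with metric MDS; the PF-GAP proximities themselves are not computed here, and the random matrix stands in for them.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Placeholder proximity matrix (symmetric, 1s on the diagonal); in practice this
# would come from RF-GAP/PF-GAP proximities of a fitted proximity forest.
n = 30
P = rng.uniform(0.1, 0.9, size=(n, n))
P = (P + P.T) / 2
np.fill_diagonal(P, 1.0)

# Common trick: turn proximities into dissimilarities, then embed with MDS.
D = np.sqrt(1.0 - P)
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(D)
print(embedding.shape)  # (30, 2) vector embeddings of the time series
```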
- [18] arXiv:2502.12999 (replaced) [pdf, html, other]
Title: Asymptotic Optimism of Random-Design Linear and Kernel Regression Models
Comments: 56 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We derive the closed-form asymptotic optimism of linear regression models under random designs and generalize it to kernel ridge regression. Using scaled asymptotic optimism as a generic measure of predictive model complexity, we study the fundamentally different behaviors of linear regression models, neural tangent kernel (NTK) regression models, and three-layer fully connected neural networks (NNs). Our contribution is two-fold: we provide theoretical grounds for using scaled optimism as a measure of model predictive complexity, and we show empirically that NNs with ReLUs behave differently from kernel models under this measure. With resampling techniques, we can also compute the optimism for regression models on real data.
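For intuition, a Monte-Carlo sketch of optimism (expected fresh-noise error minus training error) for OLS, compared against the classical fixed-design value 2*sigma^2*p/n; this is not the paper's random-design closed form, and the simulation settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, reps = 100, 5, 1.0, 2000
beta = np.ones(p)

gaps = []
for _ in range(reps):
    X = rng.normal(size=(n, p))                           # random design
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    train_err = np.mean((y - X @ beta_hat) ** 2)
    y_fresh = X @ beta + rng.normal(scale=sigma, size=n)   # same design, new noise
    test_err = np.mean((y_fresh - X @ beta_hat) ** 2)
    gaps.append(test_err - train_err)

print(np.mean(gaps))          # Monte-Carlo optimism
print(2 * sigma**2 * p / n)   # classical fixed-design value for comparison
```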
- [19] arXiv:2503.18309 (replaced) [pdf, html, other]
Title: Efficient Transformed Gaussian Process State-Space Models for Non-Stationary High-Dimensional Dynamical Systems
Comments: 13 pages, 6 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
Gaussian process state-space models (GPSSMs) offer a principled framework for learning and inference in nonlinear dynamical systems with uncertainty quantification. However, existing GPSSMs are limited by the use of multiple independent stationary Gaussian processes (GPs), leading to prohibitive computational and parametric complexity in high-dimensional settings and restricted modeling capacity for non-stationary dynamics. To address these challenges, we propose an efficient transformed Gaussian process state-space model (ETGPSSM) for scalable and flexible modeling of high-dimensional, non-stationary dynamical systems. Specifically, our ETGPSSM integrates a single shared GP with input-dependent normalizing flows, yielding an expressive implicit process prior that captures complex, non-stationary transition dynamics while significantly reducing model complexity. For the inference of the implicit process, we develop a variational inference algorithm that jointly approximates the posterior over the underlying GP and the neural network parameters defining the normalizing flows. To avoid explicit variational parameterization of the latent states, we further incorporate the ensemble Kalman filter (EnKF) into the variational framework, enabling accurate and efficient state estimation. Extensive empirical evaluations on synthetic and real-world datasets demonstrate the superior performance of our ETGPSSM in system dynamics learning, high-dimensional state estimation, and time-series forecasting, outperforming existing GPSSMs and neural network-based SSMs in terms of computational efficiency and accuracy.
- [20] arXiv:2210.06672 (replaced) [pdf, html, other]
Title: Variance-Aware Estimation of Kernel Mean Embedding
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed-up convergence by leveraging variance information in the reproducing kernel Hilbert space. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desiderata of a distribution agnostic bound that enjoys acceleration in fortuitous settings. We further extend our results from independent data to stationary mixing sequences and illustrate our methods in the context of hypothesis testing and robust parametric estimation.
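A minimal sketch of the plain empirical kernel mean embedding (not the variance-aware estimator): the embedding is the average kernel section (1/n) * sum_i k(x_i, .), shown here with an RBF kernel along with a V-statistic MMD^2 between two samples; the data and lengthscale are illustrative.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    """Gaussian RBF kernel matrix between 1-D sample arrays x and y."""
    d2 = (x[:, None] - y[None, :]) ** 2
    return np.exp(-d2 / (2 * lengthscale**2))

rng = np.random.default_rng(0)
sample = rng.normal(size=500)               # draws from the unknown distribution
grid = np.linspace(-4, 4, 9)                # points at which to evaluate the embedding

# Empirical kernel mean embedding: mu_hat(t) = (1/n) * sum_i k(x_i, t).
mu_hat = rbf_kernel(sample, grid).mean(axis=0)
print(np.round(mu_hat, 3))

# MMD^2 estimate between two samples via their embeddings (V-statistic version).
sample2 = rng.normal(loc=0.5, size=500)
mmd2 = (rbf_kernel(sample, sample).mean()
        - 2 * rbf_kernel(sample, sample2).mean()
        + rbf_kernel(sample2, sample2).mean())
print(mmd2)
```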
- [21] arXiv:2311.13017 (replaced) [pdf, html, other]
Title: W-Kernel and Its Principal Space for Frequentist Evaluation of Bayesian Estimators
Comments: The introductory sections have been revised to clarify the relationship with previous work. The discussion of Bayesian-frequentist duality and the Z matrix has also been revised. The analysis of numerical experiments is substantially extended. The title has been updated
Subjects: Methodology (stat.ME); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (stat.ML)
Evaluating the variability of posterior estimates is a key aspect of Bayesian model assessment. In this study, we focus on the posterior covariance matrix W, defined using the log-likelihood of each observation. Previous studies have examined the role of the principal space of W in Bayesian sensitivity analysis, notably MacEachern and Peruggia (2002) and Thomas et al. (2018). In this work, we show that the principal space of W is also relevant for frequentist evaluation, using the recently proposed Bayesian infinitesimal jackknife (IJ) approximation Giordano and Broderick (2023) as a key tool. We next consider the relationship between the matrix W and the Fisher kernel. We show that the Fisher kernel can be regarded as an approximation to W; the matrix W, in itself, can be interpreted as a reproducing kernel, which we refer to as the W-kernel. Based on this connection, we examine the dual relationship between the W-kernel formulation in the data space and the classical asymptotic formulation in the parameter space. These ideas suggest a form of Bayesian-frequentist duality that emerges through the dual structure of kernel PCA, where posterior and frequentist covariances serve as inner products in their respective spaces. As an application, we consider an approximate bootstrap of posterior means based on posterior samples generated by MCMC. We show that the projection onto the principal space of W facilitates frequentist evaluation, particularly of the higher-order term in this procedure. In one of the appendices, we introduce incomplete Cholesky decomposition as an efficient method for computing the principal space of W and discuss the related concept of representative subsets of the observations.
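A hedged sketch of computing the matrix W as described, i.e., the posterior covariance of per-observation log-likelihoods, from (simulated) posterior draws of a toy Gaussian-mean model; the model, sizes, and names are illustrative, and the paper's incomplete-Cholesky computation is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_i ~ N(theta, 1) with a flat prior, so the posterior of theta
# given y is N(ybar, 1/n).  We mimic MCMC output by sampling it directly.
y = rng.normal(loc=2.0, size=50)
theta_draws = rng.normal(loc=y.mean(), scale=1 / np.sqrt(len(y)), size=4000)

# Per-observation log-likelihoods, one row per posterior draw.
loglik = -0.5 * (y[None, :] - theta_draws[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)

# W: posterior covariance matrix of the per-observation log-likelihoods.
W = np.cov(loglik, rowvar=False)

# Principal space of W via its leading eigenvectors.
eigvals, eigvecs = np.linalg.eigh(W)
print(eigvals[-3:])  # a few leading eigenvalues (ascending order)
```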
- [22] arXiv:2402.10504 (replaced) [pdf, html, other]
Title: Resilience of Rademacher chaos of low degree
Comments: Small corrections from previous version
Subjects: Probability (math.PR); Information Theory (cs.IT); Machine Learning (cs.LG); Combinatorics (math.CO); Machine Learning (stat.ML)
The resilience of a Rademacher chaos is the maximum number of adversarial sign-flips that the chaos can sustain without having its largest atom probability significantly altered. Inspired by probabilistic lower-bound guarantees for the resilience of linear Rademacher chaos, obtained by Bandeira, Ferber, and Kwan (Advances in Mathematics, Vol. $319$, $2017$), we provide probabilistic lower-bound guarantees for the resilience of Rademacher chaos of arbitrary yet sufficiently low degree.
Our main results distinguish between Rademacher chaos of order two and those of higher order. In that, our first main result pertains to the resilience of decoupled bilinear Rademacher forms where different asymptotic behaviour is observed for sparse and dense matrices. For our second main result, we bootstrap our first result in order to provide resilience guarantees for quadratic Rademacher chaos. Our third main result, generalises the first and handles the resilience of decoupled Rademacher chaos of arbitrary yet sufficiently low order.
Our results for decoupled Rademacher chaos of order two and those of higher order, while established through the same conceptual framework, differ substantially; the difference stems from how the same conceptual argument is implemented. The order-two result is established using Dudley's maximal inequality for sub-Gaussian processes, the Hanson-Wright inequality, as well as the Kolmogorov-Rogozin inequality. To handle higher-order chaos, appeals to Dudley's inequality and the Hanson-Wright inequality are replaced with tools suited for random tensors; in particular, the Hanson-Wright inequality is replaced with a concentration result for random tensors put forth by Adamczak and Wolff.
Our results are instance-dependent and thus allow for the efficient computation of resilience guarantees provided the order of the chaos is constant.
- [23] arXiv:2407.00143 (replaced) [pdf, other]
Title: InfoNCE: Identifying the Gap Between Theory and Practice
Authors: Evgenia Rusak, Patrik Reizinger, Attila Juhos, Oliver Bringmann, Roland S. Zimmermann, Wieland Brendel
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Prior theory work on Contrastive Learning via the InfoNCE loss showed that, under certain assumptions, the learned representations recover the ground-truth latent factors. We argue that these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they either assume equal variance across all latents or that certain latents are kept invariant. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change with a continuum of variability across all factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Finally, we discuss the remaining mismatches between theoretical assumptions and practical implementations.
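For reference, a minimal NumPy version of the standard InfoNCE loss over a batch of positive pairs; AnInfoNCE's anisotropic generalization is not shown, and the embeddings here are random toy data.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE loss for a batch of positive pairs (z1[i], z2[i]).

    Each row is an embedding; all other rows of z2 act as negatives for z1[i].
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                    # (batch, batch) similarity matrix
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))               # positives sit on the diagonal

rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 16))
positive = anchor + 0.05 * rng.normal(size=(8, 16))     # augmented views
print(info_nce(anchor, positive))
```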
- [24] arXiv:2407.17112 (replaced) [pdf, html, other]
Title: Neural Dueling Bandits: Preference-Based Optimization with Human Feedback
Comments: Accepted at ICLR 2025. Also, accepted at ICML 2024 Workshop on Foundations of Reinforcement Learning and Control
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Contextual dueling bandits model problems in which a learner's goal is to find the best arm for a given context using noisy human preference feedback observed over the arms selected for past contexts. However, existing algorithms assume the reward function is linear, whereas in many real-life applications, such as online recommendation or ranking web search results, it can be complex and non-linear. To overcome this challenge, we use a neural network to estimate the reward function from preference feedback on the previously selected arms. We propose upper confidence bound- and Thompson sampling-based algorithms with sub-linear regret guarantees that efficiently select arms in each round. We also extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution. Experimental results on problem instances derived from synthetic datasets corroborate our theoretical results.
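A minimal sketch of the preference-feedback likelihood that such algorithms typically build on, assuming a Bradley-Terry/logistic link between latent rewards and pairwise outcomes; the paper's neural UCB and Thompson sampling routines are not reproduced, and all values are toy data.

```python
import numpy as np

def preference_prob(reward_i, reward_j):
    """Bradley-Terry model: P(arm i preferred over arm j) = sigmoid(r_i - r_j)."""
    return 1.0 / (1.0 + np.exp(-(reward_i - reward_j)))

def preference_log_likelihood(rewards, duels):
    """Log-likelihood of observed duels [(i, j, i_won), ...] under latent rewards."""
    ll = 0.0
    for i, j, i_won in duels:
        p = preference_prob(rewards[i], rewards[j])
        ll += np.log(p) if i_won else np.log(1 - p)
    return ll

rewards = np.array([0.2, 1.0, -0.5])                # latent arm rewards
duels = [(1, 0, True), (1, 2, True), (0, 2, True)]  # noisy pairwise human feedback
print(preference_log_likelihood(rewards, duels))
```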
- [25] arXiv:2410.00078 (replaced) [pdf, html, other]
Title: Shuffled Linear Regression via Spectral Matching
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Spectral Theory (math.SP); Machine Learning (stat.ML)
Shuffled linear regression (SLR) seeks to estimate latent features through a linear transformation, complicated by unknown permutations in the measurement dimensions. This problem extends traditional least-squares (LS) and Least Absolute Shrinkage and Selection Operator (LASSO) approaches by jointly estimating the permutation, resulting in shuffled LS and shuffled LASSO formulations. Existing methods, constrained by the combinatorial complexity of permutation recovery, often address small-scale cases with limited measurements. In contrast, we focus on large-scale SLR, particularly suited for environments with abundant measurement samples. We propose a spectral matching method that efficiently resolves permutations by aligning spectral components of the measurement and feature covariances. Rigorous theoretical analyses demonstrate that our method achieves accurate estimates in both shuffled LS and shuffled LASSO settings, given a sufficient number of samples. Furthermore, we extend our approach to address simultaneous pose and correspondence estimation in image registration tasks. Experiments on synthetic datasets and real-world image registration scenarios show that our method outperforms existing algorithms in both estimation accuracy and registration performance.
- [26] arXiv:2410.20068 (replaced) [pdf, html, other]
Title: Understanding the Effect of GCN Convolutions in Regression Tasks
Comments: 25 pages
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Graph Convolutional Networks (GCNs) have become a pivotal method in machine learning for modeling functions over graphs. Despite their widespread success across various applications, their statistical properties (e.g., consistency, convergence rates) remain ill-characterized. To begin addressing this knowledge gap, we consider networks for which the graph structure implies that neighboring nodes exhibit similar signals and provide statistical theory for the impact of convolution operators. Focusing on estimators based solely on neighborhood aggregation, we examine how two common convolutions - the original GCN and GraphSAGE convolutions - affect the learning error as a function of the neighborhood topology and the number of convolutional layers. We explicitly characterize the bias-variance type trade-off incurred by GCNs as a function of the neighborhood size and identify specific graph topologies where convolution operators are less effective. Our theoretical findings are corroborated by synthetic experiments and provide a starting point for a deeper quantitative understanding of convolutional effects in GCNs, offering rigorous guidelines for practitioners.
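A minimal NumPy rendering of the two aggregation operators discussed: the symmetrically normalized GCN convolution with self-loops and a GraphSAGE-style mean aggregation, shown on a toy path graph; the graph and signal are illustrative.

```python
import numpy as np

def gcn_convolution(A, H):
    """Original GCN operator: D^{-1/2} (A + I) D^{-1/2} H (self-loops added)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return (d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]) @ H

def sage_mean_convolution(A, H):
    """GraphSAGE-style mean aggregation over neighbors, concatenated with the node itself."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)
    neighbor_mean = (A @ H) / deg
    return np.concatenate([H, neighbor_mean], axis=1)

# Path graph on 4 nodes with a scalar signal per node.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.array([[0.0], [1.0], [2.0], [3.0]])
print(gcn_convolution(A, H).ravel())
print(sage_mean_convolution(A, H))
```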
- [27] arXiv:2503.21138 (replaced) [pdf, html, other]
Title: A Computational Framework for Efficient Model Evaluation with Causal Guarantees
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
To reduce the cost of experimental evaluation of models, we introduce a computational theory of evaluation for prediction and decision models: we build evaluation models to accelerate evaluation procedures. We prove upper bounds on the generalized error and the generalized causal-effect error of given evaluation models. We also prove efficiency and consistency of the causal effect estimated, via prediction, for the deployed subject with respect to the evaluation metric. To learn evaluation models, we propose a meta-learner that handles the problem of heterogeneous evaluation-subject spaces. Compared with existing evaluation approaches, our (conditional) evaluation model reduces evaluation errors by 24.1%-99.0% across 12 scenes, including individual medicine, scientific simulation, social experiments, business activities, and quantum trade. Evaluation time is reduced by 3-7 orders of magnitude compared with experiments or simulations.