Sanjiv Kumar is a Google Fellow and VP at Google Research, where he leads a team working on large foundation models, including LLMs and Generative AI for Gemini. His research interests include rethinking existing modeling and compute paradigms in LLMs to achieve faster training and inference and drastically improved reasoning capabilities. Many of these techniques power Google's Gemini models.
He also leads research in deep retrieval and ranking and in massive-scale similarity search, driving a large number of applications in Google Search, YouTube, Ads, Cloud, Android, Gmail and Chrome. He led the development of ScaNN, a widely used, state-of-the-art open-source similarity search engine.
Sanjiv has published more than 125 papers in the fields of machine learning, computer vision and robotics, and holds 60+ patents. His work has received multiple awards, e.g., for the convergence of Adam at ICLR 2018 and for speculative cascades at ICLR 2025. He is an action editor of JMLR and holds a PhD (2005) from the School of Computer Science at Carnegie Mellon University.
LLMs and Generative AI, Large Scale Machine Learning, Health AI, Robotics, Computer Vision
EECS6898: Large-Scale Machine Learning, Fall 2010, Columbia University, New York, NY.
Structured Preconditioners in Adaptive Optimization: A Unified Analysis [pdf]
International Conference on Machine Learning (ICML), 2025.
LAuReL: Learned Augmented Residual Layer [pdf]
International Conference on Machine Learning (ICML), 2025.
Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation [pdf]
International Conference on Machine Learning (ICML), 2025.
Faster Cascades via Speculative Decoding [pdf]
International Conference on Learning Representations (ICLR), 2025.
Reasoning with Latent Thoughts: On the Power of Looped Transformers [pdf]
International Conference on Learning Representations (ICLR), 2025.
LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [pdf]
International Conference on Learning Representations (ICLR), 2025.
Better autoregressive regression with LLMs via regression-aware fine-tuning [pdf]
International Conference on Learning Representations (ICLR), 2025.
Efficient stagewise pretraining via progressive subnetworks [pdf]
International Conference on Learning Representations (ICLR), 2025.
On the Convergence of Adam and Beyond [pdf]
International Conference on Learning Representations (ICLR), 2018.
On the Inductive Bias of Stacking Towards Improving Reasoning [pdf]
Neural Information Processing Systems (NeurIPS), 2024.
Accelerating Blockwise Parallel Language Models with Draft Refinement [pdf]
Neural Information Processing Systems (NeurIPS), 2024.
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [pdf]
International Conference on Machine Learning (ICML), 2024.
Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [pdf]
International Conference on Machine Learning (ICML), 2024.
USTAD: Unified Single-model Training Achieving Diverse Scores for Information Retrieval [pdf]
International Conference on Machine Learning (ICML), 2024.
Tandem Transformers for Inference Efficient LLMs [pdf]
International Conference on Machine Learning (ICML), 2024.
Think Before You Speak: Training Language Models with Pause Tokens [pdf]
International Conference on Learning Representations (ICLR), 2024.
Plugin Estimators for Selective Classification with Out-Of-Distribution Detection [pdf]
International Conference on Learning Representations (ICLR), 2024.
Language Model Cascades: Token-Level Uncertainty and Beyond [pdf]
International Conference on Learning Representations (ICLR), 2024.
Two-Stage LLM Fine-Tuning with Less Specialization and More Generalization [pdf]
International Conference on Learning Representations (ICLR), 2024.
DistillSpec: Improving Speculative Decoding Via Knowledge Distillation [pdf]
International Conference on Learning Representations (ICLR), 2024.
Functional Interpolation for Relative Positions Improves Long Context Transformers [pdf]
International Conference on Learning Representations (ICLR), 2024.
On Bias-Variance Alignment in Deep Models [pdf]
International Conference on Learning Representations (ICLR), 2024.
Learning to Reject Meets Long-Tail Learning [pdf]
International Conference on Learning Representations (ICLR), 2024.
On Student-Teacher Deviations in Distillation: Does It Pay to Disobey? [pdf]
Neural Information Processing Systems (NeurIPS), 2023.
ResMem: Learn What You Can and Memorize the Rest [pdf]
Neural Information Processing Systems (NeurIPS), 2023.
SOAR: Improved Indexing for Approximate Nearest Neighbor Search [pdf]
Neural Information Processing Systems (NeurIPS), 2023.
When Does Confidence-Based Cascade Deferral Suffice? [pdf]
Neural Information Processing Systems (NeurIPS), 2023.
Efficient Training of Language Models using Few-Shot Learning [pdf]
International Conference on Machine Learning (ICML), 2023.
Leveraging Importance Weights in Subset Selection [pdf]
International Conference on Learning Representations (ICLR), 2023.
The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [pdf]
International Conference on Learning Representations (ICLR), 2023.
Teacher Guided Training: An Efficient Framework for Knowledge Distillation [pdf]
International Conference on Learning Representations (ICLR), 2023.
Supervision Complexity and its Role in Knowledge Distillation [pdf]
International Conference on Learning Representations (ICLR), 2023.
Automating Nearest Neighbor Search Configuration with Constrained Optimization [pdf]
International Conference on Learning Representations (ICLR), 2023.
Serving Graph Compression for Graph Neural Networks [pdf]
International Conference on Learning Representations (ICLR), 2023.
Decoupled Context Processing for Context Augmented Language Modeling [pdf]
Neural Information Processing Systems (NeurIPS), 2022.
Post-hoc Estimators for Learning to Defer to an Expert [pdf]
Neural Information Processing Systems (NeurIPS), 2022.
TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s [pdf]
Neural Information Processing Systems (NeurIPS), 2022.
In Defense of Dual-Encoders for Neural Ranking [pdf]
International Conference on Machine Learning (ICML), 2022.
Robust Training of Neural Networks Using Scale Invariant Architectures [pdf]
International Conference on Machine Learning (ICML), 2022.
Efficient Training of Retrieval Models Using Negative Cache [pdf]
Neural Information Processing Systems (NeurIPS), 2021.
Batch Active Learning at Scale [pdf]
Neural Information Processing Systems (NeurIPS), 2021.
A Statistical Perspective on Distillation [pdf]
International Conference on Machine Learning (ICML), 2021.
Disentangling Labeling and Sampling Bias for Learning in Large-output Spaces [pdf]
International Conference on Machine Learning (ICML), 2021.
RankDistil: Knowledge Distillation for Ranking [pdf]
International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
Overparameterisation and Worst-case Generalisation: Friend or Foe? [pdf]
International Conference on Learning Representations (ICLR), 2021.
Adaptive Federated Optimization [pdf]
International Conference on Learning Representations (ICLR), 2021.
Long-tail Learning via Logit Adjustment [pdf]
International Conference on Learning Representations (ICLR), 2021.
Evaluations and Methods for Explanation Through Robustness Analysis [pdf]
International Conference on Learning Representations (ICLR), 2021.
Coping With Label Shift via Distributionally Robust Optimisation [pdf]
International Conference on Learning Representations (ICLR), 2021.
O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers [pdf]
Neural Information Processing Systems (NeurIPS), 2020.
Why are Adaptive Methods Good for Attention Models? [pdf]
Neural Information Processing Systems (NeurIPS), 2020.
Robust Large-Margin Learning in Hyperbolic Space [pdf]
Neural Information Processing Systems (NeurIPS), 2020.
Learning Discrete Distributions: User vs Item-Level Privacy [pdf]
Neural Information Processing Systems (NeurIPS), 2020.
Multi-Stage Influence Function [pdf]
Neural Information Processing Systems (NeurIPS), 2020.
Low-Rank Bottleneck in Multi-head Attention Models [pdf]
International Conference on Machine Learning (ICML), 2020.
Does Label Smoothing Mitigate Label Noise? [pdf]
International Conference on Machine Learning (ICML), 2020.
Accelerating Large-Scale Inference with Anisotropic Vector Quantization [pdf]
International Conference on Machine Learning (ICML), 2020.
Federated Learning with Only Positive Labels [pdf]
International Conference on Machine Learning (ICML), 2020.
Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes [pdf]
International Conference on Learning Representations (ICLR), 2020.
Are Transformers Universal Approximators of Sequence-to-Sequence Functions? [pdf]
International Conference on Learning Representations (ICLR), 2020.
Can Gradient Clipping Mitigate Label Noise? [pdf]
International Conference on Learning Representations (ICLR), 2020.
Pre-training Tasks for Embedding-based Large-scale Retrieval [pdf]
International Conference on Learning Representations (ICLR), 2020.
Learning to Learn by Zeroth-Order Oracle [pdf]
International Conference on Learning Representations (ICLR), 2020.
How Does Noise Help Robustness? Explanation and Exploration under the Neural SDE Framework [pdf]
International Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Multilabel reductions: what is my loss optimising? [pdf]
Neural Information Processing Systems (NeurIPS), 2019.
Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces [pdf]
Neural Information Processing Systems (NeurIPS), 2019.
Sampled Softmax with Random Fourier Features [pdf]
Neural Information Processing Systems (NeurIPS), 2019.
Escaping Saddle Points with Adaptive Gradient Methods [pdf]
International Conference on Machine Learning (ICML), 2019.
Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling [pdf]
International Conference on Machine Learning (ICML), 2019.
Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks [pdf]
International Conference on Learning Representations (ICLR), 2019.
Stochastic Negative Mining for Learning with Large Output Spaces [pdf]
International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
Optimal Noise-Adding Mechanism in Additive Differential Privacy [pdf]
International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
Adaptive Methods for Nonconvex Optimization [pdf]
Neural Information Processing Systems (NIPS), 2018.
Loss Decomposition for Fast Learning in Large Output Spaces [pdf]
International Conference on Machine Learning (ICML), 2018.
cpSGD: Communication-efficient and differentially-private distributed SGD [pdf]
Neural Information Processing Systems (NIPS), 2018.
Multiscale Quantization for Fast Similarity Search [pdf]
Neural Information Processing Systems (NIPS), 2017.
Stochastic Generative Hashing [pdf]
International Conference on Machine Learning (ICML), 2017.
Distributed Mean Estimation with Limited Communication [pdf]
International Conference on Machine Learning (ICML), 2017.
Learning Spread-out Local Feature Descriptors [pdf]
International Conference on Computer Vision (ICCV), 2017.
Fast Classification with Binary Prototypes [pdf]
International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
Orthogonal Random Features [pdf]
Neural Information Processing Systems (NIPS), 2016.
Binary Embeddings with Structured Hash Projections [pdf]
International Conference on Machine Learning (ICML), 2016.
Quantization based Fast Inner Product Search [pdf]
International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
Spherical Random Features [pdf]
Neural Information Processing Systems (NIPS), 2015.
Structured Transforms for Small-Footprint Deep Learning [pdf]
Neural Information Processing Systems (NIPS), 2015.
Fast Orthogonal Projection Based on Kronecker Product [pdf]
International Conference on Computer Vision (ICCV), 2015.
An Exploration of Parameter Redundancy in Deep Networks with Circulant Projections [pdf]
International Conference on Computer Vision (ICCV), 2015.
Quantization based Fast Inner Product Search [pdf]
arXiv:1509.01469, 2015.
Discrete Graph Hashing [pdf] [supplementary]
Neural Information Processing Systems (NIPS), 2014.
Circulant Binary Embedding [pdf] [code]
International Conference on Machine Learning (ICML), 2014.
pSVM for Learning with Label Proportions [pdf] [supplementary]
International Conference on Machine Learning (ICML), 2013.
Learning Binary Codes for High-Dimensional Data Using Bilinear Projections [pdf]
IEEE Computer Vision and Pattern Recognition (CVPR), 2013.
Large-scale SVD and Manifold Learning [pdf]
Journal of Machine Learning Research (JMLR), 2013.
Angular Quantization-based Binary Codes for Fast Similarity Search [pdf]
Advances in Neural Information Processing Systems (NIPS), 2012.
On the Difficulty of Nearest Neighbor Search [pdf] [supplementary]
International Conference on Machine Learning (ICML), 2012.
Compact Hyperplane Hashing with Bilinear Functions [pdf] [supplementary]
International Conference on Machine Learning (ICML), 2012.
Semi-Supervised Hashing for Large Scale Search [pdf]
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2012.
Sampling Methods for the Nystrom Method [pdf]
Journal of Machine Learning Research (JMLR), 2012.
Hashing With Graphs [pdf]
International Conference on Machine Learning (ICML), 2011.
Large-Scale Manifold Learning [pdf]
IEEE Computer Vision and Pattern Recognition (CVPR), 2008.
Ensemble Nystrom Method [pdf]
Neural Information Processing Systems (NIPS), 2009.
A New Baseline for Image Annotation [pdf]
European Conference on Computer Vision (ECCV), 2008.
Sequential Projection Learning for Hashing with Compact Codes [pdf]
International Conference on Machine Learning (ICML), 2010.
YouTubeCat: Learning to Categorize Wild Web Videos [pdf]
IEEE Computer Vision and Pattern Recognition (CVPR), 2010.
Semi-Supervised Hashing for Scalable Image Retrieval [pdf]
IEEE Computer Vision and Pattern Recognition (CVPR), 2010.
On Sampling-based Approximate Spectral Decomposition [pdf]
International Conference on Machine Learning (ICML), 2009.
Sampling Techniques for the Nystrom Method [pdf]
Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
Face Tracking and Recognition with Visual Constraints in Real-World Videos [pdf]
IEEE Computer Vision and Pattern Recognition (CVPR), 2008.
Classification of Weakly-Labeled Data with Partial Equivalence Relations [pdf] [additional results]
IEEE International Conference on Computer Vision (ICCV), 2007.
Discriminative Random Fields [pdf]
International Journal of Computer Vision (IJCV), 68(2), 179-201, 2006.
Exploiting Inference for Approximate Parameter Learning in Discriminative Fields: An Empirical Study [pdf]
Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), 2005.
Models for Learning Spatial Interactions in Natural Images for Context-Based Classification [pdf]
PhD Thesis, The Robotics Institute, School of Computer Science, Carnegie Mellon University, September 2005.
A Hierarchical Field Framework for Unified Context-Based Classification [pdf]
IEEE International Conference on Computer Vision (ICCV), 2005.
Digital Tapestry [pdf]
International Conference on Computer Vision and Pattern Recognition (CVPR), June, 2005.
Approximate Parameter Learning in Discriminative Fields [pdf] [dataset]
Snowbird Learning Workshop, Utah, 2004.
Multiclass Discriminative Fields for Parts-Based Object Detection [pdf]
Snowbird Learning Workshop, Utah, 2004.
Discriminative Fields for Modeling Spatial Dependencies in Natural Images [pdf] [dataset]
Advances in Neural Information Processing Systems, NIPS 16, 2004.
Discriminative Random Fields: A Discriminative Framework for Contextual Interaction in Classification [pdf]
IEEE International Conference on Computer Vision (ICCV), 2003.
Man-Made Structure Detection in Natural Images using a Causal Multiscale Random Field [pdf] [additional results] [dataset]
IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
An Observation-Constrained Generative Approach for Probabilistic Classification of Image Regions [pdf]
Image and Vision Computing, 21, pp. 87-97, 2003.
Probabilistic Classification of Image Regions using an Observation-Constrained Generative Approach [pdf]
ECCV Workshop on Generative Models based Vision (GMBV), 2002.