Hi, I'm Etash Guha

I'm a Ph.D. student in the Computer Science and Engineering Department at the University of Washington.
I research how to design and improve data curation protocols for training large text and image models, including synthetic data generation, data filtering, and online data sampling.
I am also currently a Researcher at SambaNova Systems, working on the reliability of large language models. Most recently, I was a Research Intern under Dr. Emtiyaz Khan on the Approximate Bayesian Inference Team at RIKEN AIP in Tokyo, Japan. I was both an undergraduate student and a Research Assistant at Georgia Tech, where I worked with Vidya Muthukumar, Ashwin Pananjady, Jacob Abernethy, and Xiaoming Huo.
I have collaborated with researchers, traders, and software engineers at SambaNova Systems, FORT LP, and SAS.

Featured Research Publications

DataComp-LM
Conference on Neural Information Processing Systems, Datasets and Benchmarks Track
Architecture Search Framework for LLM Inference
Preprint (submitted 10 Dec 2024)
Conformalization of Sparse Generalized Linear Models
International Conference on Machine Learning 2023
MINT-1T
Conference on Neural Information Processing Systems, Datasets and Benchmarks Track
Generalization Bounds for Magnitude-Based Pruning via Sparse Matrix Sketching
ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning
Solving Robust MDPs through No-Regret Dynamics
Transactions on Machine Learning Research
Inverse Reinforcement Learning
Conference on Uncertainty in Artificial Intelligence 2024
On Accelerated Perceptrons and Beyond
International Conference on Learning Representations 2023