Hi, I'm Etash Guha

I'm a Ph.D. student in the Computer Science and Engineering Department at the University of Washington.
I research how to design and improve data curation protocols for training large text and image models, including synthetic data generation, data filtering, and online data sampling.
I am also currently a Researcher at SambaNova Systems, working on the reliability of large language models. Most recently, I was a Research Intern under Dr. Emtiyaz Khan on the Approximate Bayesian Inference Team at RIKEN AIP in Tokyo, Japan. I was both an undergraduate student and a Research Assistant at Georgia Tech, where I worked with Vidya Muthukumar, Ashwin Pananjady, Jacob Abernethy, and Xiaoming Huo.
I have collaborated with researchers, traders, and software engineers at SambaNova Systems, FORT LP, and SAS.

Featured Research Publications

DataComp-LM
Conference on Neural Information Processing Systems, Datasets and Benchmarks Track
Architecture Search Framework for LLM Inference
Preprint (submitted 10 Dec 2024)
Conformalization of Sparse Generalized Linear Models
International Conference on Machine Learning 2023
MINT-1T
Conference on Neural Information Processing Systems, Datasets and Benchmarks Track
Generalization Bounds for Magnitude-Based Pruning via Sparse Matrix Sketching
ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning
Solving Robust MDPs through No-Regret Dynamics
Transactions on Machine Learning Research
Inverse Reinforcement Learning
Conference on Uncertainty in Artificial Intelligence 2024
On Accelerated Perceptrons and Beyond
International Conference on Learning Representations 2023