Etash Guha

Researcher

I research how to design data curation protocols for training large text and image models. This includes synthetic data generation, data filtering, and online data sampling.

Education

Aug. 2024
Ph.D. in Computer Science and Engineering
University of Washington, Seattle, Washington
Researching data curation
Aug. 2018 - May 2022
B.S. in Computer Science
Georgia Institute of Technology, Atlanta, GA
Overall GPA: 3.97/4.00, Threads in Intelligence and Theory

Industry Research Experience

Oct. 2023 - Current
SambaNova Systems, Palo Alto, California
Researcher, NLP
Initiated work to improve the reliability and safety of open-source large language models for large customers
May 2021 - May 2023
SambaNova Systems, Palo Alto, California
Research Intern, ML for PnR
Developed a learned cost model using graph neural networks to predict the quality of chip placement, beating several hand-crafted heuristics
May 2020 - Aug. 2021
FORT LP, New York City, New York
Quantitative Research Intern, Transaction Analysis Group
Implemented a pipeline for analyzing slippage data and a neural network strategy for price prediction using NLP data
May 2019 - Aug. 2019
SAS, Cary, North Carolina
Software Engineering Intern, SAS Model Manager
Integrated the Bidirectional Encoder Representations from Transformers (BERT) NLP model into SAS products using PyTorch
Jan. 2019 - May 2019
Parmonic, Atlanta, Georgia
Software Engineering Intern
Used Python with libraries such as SciKit and OpenCV for data and video analysis

Academic Research Experience

May 2023 - Oct. 2023
RIKEN Project for Artificial Intelligence, Tokyo, Japan
Research Intern
Advisor: Mohammad Emtiyaz Khan
Collaborated with the Approximate Bayesian Inference Team under Emtiyaz Khan to apply Bayesian principles to ML safety. Developed simple and practical uncertainty quantification methods for deep learning, resulting in an ICLR 2024 submission.
May 2022 - May 2023
Georgia Institute of Technology, Atlanta, Georgia
Research Assistant
Advisors: Jacob Abernethy, Xiaoming Huo
Developed generalization bounds for magnitude-based pruning. Proved that sparse and wide neural networks forget less. Developed a no-regret framework to solve robust Markov decision processes.
Aug. 2018 - May 2022
Georgia Institute of Technology, Atlanta, Georgia
Undergraduate Research Assistant
Advisors: Jacob Abernethy, Vidya Muthukumar, Ashwin Pananjady
Worked on designing a general class of frameworks for two-player Fenchel games to model perceptron algorithms. Tested a Hebbian plasticity-based learning system and analyzed its computational capacity. Analyzed an efficient algorithm for generating conformal prediction sets. Developed an inverse reinforcement learning algorithm for linear stochastic bandits. Developed a learned methodology to efficiently and accurately generate solutions to NP-hard problems.
Aug. 2019 - May 2019
IVALab, Atlanta, Georgia
Undergraduate Research Assistant, IVALab
Advisor: Patricio Vela
Developed a more efficient autonomous exploration method for robots, with 9.8% higher accuracy than standard frontier-based exploration, using ROS and C++

Honors and Awards

2018-2022
Stamps President's Scholarship at Georgia Tech
Full-ride merit scholarship awarded to 40 freshmen at Georgia Tech
2018-2022
Faculty List
Awarded to students with 4.0 GPA
2023
Fatima Fellowship
An International Mentorship Program for Aspiring Researchers in Computer Science given to 40 students

Patents

2022
US Patent on "Learned Cost Models For Performance Optimization On Dataflow Architecture"
Awarded a Patent based on work done at SambaNova Systems
2022
US Patent on "Performance Optimization of Dataflow Processors"
Awarded a Patent based on work done at SambaNova Systems
2022
US Patent on "Estimating Resource Costs for Computing Tasks for a Reconfigurable Dataflow Computing System"
Awarded a Patent based on work done at SambaNova Systems
2018
US Patent on "Volume controllable toilet flush systems and methods of use"
Awarded a Patent based on novel toilet design for adjustable water usage

Publications

Conference

C11
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt
Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
C10
DataComp-LM: In search of the next generation of training sets for language models
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, ..., Achal Dave, Ludwig Schmidt, Vaishaal Shankar
Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
C9
On the Diminishing Returns of Width for Continual Learning
Etash Guha, Vihan Lakshman
International Conference on Machine Learning 2024; ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning.
C8
Conformal Prediction via Regression-as-Classification
Etash Guha, Shlok Natarajan, Thomas Möllenhoff, Emtiyaz Khan, Eugene Ndiaye
International Conference on Learning Representations 2024.
C7
Generalization Bounds for Magnitude-Based Pruning via Sparse Matrix Sketching
Etash Guha, Prasanjit Dubey, Xiaoming Huo
ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning.
C6
Solving Robust MDPs through No-Regret Dynamics
Etash Guha
Transactions on Machine Learning Research.
C5
One Shot Inverse Reinforcement Learning for Stochastic Linear Bandits
Etash Guha, Jim James, Krishna Acharya, Ashwin Pananjady, Vidya Muthukumar
Conference on Uncertainty in Artificial Intelligence 2024.
C4
Conformalization of Sparse Generalized Linear Models
Etash Guha, Eugene Ndiaye, Xiaoming Huo
International Conference on Machine Learning 2023.
C3
On Accelerated Perceptrons and Beyond
Guanghui Wang, Rafael Hanashiro, Etash Guha, Jacob Abernethy
International Conference on Learning Representations 2023.
C2
A Variational Approach for Combinatorial Optimization Problems on Graphs
Haoran Sun, Etash Guha, Hanjun Dai, Le Song
International OPT Workshop on Optimization for Machine Learning @ NeurIPS 2023.
C1
Learned Cost Model for Placement on Reconfigurable Dataflow Hardware
Etash Guha, Tianxiao Jiang, Andrew Deng, Muthu Annamalai, Jian Zhang
Design Automation Conference (poster) 2022.

Poster

P1
Predicting Situations with Loop Closures in Frontier Based Robotic Exploration
Etash Guha, Patricio Vela
National Conference on Undergraduate Research 2019.

Service

Reviewer

International Conference on Learning Representations (ICLR) 2024
Conference on Neural Information Processing Systems (NeurIPS) 2023
Conference on Artificial Intelligence and Statistics (AISTATS) 2024
Optimization for Machine Learning Workshop @ NeurIPS (OPT) 2023
Duality Principles for Modern Machine Learning Workshop @ ICML (DP4ML) 2023