Unsupervised Learning

Ep 74: Chief Scientist of Together.AI Tri Dao On The End of Nvidia's Dominance, Why Inference Costs Fell & The Next 10X in Speed

Episode Summary

Tri Dao, Chief Scientist at Together AI, Princeton professor, and creator of FlashAttention and Mamba, discusses how inference optimization has driven costs down 100x since ChatGPT's launch through memory optimization, sparsity advances, and hardware-software co-design. He predicts the AI hardware landscape will shift from Nvidia's current 90% dominance to a more diversified ecosystem within 2-3 years, as specialized chips emerge for distinct workload categories: low-latency agentic systems, high-throughput batch processing, and interactive chatbots. Dao shares his surprise at how genuinely useful AI models have become for expert-level work, making him 1.5x more productive at GPU kernel optimization with tools like Claude Code and o1. The conversation explores whether current transformer architectures can reach expert-level performance or whether approaches like mixture of experts and state space models are needed to achieve AGI at reasonable cost. Looking ahead, Dao sees another 10x cost reduction coming from continued hardware specialization, improved kernels, and architectural advances such as ultra-sparse models, while emphasizing that the biggest remaining challenge is generating expert-level training data for domains that lack extensive internet coverage.

Episode Notes

Fill out this short listener survey to help us improve the show: https://forms.gle/bbcRiPTRwKoG2tJx8

 


(0:00) Intro

(1:58) Nvidia's Dominance and Competitors

(4:01) Challenges in Chip Design

(6:26) Innovations in AI Hardware

(9:21) The Role of AI in Chip Optimization

(11:38) Future of AI and Hardware Abstractions

(16:46) Inference Optimization Techniques

(33:10) Specialization in AI Inference

(35:18) Deep Work Preferences and Low Latency Workloads

(38:19) Fleet Level Optimization and Batch Inference

(39:34) Evolving AI Workloads and Open Source Tooling

(41:15) Future of AI: Agentic Workloads and Real-Time Video Generation

(44:35) Architectural Innovations and Expert-Level AI

(50:10) Robotics and Multi-Resolution Processing

(52:26) Balancing Academia and Industry in AI Research

(57:37) Quickfire

 

With your co-hosts: 

@jacobeffron 

- Partner at Redpoint, Former PM Flatiron Health 

@patrickachase 

- Partner at Redpoint, Former ML Engineer LinkedIn 

@ericabrescia 

- Former COO GitHub, Founder Bitnami (acq’d by VMware) 

@jordan_segall 

- Partner at Redpoint