Can Models Prove Their Own Work?
The Brief
Models are making it cheaper to generate more attempts: faster kernels, possible protein binders, new data-center demand, and software output from coding agents. The harder question is whether models can prove their own work, or where tests, labs, buyers, infrastructure, and human judgment still have to decide what is real.
Features
Makora Wants To Automate CUDA Performance, Then Prove The Speedup Is Real

The SemiAnalysis conversation starts with Makora's original premise: automate the manual work of AI performance engineering. Makora began with generated GPU kernels. In the interview, kernel generation sits inside a wider system for choosing, testing, and deploying performance improvements across AI infrastructure. That system includes inference algorithms, low-precision number formats, evaluation sandboxes, and customer-specific hardware constraints. Inference is the act of running a trained model to generate outputs for users. For AI businesses, better inference performance can mean lower cost, lower latency, or more tokens served on the same hardware. The interview is mostly a company-side account. Abdelfattah gives concrete technical details, and some claims line up with Makora's own blog posts and an arXiv paper. The product's general production performance still needs independent evidence.
Makora points to the next AI infrastructure bottleneck. The scarce work is shifting from writing every kernel by hand to proving that an automatically found optimization is real, correct, portable, and worth deploying.
Generated kernels can reward-hack benchmarks, so the scarce asset includes the sandbox, profiler, reward signal, integration layer, and hardware-specific judgment that separates real speed from benchmark theater.
Makora's SMC-SD and FP4 examples depend on operating conditions: low-batch, low-latency inference for SMC-SD; approximation rather than exact sampling; and low-precision tricks that behave differently on NVIDIA and AMD hardware. Modern AI performance is becoming too hardware-specific, evaluation-sensitive, and deployment-dependent to treat raw generated code as the finished product.
Biohub Is Turning Evolution Into a Search Engine for Protein Design

Alex Rives' core argument is that protein biology is starting to look like a scaling problem. Proteins are chains of amino acids. Evolution has generated billions of those chains, and the choices inside them reflect folding, function, molecular interactions, and survival pressure. Biohub's bet is that a large enough model trained on enough evolutionary diversity can learn those hidden constraints. The episode frames ESMC as a protein world model: a predictive representation of a domain. For proteins, that means a model that can learn enough about sequence, structure, and function to help map unknown biology, predict shapes, and search for new designs. Biohub's claim stops well short of solved drug discovery. It is building open protein infrastructure: ESMC as the language model, ESMFold2 for structure prediction, and ESM Atlas as a giant searchable map of protein space.
Rives argues for scale while still leaving room for scientific priors, structure data, and experimental validation. ESMFold2 remains a structure-prediction system, and Biohub's public model cards keep validation requirements explicit.
The work shifts from encoding every biological assumption by hand toward training large representations on evolution's archive, searching protein space with those representations, and spending scarce experimental capacity on the data and hypotheses most worth testing.
Antibodies are the key test. If ESMC-derived representations help design scFvs and other binders in wet-lab assays, open protein foundation models may move the bottleneck from model architecture toward assays, robotics, measurement, safety testing, and the speed of experimental feedback.
AI Data Centers Are Becoming the Buyer of First Resort for Hard Tech
McCormick's accessible argument starts from an unpopular fact: AI data centers are becoming highly visible local demands on power, land, water, construction, and grid capacity. He thinks the backlash misses a second-order effect. Data centers may be one of the few commercial customers big enough to fund hard technologies before those technologies reach mature economics. Many physical technologies do not improve on software timelines. A better reactor, transformer, geothermal system, optical interconnect, or modular-construction method may need factories, permits, pilots, interconnection, certifications, and operating experience before it becomes cheap. Without an early buyer, it gets trapped between a promising prototype and a bankable market. This feature uses the public preview and corroborating outside context.
Data centers are becoming a procurement regime. They turn AI demand into hard orders for physical capacity, and those orders can finance technologies that normal markets might leave stranded at pilot scale.
Apollo was publicly funded and mission-driven. AI data centers are privately owned, profit-seeking, and locally sited. Their spillovers may be real, but they are not automatic. If ratepayers absorb grid costs, if gas generation fills the gap, or if communities receive little durable benefit, then learning curves can become private upside and public burden.
The allocation question is who captures the upside. If data-center demand accelerates cleaner firm power, better grid hardware, faster construction, cheaper photonics, and more resilient supply chains, the spillovers could matter beyond AI. If it mostly locks up scarce power and equipment for a narrow set of hyperscalers and labs, the backlash will look much more rational.
Waste Tokens, Save Human Time

The conversation treats AI software work as an allocation problem. Code and tokens can both be scarce in some settings, but the binding constraint is often human time: deciding what should exist, steering the model, rejecting bad output, choosing the right building blocks, and verifying that the result works. Guillermo Rauch opens with the idea of software factories. An engineer is judged less by the code they personally type and more by the systems, prompts, tools, and workflows they leave behind. The gap between operators can widen because one person's judgment can coordinate much more machine output. Max Hodak pushes back on measuring that output by visible volume. Token consumption, he says, can become the new lines-of-code metric: easy to count, weak as a measure of value. The useful questions are whether the model work saved time, improved the final result, or helped a human get unstuck. Naval then gives the sharper heuristic: spend tokens when they are cheaper than human hours. If the output is rough, spend more tokens on review and rewrite. But he keeps the claim tied to checkable work.
Token spend buys useful leverage only at the edge of a checkable task. Model activity still has to turn into working output.
Human judgment remains the scarce control layer. The model can write more code, propose tradeoffs, and try another route. The human still has to choose the problem, supply the missing context, pick the architecture, decide what to reuse, and know when the result is good enough.
The building-block argument checks the "waste tokens" slogan. If agents spend every run recreating queues, databases, deployment patterns, integrations, and observability from scratch, the team is burning model work on solved coordination. Good infrastructure lets tokens go toward the new edge of the task instead of the foundation.
Supplementary Resources
Retained same-date feed resources that support the issue, linked directly to the original sources.