More Than a GPT-Wrapper: Why Top-Tier ML Engineers Master the Geometry of Math
What distinguishes a "GPT-wrapper" developer from a top-tier Machine Learning Engineer? Based on our recent chat with Sondre, a ReLU alumnus, the answer often comes down to Linear Algebra.
Sondre is currently pursuing a Master’s in Computational and Mathematical Engineering (ICME) at Stanford University. While his time at NTNU provided a strong technical foundation, his current program has pushed him deep into the theoretical underpinnings of the algorithms we often take for granted.
We caught up with him to discuss the importance of mathematical intuition, why the best Quants in the world still love Linear Regression, and the geometric concepts every data scientist should master.
The Shift: From Implementation to "Pure Math"
While many Computer Science programs focus heavily on implementation and software development, Sondre’s track at Stanford is distinctly different. It is less about importing libraries and more about proving why those libraries work.
"It’s a lot of work, and the focus is heavily theoretical," Sondre explains. "Many students here, especially those from France, come from pure math backgrounds. We spend a lot of time on proofs and Linear Algebra. It’s a different perspective than the standard Computer Science approach, but it gives you total control over what you are building."
The Academic Take: Why You Should Revisit Linear Regression
One of the most surprising insights from Sondre’s time in Silicon Valley is the industry’s reliance on "basic" models. He recounts speaking with the Head of Risk at a major quantitative firm who relies almost exclusively on Linear Regression.
Why? Interpretability and Control.
"When you use a complex neural network, you often lose the ability to explain why the model made a specific prediction," Sondre notes. "In high-stakes fields like finance, if you can't explain the output, nobody will listen to you."
But it goes deeper than just "keeping it simple." Sondre highlights that a deep understanding of Regularization allows Linear Regression to handle complex, correlated data just as well as black-box models.
1. The Geometry of Regularization (L1 vs. L2)
Sondre is currently taking a course taught by Trevor Hastie, a legend in the field and co-author of The Elements of Statistical Learning. A key takeaway from Hastie’s course is the geometric understanding of norms.
"We often just memorize the formulas for L1 (Lasso) and L2 (Ridge) regularization, but understanding their geometry changes how you view the model," Sondre says.
L1 Norm (Lasso): Geometrically, the constraint region looks like a diamond. Because the optimal solution often lands on a "corner" of this diamond, L1 regularization drives some coefficients to exactly zero. This creates sparse solutions, effectively performing feature selection for you.
L2 Norm (Ridge): The constraint region is a circle (a sphere in higher dimensions). Ridge shrinks coefficients toward zero but rarely zeros any of them out entirely, which makes it better suited to handling multicollinearity without dropping variables.
"I’ve become a bit in love with regressions because of this," Sondre admits. "When you understand the geometry, you realize that Linear Regression involves projecting data onto a subspace. It’s elegant, and you know exactly what is happening to your variables."
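To make that geometry concrete, here is a minimal sketch (an illustration of the idea, not material from Hastie’s course) that fits Lasso and Ridge on the same correlated synthetic data using scikit-learn. The diamond-shaped L1 constraint tends to zero out several coefficients exactly, while the circular L2 constraint only shrinks them.

```python
# Minimal sketch: L1 (Lasso) vs. L2 (Ridge) on the same correlated data.
# Assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 10

# Correlated features: a few shared latent factors plus noise.
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + 0.5 * rng.normal(size=(n, p))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0])
y = X @ true_beta + rng.normal(size=n)

X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1: diamond constraint -> exact zeros
ridge = Ridge(alpha=10.0).fit(X, y)  # L2: circular constraint -> shrinkage only

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Exact zeros -> Lasso:", int(np.sum(lasso.coef_ == 0)),
      "| Ridge:", int(np.sum(ridge.coef_ == 0)))
```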
2. Orthogonality and PCA
Another concept Sondre wishes he had utilized more during his projects at NTNU is Principal Component Analysis (PCA)—not just for dimensionality reduction, but for handling correlation.
"In real-world projects, like the ones we do in ReLU, you often have variables that are highly correlated," he explains. "If you feed that straight into a model, your coefficients (betas) stop representing the marginal effect of a single variable, because they are entangled with others."
By using techniques like PCA, you can transform your features into orthogonal, uncorrelated components. This keeps the model stable and your interpretations valid. "Standardizing variables and understanding their correlation structure is much more important than people think," he adds.
Sondre specifically reflected on his previous work experience at Norges Bank Investment Management (NBIM) to illustrate this point. "I think I could have used these techniques a lot more in the project I had at NBIM," he says. Working with financial data often means dealing with a massive number of variables that move together. Looking back, he realizes that simply feeding that data into a black-box model isn't optimal. By using PCA to decorrelate those variables before modeling, he could have arrived at a much cleaner, more robust interpretation of the market drivers.
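As a hypothetical sketch of that workflow (illustrative only, not code from the NBIM project), the snippet below simulates five correlated "market" series driven by two latent factors, standardizes them, and verifies that the resulting principal components are uncorrelated.

```python
# Minimal sketch: decorrelating features with PCA before modeling.
# Assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500

# Two latent "market factors" driving five observed, highly correlated series.
factors = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, 5))
X = factors @ loadings + 0.3 * rng.normal(size=(n, 5))

# Standardize first, then rotate onto orthogonal principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA()
Z = pca.fit_transform(X_std)

def max_off_diag_corr(M):
    # Largest absolute correlation between distinct columns of M.
    corr = np.corrcoef(M, rowvar=False)
    return np.max(np.abs(corr - np.eye(corr.shape[0])))

# Off-diagonal correlations are large for raw features and ~0 for components.
print("Max |corr| between raw features: ", round(max_off_diag_corr(X_std), 2))
print("Max |corr| between components:   ", round(max_off_diag_corr(Z), 2))
print("Variance explained per component:", pca.explained_variance_ratio_.round(2))
```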
A sunny afternoon at Stanford’s historic Main Quad (Photo: Sondre Rogde)
The "AI Bubble" vs. Deep Engineering
Living in the Bay Area, Sondre sees the hype cycle up close. He distinguishes between "GPT-wrappers" (startups that simply call an API) and companies doing "Deep Engineering."
"I don't think the AI industry is a bubble for the big players, but for the shallow applications, it might be," he observes.
He points to tools like Cursor as an example of AI done right. "The value isn't just in using a language model. The value is in the engineering required to make agents understand a codebase and perform complex reasoning. That requires a level of mathematical maturity that goes beyond just prompting a model."
Final Advice for ReLU Students
Sondre’s advice to current members is not to shy away from the advanced models, but to ensure the foundation is solid first.
"At NTNU, we have the freedom to build really cool, practical things. But my advice is to take the time to understand the data before you throw a Deep Learning model at it. Look at the correlations. Consider a baseline Linear Regression. Understand the geometry of your loss function."
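That advice translates into a very short pre-modeling checklist. The sketch below is one illustrative way to do it (the helper function and the cross-validated R^2 baseline are assumptions for illustration, not something Sondre prescribed): inspect the correlation structure, then record the score a plain Linear Regression achieves before any deep model enters the picture.

```python
# Hypothetical pre-modeling checklist: check correlations, then establish a
# linear baseline. X is an (n_samples, n_features) array, y the target vector.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def baseline_report(X, y):
    # 1. Correlation structure: how entangled are the features?
    corr = np.corrcoef(X, rowvar=False)
    strongest = np.max(np.abs(corr - np.eye(corr.shape[0])))
    print(f"Strongest feature-feature correlation: {strongest:.2f}")

    # 2. Baseline: cross-validated R^2 of plain Linear Regression.
    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    print(f"Linear baseline R^2: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

Any more complex model you reach for afterwards now has a concrete number to beat.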
"My vision for ReLU was always to bridge the gap between academia and industry," Sondre concludes. "And in the industry, the best engineers are often the ones who master the basics."