STEM Supremacy - Finding the Best LLMs for Code, Math, and Science

As large language models continue to advance at a blistering pace, much of the hype and attention has focused on their abilities in general language tasks - writing, open-ended generation, question answering and the like. However, a huge area of potential impact lies in how these LLMs can transform STEM fields - coding, mathematics, scientific research and more.

While models like GPT-4, PaLM and Claude have demonstrated general competency in these technical domains, they were largely trained on broad data encompassing many subject areas. But what if an LLM provider took a laser-focused approach to specializing and optimizing their models specifically for STEM use cases? That's exactly what some emerging players are doing.

The team at LMSYS has been rigorously benchmarking and evaluating a range of more technical and scientific language models to understand their unique strengths, weaknesses and ideal fit for developers, quants, researchers and other STEM professionals. Here's a deep dive into the results:

Anthropic's Constitutional Math

While not marketed as a dedicated math specialist, Anthropic's Claude model has been garnering strong reviews for its mathematics skills rooted in the company's principled "constitutional AI" training approach.

In LMSYS evals, Claude scores an impressive 1740 on general mathematics tasks - outperforming GPT-4 (1670) and virtually every other multipurpose model. It also holds its own on more specific quantitative domains:

Math Word Problems: 1780 for Claude vs 1750 for GPT-4
Symbolic Mathematics: 1710 for Claude vs 1650 for GPT-4
Algorithmic Complexity Analysis: 1660 for Claude vs 1590 for GPT-4
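
To put those gaps in perspective, here is a minimal sketch of how a rating difference would translate into a head-to-head win rate. It assumes the scores are Elo-style ratings on the conventional 400-point logistic scale, which the figures above do not confirm; the function name expected_win_rate and the scale parameter are illustrative only.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of model A against model B under the standard Elo logistic model.

    Assumption: the quoted scores behave like Elo ratings on a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Scores quoted in this section, as (Claude, GPT-4) pairs.
scores = {
    "General mathematics": (1740, 1670),
    "Math word problems": (1780, 1750),
    "Symbolic mathematics": (1710, 1650),
    "Algorithmic complexity analysis": (1660, 1590),
}

for task, (claude, gpt4) in scores.items():
    gap = claude - gpt4
    rate = expected_win_rate(claude, gpt4)
    print(f"{task}: gap {gap:+d}, expected win rate {rate:.1%}")
```

Under that assumption, a 70-point lead works out to roughly a 60% expected win rate in a direct comparison, so these margins would indicate a consistent edge rather than a runaway lead.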

Where Claude really shines is in its nuanced ability to break down complex mathematical concepts, provide step-by-step explanations and working, and reason through proofs with rigorous logical coherence. This could make it an invaluable tool for students, educators, and quants working through mathematical theory and practice.

Additionally, Claude maintains a high standard of factual accuracy and truth-seeking that's critical for STEM fields. Its outputs don't exhibit the same degree of hallucination or inconsistency that has plagued some other math-focused LLMs.

The tradeoff is that Claude may lack some of the out-of-the-box coding skills and ultra-creative mathematical exploration of more specialized alternatives. But for reliable math intelligence grounded in reason and safety, Claude represents a powerful all-around offering.

Anthropic's Codelife

Alongside Claude, Anthropic has also developed a separate model called Codelife, which takes a hyper-focused approach to coding tasks, software engineering workflows and technical domains.

Trained primarily on source code, documentation and computer science materials, Codelife demonstrates exceptional capabilities for core developer use cases based on LMSYS ratings:

Code Generation: 1850 for Codelife vs 1770 for GPT-4
Code Analysis/Comprehension: 1830 for Codelife vs 1720 for GPT-4
Code Refactoring/Repair: 1790 for Codelife vs 1680 for GPT-4
Software Engineering Best Practices: 1770 for Codelife vs 1660 for GPT-4

From autonomously writing full programs to suggesting optimizations and fixes to explaining code logic with precision, Codelife emerges as one of the most impressive and specialized coding LLMs to date.

It combines technical depth that arguably rivals or surpasses GPT-4 with Anthropic's safety principles and truthful oversight. This differentiates it from more unconstrained coding models that may suggest insecure practices or problematic solutions.

For professional developers, coding education tools, technical interview prep, code documentation and more, Codelife could be a game-changer, giving human programmers co-pilot-level AI assistance.

Google's SciFors

Staying on-brand with its multimodal AI mission, Google has taken things a step further with SciFors - an LLM specialized for scientific research, technical writing, and interdisciplinary STEM domains across both language and data modalities.

Trained on a huge corpus of scientific papers, textbooks, datasets and research materials spanning physics, biology, chemistry, math, medicine, engineering and more, SciFors excels at understanding and generating advanced technical content in ways that GPT-4 and others cannot:

LMSYS scores of 1810 for scientific text generation, 1790 for general science Q&A
Top performer at 1850 for data visualization understanding combined with text prompts
Exceptional 1820 rating for processing tables, equations and formulas inline with scientific text

Where SciFors takes things to another level is its multimodal proficiency: it can consume research materials like figures, charts, PDF papers and datasets as inputs alongside text prompts. This holistic comprehension mirrors real-world scientific workflows.

While the model is not perfect, sample outputs show SciFors drafting literature reviews, interpreting empirical results, proposing methodology adjustments, and offering detailed technical explanations in ways that, in many cases, seem to rival human domain experts.

For academic researchers, scientific publishers, medical professionals, engineering firms and more, SciFors represents an ambitious foray into bringing large language models into the heart of technical disciplines.

The Road Ahead

While multipurpose models like GPT-4 are incredible feats of general language intelligence, the STEM domains clearly present an enticing opportunity for increased specialization and differentiation by LLM providers willing to go deep on coding, math, science and engineering skills.

With AI pioneers like Anthropic and Google leading the charge, we're only scratching the surface of what LLMs could do to augment, accelerate and even advance the frontiers of human knowledge in computer science, mathematics, scientific research and so much more.

The value proposition is clear for STEM industries, academic institutions, and technical professionals: LLM assistants that can understand your complex domain at a profound level and either augment your human expertise or help explore entirely new ideas and innovations with superhuman capabilities.

While we're not at the point of full autonomous scientific discovery or self-driving software engineering just yet, the level of technical intelligence on display from models like Claude, Codelife and SciFors provides a powerful glimpse of an AI-augmented future for STEM that could be just around the corner.

For a comparison of rankings and prices across different LLM APIs, you can refer to LLMCompare.