EPRI’s first-of-its-kind benchmarking effort uses a multi-phase approach to measure how leading large language models (LLMs) perform on real, utility-relevant questions. The process evaluates models across progressively more demanding assessments and will expand to include evaluations of domain-augmented tools and real-world utility use cases.
How Accurate Are Today’s Generative AI Models?
EPRI Benchmarks the Leading Models
Benchmarking AI for the Power Sector
Multiple-choice evaluations: baseline accuracy at scale
Open-ended short answers: true reasoning and clarity
Multi-run repeatability testing: model stability and confidence intervals
Domain-augmented tools and utility use cases (coming next): benchmarking retrieval-augmented generation systems, knowledge graphs, agentic workflows, and real utility tasks
This structured approach delivers transparent, repeatable insights that help utilities identify AI tools they can trust.
The Road Ahead: Applied AI and Domain-Specific Tools for Utilities
EPRI’s benchmarking lays the foundation for evaluating domain-specific augmentation tools and models that deliver increased value to utilities and the broader energy ecosystem. The next phase of benchmarking will focus on real-world pilots with member utilities that integrate AI assistants into operational workflows. These pilots will measure accuracy, trust, and real operational impact.
Stay tuned for more on EPRI.AI, a retrieval-augmented assistant built on EPRI’s proprietary research library, and the Energy DSM, a domain-specific LLM trained on power-sector datasets. Comparing these tools with general-purpose LLMs will help determine where domain-tuned solutions offer measurable benefits.
Methodology
EPRI’s benchmarking framework evaluates how leading large language models perform on real, utility-relevant questions. The study uses a structured, multi-phase process that begins with automated multiple-choice testing, progresses to open-ended short-answer evaluations, and includes multi-run repeatability testing to quantify model stability and confidence intervals.
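As a rough illustration of the automated multiple-choice phase, the Python sketch below scores one pass over a set of items. The question-file format, the answer letters, and the model_answer() client are placeholders assumed for illustration; they are not EPRI's actual harness.

# Illustrative sketch only: the file format and model_answer() client below
# are assumptions for illustration, not EPRI's benchmark implementation.
import json

def model_answer(prompt: str) -> str:
    """Placeholder for a call to the model under test via its enterprise API."""
    raise NotImplementedError("connect this to the LLM being benchmarked")

def score_multiple_choice_run(path: str) -> float:
    """Run one automated pass over a file of multiple-choice items and return accuracy."""
    with open(path) as f:
        items = json.load(f)  # assumed: [{"question": ..., "choices": [...], "answer": "B"}, ...]
    correct = 0
    for item in items:
        lettered = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"]))
        prompt = f"{item['question']}\n{lettered}\nAnswer with a single letter."
        reply = model_answer(prompt).strip().upper()[:1]
        if reply == item["answer"]:
            correct += 1
    return correct / len(items)  # baseline accuracy for this run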
All models were tested under identical automated conditions using enterprise interfaces to ensure a consistent apples-to-apples comparison. Questions were authored and reviewed by 94 subject matter experts across 35 power-system domains, creating the first rigorously constructed, domain-specific benchmark for the electric power sector.
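Because every model answers the same items under the same automated conditions, repeated runs can be summarized with simple statistics. The sketch below uses hypothetical accuracy values and a normal-approximation 95% confidence interval to show the kind of stability estimate the repeatability phase produces; it is not the study's actual statistical procedure.

# Illustrative sketch only: the run count and accuracy values are hypothetical,
# not results from the EPRI study.
from statistics import mean, stdev

def repeatability_summary(run_accuracies: list[float]) -> tuple[float, float, float]:
    """Mean accuracy across repeated runs with a normal-approximation 95% confidence interval."""
    m = mean(run_accuracies)
    half_width = 1.96 * stdev(run_accuracies) / len(run_accuracies) ** 0.5
    return m, m - half_width, m + half_width

# Hypothetical accuracies from five repeated automated runs of one model
runs = [0.84, 0.86, 0.83, 0.85, 0.84]
avg, low, high = repeatability_summary(runs)
print(f"mean accuracy {avg:.3f}, 95% CI [{low:.3f}, {high:.3f}]")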
This foundation will support future benchmarking of domain-augmented tools such as retrieval-augmented assistants, knowledge-graph-enhanced systems, and domain-specific models, as well as upcoming applied use-case evaluations with member utilities.
How Today’s Leading LLMs Perform on Power-Sector Questions
Four Big Insights
Hard Questions Reveal Gaps
Open-ended questions exposed major weaknesses: accuracy dropped by about 27 percentage points, and top models scored only 46-71% on expert-level items. Human oversight remains essential.
Strong on Multiple Choice
LLMs scored 83-86% on multiple-choice questions, similar to their results on academic tests. But the gap with the open-ended results shows that multiple-choice scores overstate real-world performance.
Open-Source Models Catching Up
Top open-source models now score about 81%, nearly matching last year’s proprietary leaders. They also allow faster iteration and secure in-house deployment.
Web Search Helps Only Slightly
Allowing web search improves scores by only 2-4% and introduces risks from incorrect or misleading content.
The Power Sector’s AI Benchmarking Hub