Large Language Model Agents in Finance: A Survey Bridging Research, Practice, and Real-World Deployment

¹University of Washington, ²University of Maryland, ³Carnegie Mellon University

*Indicates Equal Contribution

Corresponding Author

Overview of LLM-based financial agents and their collaborative workflows: Modern financial institutions rely on multiple departments—Data Analysis, Investment Research, Trading, Investment Management, and Risk Management—each handling specialized but interdependent roles. Key sub-tasks include TS (Text Summarization), NER (Named Entity Recognition), FRE (Financial Relation Extraction), EC (Event Classification), SA (Sentiment Analysis), TSF (Time Series Forecasting), SE (Strategy Execution), QA (Question Answering), FD (Fraud Detection), DRP (Default Risk Prediction), and MAC (Multi-Agent Collaboration).

Abstract

Large language models (LLMs) are increasingly applied to finance, yet challenges remain in aligning their capabilities with real-world institutional demands. In this survey, we provide a systematic, dual-perspective review bridging financial practice and LLM research. From a practitioner-centric standpoint, we introduce a functional taxonomy covering five core financial domains—Data Analysis, Investment Research, Trading, Investment Management, and Risk Management—mapping each to representative tasks, datasets, and institutional constraints. From a research-focused perspective, we analyze key modeling challenges, including numerical reasoning limitations, prompt sensitivity, and lack of real-time adaptability. We comprehensively catalog over 30 financial benchmarks and 20 representative models, and compare them across modalities, tasks, and deployment limitations. Finally, we identify open challenges and outline emerging directions such as continual adaptation, coordination-aware multi-agent systems, and privacy-compliant deployment. We emphasize deeper researcher–practitioner collaboration and transparent model architectures as critical pathways to safer and more scalable AI adoption in finance.

Table 1. Overview of LLM-based financial tasks and datasets. Rows are organized by agent (Data Analysis, Investment Research, Trading, Investment Management, Risk Management, and Multi-Agent Collaboration) and by subtask. For each subtask, the table lists key datasets and benchmarks, data modalities (text, tables, time series, structured reports), evaluation metrics, representative LLMs, and the primary limitations that matter for real-world applications.
| Agent & Subtask | Datasets & Benchmarks | Modalities (Data Types) | Key Metrics | Representative Models | Limitations |
|---|---|---|---|---|---|
| Data Analysis Agent (data processing and extraction) | | | | | |
| Text Summarization (TS) | ECT-Sum, LCFNS | Text (earnings-call transcripts, financial reports, news articles) | ROUGE, BERTScore, Num-Prec., SummaC | FinMA, FinTral, InvestLM, BloombergGPT, FinGPT | Limited structured data integration, high computational cost, lack of real-time updates. |
| Named Entity Recognition (NER) | FIN, FiNER-ORD | Text (SEC filings, financial news articles) | Precision, Recall, F1-score | FinMA, InvestLM, ICE-INTERN | Small-scale coverage, weak entity linking, limited numeric reasoning. |
| Financial Relation Extraction (FRE) | FinRED, FIRE, KPI-EDGAR | Text (EDGAR filings, earnings-call transcripts, KPI mentions) | Precision, Recall, F1-score | FinTral, ICE-INTERN, Xuanyuan 2.0 | Difficulty detecting event-based relationships, lack of domain-specific pretraining. |
| Investment Research Agent (asset evaluation and market prediction) | | | | | |
| Event Classification (EC) | FOMC, FedNLP, Headlines | Text (policy statements, news headlines, earnings-call transcripts) | Accuracy, Precision, Recall, F1-score | BloombergGPT, FinLLaMA, Temporal meets LLM, FinMA, FinGPT | No real-time market data, insufficient domain-specific pretraining. |
| Sentiment Analysis (SA) | FPB, FiQA-SA, StockEmotions | Text (news articles, microblogs, StockTwits) | Accuracy, Precision, Recall, F1-score, MSE | FinGPT, FinMA, BloombergGPT, FinLLaMA | Short-text limitations, oversimplified sentiment classification, lack of multi-modal context. |
| Time Series Forecasting (TSF) | StockNet, Bigdata22, CIKM18 | Text (tweets, microblogs), Time Series (stock prices) | Accuracy, MCC | Temporal meets LLM, FinLLaMA, FinGPT, FinMA | No real-time data, weak asset-specific feature integration. |
| Trading Agent (strategy execution and decision-making) | | | | | |
| Strategy Execution (SE) | GPT-InvestAR, FinTrade | Text (earnings reports, sentiment), Tables (historical prices) | Profitability, Sharpe Ratio (SR) | GPT-3.5-Turbo, FinBen | Narrow market coverage, lack of real-time data, overlooks portfolio diversification. |
| Support Decision-Making (SDM) | InvestorBench, STRUX, FinBen | Text (financial reports), Tables (crypto market data), Time Series (stock prices) | Cumulative Return (CR), Sharpe Ratio (SR), Annualized Volatility (AV), Maximum Drawdown (MDD) | FinMEM, STRUX, CFGPT | Narrow real-world asset coverage, over-reliance on simplistic reward signals. |
| Investment Management Agent (portfolio optimization and allocation) | | | | | |
| Question Answering (QA) | FiQA-QA, FinQA, ConvFinQA | Text (financial news, social media posts, earnings statements), Tables (S&P 500 market tables) | nDCG, MRR, Execution Accuracy, Program Accuracy | FinQANet, Alphafin, FinMA, InvestLM | Limited multi-modal support, struggles with long multi-hop reasoning. |
| Risk Management Agent (fraud detection and compliance) | | | | | |
| Fraud Detection (FD) | Credit Card Fraud, ccFraud | Text (credit card transactions), Tables (financial logs) | Accuracy, Precision, Recall, F1-score, AUC-ROC | Finbench, FinGPT, CALM | Class imbalance, evolving fraud patterns, lack of real-time tracking. |
| Default Risk Prediction (DRP) | Finbench-CD, Finbench-LD | Text (home equity loans, vehicle loans), Tables (credit card client records) | Accuracy, Precision, Recall, F1-score | Finbench, FinGPT, CALM | Highly imbalanced data, poor interpretability for credit decisions. |
| Multi-Agent Collaboration (MAC) | | | | | |
| Multi-Agent Collaboration (MAC) | FinCon, Tradingagents, Cryptoagents | Text (financial news), Tables (crypto market data), Audio (ECC recordings) | CoT Accuracy, Profitability, CR, SR, MDD | StockGPT, FinCon, Tradingagents | Lack of real-time trading support, prompt engineering sensitivity. |
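
The trading metrics named in Table 1 (Cumulative Return, Sharpe Ratio, Annualized Volatility, Maximum Drawdown) can be illustrated with a minimal Python sketch. It is not the evaluation code of any listed benchmark. The function names are hypothetical, and it assumes simple per-period returns, a sample standard deviation, and 252 trading periods per year; individual benchmarks may use different conventions.

```python
import math


def cumulative_return(returns):
    """Compound simple per-period returns into a total return (CR)."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def _sample_std(values):
    """Sample standard deviation (n - 1 denominator); needs >= 2 values."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def annualized_volatility(returns, periods_per_year=252):
    """AV: per-period standard deviation scaled by sqrt(periods per year).

    The 252-period default is an assumption (daily equity data).
    """
    return _sample_std(returns) * math.sqrt(periods_per_year)


def sharpe_ratio(returns, risk_free_per_period=0.0, periods_per_year=252):
    """SR: mean excess return over its standard deviation, annualized."""
    excess = [r - risk_free_per_period for r in returns]
    std = _sample_std(excess)
    if std == 0.0:
        return float("nan")  # undefined when returns do not vary
    mean = sum(excess) / len(excess)
    return mean / std * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """MDD: largest peak-to-trough loss of the equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


if __name__ == "__main__":
    daily = [0.10, -0.50, 0.20]  # toy returns, not real market data
    print("CR :", cumulative_return(daily))
    print("AV :", annualized_volatility(daily))
    print("SR :", sharpe_ratio(daily))
    print("MDD:", max_drawdown(daily))
```

Reporting MDD and AV next to CR and SR matters because two strategies with the same cumulative return can differ sharply in the losses an investor must sit through along the way.
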
Table 2. Comprehensive overview of representative financial datasets. The table summarizes key characteristics, including raw data size, collection period, data sources, and license types, of datasets used by various LLM-based agents in finance.
| Agent & Subtask | Dataset | Raw Data Size | Collection Period | Source | License |
|---|---|---|---|---|---|
| Data Analysis Agent | ECT-Sum | 2,425 document-summary pairs | Jan 2019 - Apr 2022 | Earnings call transcripts, Reuters articles | GPL-3.0 license |
| | LCFNS | 430,820 news-summary pairs | Jan 2013 - Jun 2020 | Major financial portals | Public |
| | FIN | 54,256 words (8 annotated agreements) | - | U.S. SEC filings | None Public |
| | FiNER-ORD | 201 financial news articles, 4,739 sentences | Jul 2015 - Oct 2015 | Webz.io | CC BY-NC 4.0 |
| | FinRED | 7,775 sentences, 29 relation types | Jul 2015 - Oct 2015, Jun 2019 - Sep 2019 | Financial news articles, earnings calls | Public |
| | FIRE | 3,025 instances, 18 relation types | 1993 - 2021 | Financial news articles, SEC filings | CC BY 4.0 |
| | KPI-EDGAR | 1,355 sentences | - | EDGAR database annual reports | MIT license |
| Investment Research Agent | FOMC | 214 minutes, 1,026 speeches, 63 transcripts | 1996 - 2022 | Federal Open Market Committee communications | CC BY-NC 4.0 |
| | FedNLP | 1000+ speeches, 100+ press conferences | Jan 2015 - Jul 2020 | Federal Reserve communications | Public |
| | Headlines | 11,412 annotated news headlines | 2000 - 2019 | Gold commodity market | CC BY-NC-ND 4.0 |
| | FPB | 4,840 sentences | - | Financial news articles | CC BY-SA 3.0 |
| | FiQA-SA | 529 annotated headlines and 774 financial microblogs | - | Financial news and social media | CC-BY-3.0 |
| | StockEmotions | 10,000 investor comments, 12 emotions | Jan 2020 - Dec 2020 | StockTwits | Public |
| | StockNet | 26,614 price movement data of 88 stocks | 2014 - 2016 | S&P 500 stocks, StockTwits | MIT license |
| | Bigdata22 | 7,164 tweets | 2014 - 2015 | S&P 500 stocks | Public |
| | CIKM18 | 47 stocks from S&P 500 | Jan 2017 - Nov 2017 | Yahoo Finance, Twitter | Public |
| Trading Agent | GPT-InvestAR | 10-K filings with 24,200 documents | 2002 - 2023 | Annual SEC report filings | MIT license |
| | FinTrade | 16,137 news articles, 65 10-K/10-Q files, 4,970 price data from 10 stocks | One year period | Stock prices, SEC filings, news | MIT license |
| | InvestorBench | 5,000 stock prices, 2,000 earnings reports, 50,000 cryptocurrency articles | 2019 - 2023 | Yahoo Finance, CoinMarketCap, CryptoPotato, CoinTelegraph | MIT license |
| | STRUX | 11,950 quarterly earnings call transcripts | 2017 - 2024 | Motley Fool website, NASDAQ 500 and S&P 500 stocks | Public |
| Risk Management Agent | Credit Card Fraud | 11,392 transactions | 2013 | European cardholders | DbCL v1.0 |
| | ccFraud | 10,485 transactions | 2013 | European cardholders | Public |
| | Finbench-CD | 30,000 credit records | Apr - Sep 2005 | Credit card clients in Taiwan | CC BY-NC 4.0 |
| | Finbench-LD | 10,000 credit records, 200,000 vehicle loan records | - | Loan records | CC BY-NC 4.0 |
| Multi-Agent Collaboration | FinCon | Data size not specified | Aug 2020 - Aug 2023 | Yahoo Finance, Form 10-Q, Form 10-K, Zacks Rank, Earnings conference calls | CC BY-NC 4.0 |
| | Tradingagents | Data size not specified | Jan - Mar 2024 | S&P 500 stocks, Bloomberg, Yahoo, Reddit, Twitter | None Public |
| | Cryptoagents | Top 30 cryptocurrency data | Jun 2023 - Sep 2024 | Blockchain.info, Coin Metrics, Cointelegraph | None Public |
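
Table 1 lists class imbalance as a limitation for the risk-management tasks and reports Precision, Recall, F1-score, and MCC alongside Accuracy. The Python sketch below is illustrative only: it uses hypothetical function names and made-up toy labels, with no data from the datasets above. It shows why accuracy alone can mislead when the positive class (for example, fraud) is rare.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy example: 2 positives in 100 samples, and a model that always
    # predicts the negative class.
    y_true = [0] * 98 + [1] * 2
    y_pred = [0] * 100
    counts = confusion_counts(y_true, y_pred)
    print("accuracy:", accuracy(*counts))
    print("precision, recall, F1:", precision_recall_f1(counts[0], counts[1], counts[3]))
    print("MCC:", mcc(*counts))
```

In this toy case the always-negative model is right on 98 of 100 samples, yet it finds none of the positives, so recall, F1, and MCC all come out at zero.
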
