This repository develops reasoning models for the financial domain, aiming to enhance models' capabilities on financial reasoning tasks.

🚀 Fino1: On the Transferability of Reasoning-Enhanced LLMs to Finance

🤗 Training Data | 🤗 Training Data version 2 | 📄 arXiv | 🤖 8B Model | 🤖 14B Model | 🏆 Leaderboard


📈 Overview

📂 Datasets Used

We used three evaluation datasets to assess the performance of our Fino1 models.

| Dataset | Description |
| --- | --- |
| FinQA | A large-scale dataset for numerical reasoning in finance, featuring expert-annotated QA pairs that require integrating structured and unstructured data from financial reports while handling complex domain-specific terminology. |
| DocMath | DocMath-Eval is a benchmark for evaluating LLMs' numerical reasoning over long specialized documents and tables; its simplong subset focuses on reasoning across multi-tiered financial or specialized tables within extended contexts. |
| XBRL-Math | Evaluates LLMs' numerical reasoning in XBRL filings, requiring models to interpret structured financial data, US GAAP XBRL tags, equations, and hierarchical numerical relationships for accurate financial analysis. |

πŸ† Models Evaluated

We compared our Fino1 models against 24 state-of-the-art large language models (LLMs).

| Model | Description |
| --- | --- |
| GPT-4o | OpenAI's versatile, high-intelligence flagship model. It accepts text and image inputs and produces text outputs (including Structured Outputs). |
| GPT-o1 | The o1 series of models are trained with reinforcement learning to perform complex reasoning; o1 models think before they answer, producing a long internal chain of thought before responding to the user. |
| GPT-o3-mini | OpenAI's most recent small reasoning model, providing high intelligence at the same cost and latency targets as o1-mini, with support for key developer features such as Structured Outputs, function calling, and the Batch API. |
| GPT-4.5 | The largest and most capable GPT model yet. Its deep world knowledge and better understanding of user intent make it good at creative tasks and agentic planning; it excels at tasks that benefit from creative, open-ended thinking and conversation, such as writing, learning, or exploring new ideas. |
| DeepSeek-V3 | A 671B Mixture-of-Experts (MoE) model with 37B active parameters per token, leveraging Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient training and inference, achieving state-of-the-art performance comparable to closed-source models with stable and cost-effective training. |
| DeepSeek-R1 | DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models; DeepSeek-R1 incorporates cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while its open-sourced distilled models set new benchmarks for dense models. |
| Qwen2.5-7B-Instruct | Part of Qwen2.5, the latest series of Qwen LLMs, offering models from 0.5B to 72B parameters with improved knowledge, coding, math, instruction following, structured data handling, long-context support (up to 128K tokens), and multilingual capabilities across 29+ languages. |
| Qwen2.5-14B-Instruct | Qwen2.5 series (see Qwen2.5-7B-Instruct above). |
| Qwen2.5-32B-Instruct | Qwen2.5 series (see Qwen2.5-7B-Instruct above). |
| Qwen2.5-72B-Instruct | Qwen2.5 series (see Qwen2.5-7B-Instruct above). |
| QwQ-32B | The reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ, which is capable of thinking and reasoning, achieves significantly enhanced performance on downstream tasks, especially hard problems. |
| Qwen2.5-Math-72B-Instruct | An upgraded open-source mathematical LLM supporting both Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR) for solving math problems in Chinese and English, offering significant performance improvements over Qwen2-Math. |
| DeepSeek-R1-Distill-Llama-70B | Open-sourced dense model distilled from DeepSeek-R1 (see DeepSeek-R1 above). |
| Llama3-70B-Instruct | Part of Meta's Llama 3 family of 8B and 70B LLMs, optimized for dialogue, outperforming many open-source chat models while prioritizing helpfulness and safety. |
| Llama3.1-70B-Instruct | Part of the Meta Llama 3.1 collection of multilingual LLMs (8B, 70B, 405B) optimized for multilingual dialogue, outperforming many open-source and closed chat models on industry benchmarks. |
| Llama3.3-70B-Instruct | A 70B instruction-tuned multilingual LLM optimized for dialogue, outperforming many open-source and closed chat models on industry benchmarks. |
| DeepSeek-R1-Distill-Qwen-32B | Open-sourced dense model distilled from DeepSeek-R1 (see DeepSeek-R1 above). |
| DeepSeek-R1-Distill-Qwen-14B | Open-sourced dense model distilled from DeepSeek-R1 (see DeepSeek-R1 above). |
| DeepSeek-R1-Distill-Llama-8B | Open-sourced dense model distilled from DeepSeek-R1 (see DeepSeek-R1 above). |
| Llama3-8B-Instruct | Llama 3 family (see Llama3-70B-Instruct above). |
| Llama3.1-8B-Instruct | Llama 3.1 collection (see Llama3.1-70B-Instruct above). |
| LIMO | Challenges the conventional wisdom in mathematical reasoning by demonstrating that models can achieve superior performance with significantly less but higher-quality training data. |
| s1-32B | A reasoning model finetuned from Qwen2.5-32B-Instruct on just 1,000 examples; it matches o1-preview and exhibits test-time scaling via budget forcing. |
| FinR1-7B | Fin-R1 is a large language model designed for complex reasoning tasks in the financial domain, developed and open-sourced jointly by the Financial Large Language Model Research Group (SUFE-AIFLM-Lab) at the School of Statistics and Data Science of Shanghai University of Finance and Economics, in collaboration with Fintopia. Based on Qwen2.5-7B-Instruct and fine-tuned on high-quality, verifiable financial reasoning questions, it has achieved state-of-the-art performance across multiple financial benchmarks. |

🧩 Reasoning Path Building

For the reasoning path building and training, we took inspiration from HuatuoGPT-o1.

We release the reasoning path here: https://huggingface.co/datasets/TheFinAI/Fino1_Reasoning_Path_FinQA
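
The dataset can be loaded directly from the Hugging Face Hub, for example:

```python
# Load the released reasoning-path dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("TheFinAI/Fino1_Reasoning_Path_FinQA")
print(ds)  # inspect the available splits and columns
```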

πŸ—οΈ How to Train Fino1

Following HuatuoGPT-o1, we applied a two-stage approach to train our Fino1 models:

  • Stage 1: Supervised Fine-Tuning (SFT) (see the SFT sketch below)

  • Stage 2: Reinforcement Learning (RL)
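
For stage 1, here is a minimal SFT sketch using trl's SFTTrainer. The base model id, output directory, and the column names used to build the text field are assumptions for illustration, not the repo's actual training script.

```python
# A minimal stage-1 SFT sketch with trl's SFTTrainer.
# Assumptions (not the repo's actual script): the base model id, the output
# directory, and the column names used to build the "text" field.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("TheFinAI/Fino1_Reasoning_Path_FinQA", split="train")

def to_text(example):
    # Hypothetical column names; adapt to the dataset's actual schema.
    return {"text": f"Question: {example['question']}\nAnswer: {example['response']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed base model
    args=SFTConfig(output_dir="fino1-sft"),
    train_dataset=dataset,
)
trainer.train()
```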

We provide a simple PPO script using the trl library. Below is an example of training an 8B model with PPO on an 8-GPU A100 machine. Make sure to first download the medical verifier to serve as the reward model.
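
A minimal sketch of the PPO loop with trl's classic PPOTrainer API (trl <= 0.11); the checkpoint path, toy prompts, and constant reward are placeholders, and the real setup scores each response with the verifier model:

```python
# Minimal PPO sketch using trl's classic PPOTrainer API (trl <= 0.11).
# Placeholders (not the repo's actual script): the SFT checkpoint path,
# the toy prompts, and the constant reward standing in for the verifier.
import torch
from datasets import Dataset
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(
    model_name="fino1-sft",  # stage-1 SFT checkpoint (placeholder path)
    learning_rate=1.41e-5,
    batch_size=2,
    mini_batch_size=1,
)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)

# Toy prompts; in practice these are financial reasoning questions.
dataset = Dataset.from_dict({"query": [
    "What is the year-over-year revenue growth?",
    "Compute the firm's net profit margin.",
]})
dataset = dataset.map(lambda x: {"input_ids": tokenizer.encode(x["query"])})
dataset.set_format(type="torch")

def collator(data):
    # Keep variable-length sequences as lists instead of stacking them.
    return {key: [d[key] for d in data] for key in data[0]}

ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer,
                         dataset=dataset, data_collator=collator)

for batch in ppo_trainer.dataloader:
    query_tensors = list(batch["input_ids"])
    # Sample responses from the current policy.
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, max_new_tokens=128
    )
    # Placeholder reward: the real pipeline scores each response with the verifier.
    rewards = [torch.tensor(1.0) for _ in response_tensors]
    ppo_trainer.step(query_tensors, response_tensors, rewards)
```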

Please check HuatuoGPT-o1 for more training details.

🎯 Evaluation of all models

Inference: Local Models

Inference for local models is conducted using FinBen with the vLLM framework.
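
Outside of FinBen, a minimal standalone vLLM call looks like the following; the Hub model id is assumed and the prompt is a placeholder:

```python
# Minimal standalone vLLM inference sketch (not FinBen's actual harness).
from vllm import LLM, SamplingParams

llm = LLM(model="TheFinAI/Fino1-8B")  # assumed Hub id of the released 8B model
params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(["<financial reasoning question here>"], params)
print(outputs[0].outputs[0].text)
```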

Inference: API Models

For API-based models, evaluation is performed using the query_llm.py script.
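
The core of such a script is a plain chat-completion request; here is a hedged sketch (the model name and prompt are placeholders, not query_llm.py's actual code):

```python
# Hedged sketch of an API-model query (not the actual query_llm.py).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder; swap in the API model under evaluation
    messages=[{"role": "user", "content": "<financial reasoning question here>"}],
)
print(resp.choices[0].message.content)
```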

Evaluation

For the final evaluation, we followed DocMath-Eval: GPT first extracts the final answer from each model's output, and the extracted answer is then checked for correctness.
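
The correctness check amounts to a numeric comparison on the extracted value; a minimal sketch (the relative tolerance is an assumption, not DocMath-Eval's exact setting):

```python
# Minimal grading sketch: compare an extracted numeric answer to the gold
# value within a relative tolerance (the tolerance value is an assumption).
import math

def is_correct(pred: float, gold: float, rel_tol: float = 1e-2) -> bool:
    return math.isclose(pred, gold, rel_tol=rel_tol)

print(is_correct(72.49, 72.5))  # True within 1% relative tolerance
```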


Key Results

📊 Performance of Different LLMs on Financial Datasets

| Model | FinQA | DocMath-Simplong | XBRL-Math | DocMath-Complong | Average |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | 72.49 | 60.00 | 72.22 | 39.33 | 61.01 |
| GPT-o1-preview | 49.07 | 56.00 | 74.44 | 36.67 | 54.05 |
| GPT-o3-mini | 60.87 | 59.00 | 76.67 | 35.00 | 57.89 |
| DeepSeek-V3 | 73.20 | 53.00 | 76.67 | 42.33 | 61.30 |
| DeepSeek-R1 | 65.13 | 53.00 | 86.67 | 38.67 | 60.87 |
| GPT-4.5 | 68.94 | 59.00 | 74.44 | 39.33 | 60.43 |
| Meta-Llama-3-70B-Instruct | 58.92 | 41.00 | 56.67 | 13.67 | 42.57 |
| Llama-3.1-70B-Instruct | 63.18 | 48.00 | 63.33 | 34.33 | 52.21 |
| Llama-3.3-70B-Instruct | 68.15 | 54.00 | 70.00 | 32.00 | 56.04 |
| Qwen2.5-72B-Instruct | 73.38 | 59.00 | 67.78 | 14.67 | 53.71 |
| Qwen2.5-Math-72B-Instruct | 69.74 | 42.00 | 83.33 | 5.00 | 50.02 |
| DeepSeek-R1-Distill-Llama-70B | 66.73 | 53.00 | 86.67 | 30.67 | 59.27 |
| Qwen2.5-32B-Instruct | 73.11 | 56.00 | 65.56 | 30.00 | 56.17 |
| QwQ-32B | 61.22 | 46.00 | 84.44 | 20.00 | 52.92 |
| DeepSeek-R1-Distill-Qwen-32B | 65.48 | 55.00 | 84.44 | 24.67 | 57.40 |
| LIMO | 63.44 | 45.00 | 61.11 | 15.33 | 46.22 |
| s1-32B | 66.81 | 53.00 | 84.44 | 24.00 | 57.06 |
| Qwen2.5-14B-Instruct | 67.44 | 59.00 | 57.78 | 26.67 | 52.72 |
| DeepSeek-R1-Distill-Qwen-14B | 63.27 | 44.00 | 84.44 | 21.00 | 53.18 |
| DeepSeek-R1-Distill-Llama-8B | 45.96 | 33.00 | 81.11 | 15.67 | 43.94 |
| Meta-Llama-3-8B-Instruct | 41.97 | 29.00 | 48.89 | 6.00 | 31.47 |
| Llama-3.1-8B-Instruct | 54.13 | 34.00 | 62.22 | 14.30 | 41.16 |
| Qwen2.5-7B-Instruct | 55.37 | 41.00 | 42.22 | 17.67 | 39.07 |
| FinR1-7B | 58.74 | 37.00 | 30.00 | 13.67 | 34.85 |
| Fino1-8B | 60.87 | 40.00 | 82.22 | 20.00 | 50.77 |
| Fino1-14B | 70.01 | 60.00 | 86.67 | 24.33 | 60.25 |

πŸ› οΈ Updates

  • [2025/02/12] 🎉 We trained the Fino1-8B model and evaluated its performance.

  • [2025/03/30] 🎉 We trained the Fino1-14B model and evaluated its performance.

📄 Citation

If you find our work useful, please cite our paper:

BibTeX:

@misc{qian2025fino1transferabilityreasoningenhanced,
      title={Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance}, 
      author={Lingfei Qian and Weipeng Zhou and Yan Wang and Xueqing Peng and Jimin Huang and Qianqian Xie},
      year={2025},
      eprint={2502.08127},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08127}, 
}
