LLMs in the Financial World and Internet-Scale Financial Data

The demos are shown in FinGPT, and the data sources and supporting code are in FinNLP.

For the Chinese version, click here.

Disclaimer: We are sharing code for academic purposes under the MIT education license. Nothing herein is financial advice, and it is NOT a recommendation to trade real money. Please use common sense and always consult a professional before trading or investing.

Ⅰ. Architecture

The project is made up of four parts:
The first part is the Data Source, where we gather historical and streaming data from the Internet.
Next, we push the data to the Data Engineering part, where we clean the data, tokenize it, and do the prompt engineering.
Then, the data is pushed to the LLMs, which we may use in different ways. We can use the collected data to train our own lightweight fine-tuned models, and we can also use the data together with the trained models or LLM APIs to support our applications.
The last part is the Application part, where we use the data and LLMs to build many interesting applications.
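The four stages above can be sketched as a small pipeline. This is a minimal illustration, not the project's actual code: the `Document` class, the function names, the prompt wording, and `dummy_llm` (a stand-in for a fine-tuned model or an LLM API call) are all hypothetical, and the tokenization step is left to the model for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Document:
    source: str  # hypothetical source tag, e.g. "news" or "social"
    text: str    # raw text as collected


def clean(text: str) -> str:
    """Data engineering: strip HTML-like tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_prompt(text: str) -> str:
    """Prompt engineering: wrap the cleaned text in an instruction."""
    return (
        "Classify the sentiment of this financial text as "
        "positive, negative, or neutral.\n"
        f"Text: {text}\nAnswer:"
    )


def run_pipeline(docs: Iterable[Document], llm: Callable[[str], str]) -> List[str]:
    """Data source -> data engineering -> LLM -> output for an application."""
    return [llm(build_prompt(clean(doc.text))) for doc in docs]


def dummy_llm(prompt: str) -> str:
    """Stand-in for a real model call; a toy keyword rule, assumed for the demo."""
    return "positive" if "beats" in prompt.lower() else "neutral"


docs = [
    Document("news", "<p>Apple   beats earnings estimates</p>"),
    Document("social", "Market closed for the holiday"),
]
print(run_pipeline(docs, dummy_llm))
```

Passing the model as a plain callable keeps the data stages independent of whether the LLM is a local fine-tuned model or a remote API.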

Ⅱ. Data Sources

Due to space limitations, we only show a few of them.

1. News

Platform	Data Type	Related Market	Specified Company	Range Type	Source Type	Limits	Docs (1e4)	Support
Yahoo	Financial News	US Stocks	√	Date Range	Official	N/A	1,500+	√
Reuters	Financial News	US Stocks	×	Date Range	Official	N/A	1,500+	√
Sina	Financial News	CN Stocks	×	Date Range	Official	N/A	2,000+	√
Eastmoney	Financial News	CN Stocks	√	Date Range	Official	N/A	1,000+	√
Yicai	Financial News	CN Stocks	√	Date Range	Official	N/A	500+	Soon
CCTV	Government News	CN Stocks	×	Date Range	Third party	N/A	4	√
US Mainstream	Financial News	US Stocks	√	Date Range	Third party	Account (Free)	3,200+	√
CN Mainstream	Financial News	CN Stocks	×	Date Range	Third party	￥500/year	3,000+	√

Although FinGPT may have fewer docs than BloombergGPT, we are on the same order of magnitude.

2. Social Media

Platform	Data Type	Related Market	Specified Company	Range Type	Source Type	Limits	Docs (1e4)	Support
Twitter	Tweets	US Stocks	√	Date Range	Official	N/A	18,000+	√
StockTwits	Tweets	US Stocks	√	Latest	Official	N/A	160,000+	√
Reddit (wallstreetbets)	Threads	US Stocks	×	Latest	Official	N/A	9+	√
Weibo	Tweets	CN Stocks	√	Date Range	Official	Cookies	1,400,000+	√
Weibo	Tweets	CN Stocks	√	Latest	Official	N/A	1,400,000+	√

BloombergGPT does not collect social media data, but we believe that public opinion is one of the most important factors influencing the stock market.

3. Company Announcement

Platform	Data Type	Related Market	Specified Company	Range Type	Source Type	Limits	Docs (1e4)	Support
Juchao (Official Website)	Text	CN Stocks	√	Date Range	Official	N/A	2,790+	√
SEC (Official Website)	Text	US Stocks	√	Date Range	Official	N/A	1,440+	√

Since we collect data from different stock markets, we have more filing docs than BloombergGPT.

4. Trends

Platform	Data Type	Related Market	Data Source	Specified Company	Range Type	Source Type	Limits
Google Trends	Index	US Stocks	Google Trends	√	Date Range	Official	N/A
Baidu Index	Index	CN Stocks	Soon	-	-	-	-

5. Data Sets

Data Source	Type	Stocks	Dates	Available
AShare	News	3680	2018-07-01 to 2021-11-30	√
stocknet-dataset	Tweets	87	2014-01-02 to 2015-12-30	√
CHRNN	Tweets	38	2017-01-03 to 2017-12-28	√

Ⅲ. Models

In data-centric NLP, we don’t train the model from scratch. We only call APIs and do lightweight fine-tuning.
The left part shows some LLM APIs that we may use, the middle part shows the models that we may fine-tune, and the right part shows some of the fine-tuning methods.

1. Fine-tuning: Tensor Layers (LoRA)

In FinGPT, we fine-tune a pre-trained LLM on a new financial dataset. High-quality labeled data is one of the most important keys to many successful LLMs, including ChatGPT.
However, such high-quality labeled data is often very expensive and time-consuming to produce, and we may need help from professional finance experts.
If our goal is to use LLMs to analyze finance-related text data and help with quantitative trading, why not let the market do the labeling for us?
So here, we use the percentage change in the related stock price for each news item as the output label. We use a threshold to split the labels into three groups (positive, negative, and neutral) and use them as the label of the news sentiment.
Correspondingly, in the prompt engineering part we also ask the model to select one of positive, negative, and neutral as the output, so that we make the best use of the pre-trained information.
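The market-based labeling described above can be sketched as follows. This is only an illustration under stated assumptions: the 2% threshold, the function names, and the prompt wording are hypothetical, since the text does not give the actual threshold or prompt.

```python
from typing import Tuple


def label_from_return(pct_change: float, threshold: float = 0.02) -> str:
    """Map a price change (as a fraction, e.g. 0.03 for +3%) to a sentiment label.

    The default threshold of 0.02 is an assumed value for illustration.
    """
    if pct_change > threshold:
        return "positive"
    if pct_change < -threshold:
        return "negative"
    return "neutral"


def build_example(headline: str, pct_change: float) -> Tuple[str, str]:
    """Return a (prompt, target) pair for fine-tuning.

    The prompt restricts the answer to the same three labels used as targets.
    """
    prompt = (
        "What is the sentiment of this news? "
        "Choose one of: positive, negative, neutral.\n"
        f"News: {headline}\nAnswer:"
    )
    return prompt, label_from_return(pct_change)


# Made-up headline and price move, for demonstration only.
prompt, target = build_example("Company X raises full-year guidance", 0.045)
print(target)
```

Keeping the prompt's answer choices identical to the target labels means the fine-tuned model only has to learn a mapping onto words it already knows.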
By using LoRA, we can reduce the trainable parameters from 6.17B to 3.67M.
As the table presents, compared with ChatGLM, FinGPT achieves large improvements on multiple metrics. It may be inappropriate to use our model for quantitative trading directly. Since most news titles are neutral, most of the original outputs of the LLMs are neutral, so the LLMs perform poorly on positive and negative labels, and those labels are what might be useful in quantitative trading.
However, after fine-tuning, we have witnessed huge improvements in the prediction of positive and negative labels.
That’s also why the model can achieve positive trading results.
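To see where a parameter reduction of this size can come from, the arithmetic below counts LoRA's trainable parameters. LoRA freezes each adapted weight matrix and trains two small matrices of rank r, which adds r × (d_in + d_out) parameters per matrix. The configuration used here is an assumption, not something stated in the text: a ChatGLM-6B-like shape (hidden size 4096, 28 layers, a fused query/key/value projection of width 3 × 4096), rank 8, and only that projection adapted. Under those assumptions the count comes to about 3.67M.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, num_layers: int) -> int:
    """Count LoRA parameters when one (d_out x d_in) matrix per layer is adapted.

    Each adapted matrix gets A (rank x d_in) and B (d_out x rank).
    """
    return num_layers * rank * (d_in + d_out)


# Assumed configuration (not given in the text).
hidden_size = 4096
qkv_width = 3 * hidden_size   # fused query/key/value projection
rank = 8
num_layers = 28

trainable = lora_trainable_params(hidden_size, qkv_width, rank, num_layers)
full_model = 6.17e9           # total parameter count quoted in the text

print(trainable)                                  # 3670016
print(f"{trainable / full_model:.4%} of the full model")
```

Under these assumptions, only about 0.06% of the model's parameters are updated during fine-tuning.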

2. Fine-tuning: Reinforcement Learning on Stock Prices (RLSP)

In the same way, we may use RL on Stock Prices (RLSP) to replace the RL on human feedback used by ChatGPT.
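The text does not specify how RLSP is implemented. One way to picture the idea is a reward function that scores the model's predicted sentiment against the realized price move, taking the place of a human preference score. The sketch below is purely hypothetical: the function name, the reward values, and the 2% threshold are assumptions.

```python
def price_reward(predicted: str, pct_change: float, threshold: float = 0.02) -> float:
    """Hypothetical reward: compare a predicted label with the realized price move.

    +1.0 if the prediction matches the realized direction,
    -1.0 if it points the opposite way,
     0.0 if exactly one of the two is neutral.
    """
    if pct_change > threshold:
        realized = "positive"
    elif pct_change < -threshold:
        realized = "negative"
    else:
        realized = "neutral"

    if predicted == realized:
        return 1.0
    if "neutral" in (predicted, realized):
        return 0.0
    return -1.0


# Made-up predictions and price moves, for demonstration only.
for predicted, move in [("positive", 0.05), ("negative", 0.05), ("neutral", 0.05)]:
    print(predicted, move, price_reward(predicted, move))
```

A reward of this kind could then be fed to a policy-gradient style update, with the market outcome taking the role that human feedback plays in RLHF.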

Ⅳ. Applications

1. Robo Advisor

ChatGPT can give investment advice just like a pro.
In this example, the rising stock price of Apple is in accordance with ChatGPT’s prediction, which was made by analyzing news.

2. Quantitative Trading

We may also use news, social media tweets, or filings to build sentiment factors. The right part shows the trading results obtained using only the signal from Twitter tweets and ChatGPT; the data is from a data set called stocknet-dataset.
As you may see from the picture, the trading signals generated by ChatGPT are good enough that we may achieve good results just by trading according to Twitter sentiment factors.
So we may achieve even better results by combining them with price factors.
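A sentiment factor of the kind described above can be sketched as a daily average of labeled tweets that is turned into a long/short signal. This is a toy illustration, not the strategy behind the reported results: the score mapping, the 0.3 threshold, the function names, and all data below are made up.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

SCORE = {"positive": 1, "neutral": 0, "negative": -1}


def daily_factor(labeled_tweets: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Average sentiment score per date from (date, label) pairs."""
    by_day = defaultdict(list)
    for date, label in labeled_tweets:
        by_day[date].append(SCORE[label])
    return {date: mean(scores) for date, scores in by_day.items()}


def signal(factor: float, threshold: float = 0.3) -> int:
    """+1 (long), -1 (short), or 0 (flat), using an assumed threshold."""
    if factor > threshold:
        return 1
    if factor < -threshold:
        return -1
    return 0


def backtest(factors: Dict[str, float], next_day_returns: Dict[str, float]) -> float:
    """Sum of position * next-day return; ignores costs and compounding."""
    return sum(
        signal(factors[date]) * next_day_returns.get(date, 0.0)
        for date in sorted(factors)
    )


# Synthetic labels and returns, for demonstration only.
tweets = [
    ("2015-01-02", "positive"), ("2015-01-02", "positive"), ("2015-01-02", "neutral"),
    ("2015-01-05", "negative"), ("2015-01-05", "negative"),
    ("2015-01-06", "neutral"),
]
returns = {"2015-01-02": 0.01, "2015-01-05": -0.02, "2015-01-06": 0.005}

factors = daily_factor(tweets)
print(factors)
print(backtest(factors, returns))
```

Combining this with price factors, as suggested above, could be as simple as adding a second input to `signal`, though how the factors are actually combined is not described in the text.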

3. Low-code development

We can use the help of LLMs to write code.
The right part shows how we can develop our factors and other code quickly and efficiently.