The data used in this paper are obtained from the CSMAR database (China Stock Market & Accounting Research Database) and cover the CSI 300 constituent stocks from January 1, 2016, to December 31, 2020. The stock news data come from the CSMAR database and from internet crawling, which was performed via aiqicha and qichacha. The total number of texts for the 300 stocks over this period is 1,807,207 (2016: 79,091; 2017: 180,868; 2018: 284,679; 2019: 445,356; 2020: 817,213). The database news texts are classified as domestic finance, overseas finance, securities market, Asia-Pacific stock market, European and American market, individual stock news, and industry information, and are drawn from Xinhua, 21st Century Business Herald, Economic Observer, Yangzi Evening News, and other media. The data are available via a shared link (password: 9piq).
Table 2 gives an overall description of the transaction data and news texts, including the number of trading days and the number of news texts. Table 3 gives descriptive statistics of the stock market transaction data, and Figure 6 shows the distribution of the news text data. Figure 6(a) shows the number of news texts per year; with the development of technology, this number increases year by year. Figure 6(b) shows the classification of the news text data, which follows the news classification of the CSMAR database. The texts are divided into 7 categories: domestic finance and economics (DFE), overseas finance and economics (OFE), securities market (SM), Asia-Pacific market (APM), European and American markets (EAM), individual stock news (ISN), and industry information (II).
Figure 6(c) shows the distribution of news sources. The top 10 sources are 21st Century Business Herald (21st), Daily Business News (DBN), Securities Daily (SD), Securities Times (ST), Shanghai Securities News (SSN), Economic Information Daily (EID), Securities Times Website (STW), Beijing Business Today (BBT), First Financial Daily (FFD), and Wall Street News (WSN). The classification and sources of the news texts in Figure 6(b) and 6(c) indicate that the news text data, which are the core experimental data of this paper, are reasonably distributed and that the news coverage is relatively comprehensive.
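As a quick consistency check on the yearly text counts reported above, the following minimal sketch sums them, compares the sum with the reported total, and computes each year's share and the year-over-year growth ratio. The variable names are illustrative assumptions and are not taken from the paper's code; only the counts themselves come from the text.

```python
# Yearly news-text counts as reported in the text (2016-2020).
yearly_counts = {
    2016: 79091,
    2017: 180868,
    2018: 284679,
    2019: 445356,
    2020: 817213,
}
reported_total = 1807207  # total number of texts stated in the text

# Sum of the per-year counts, to compare against the reported total.
total = sum(yearly_counts.values())

# Fraction of the whole corpus contributed by each year.
shares = {year: count / total for year, count in yearly_counts.items()}

# Year-over-year growth ratio: count in year y divided by count in year y-1.
years = sorted(yearly_counts)
growth = {
    curr: yearly_counts[curr] / yearly_counts[prev]
    for prev, curr in zip(years, years[1:])
}

if __name__ == "__main__":
    print("sum matches reported total:", total == reported_total)
    for year in years:
        ratio = growth.get(year)
        ratio_text = f"{ratio:.2f}x" if ratio is not None else "n/a"
        print(f"{year}: share={shares[year]:.1%}, growth={ratio_text}")
```

Under these figures the per-year counts add up to the reported total, every growth ratio exceeds 1, and 2020 alone contributes roughly 45% of all texts, which is consistent with the year-by-year increase described for Figure 6(a).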