figshare

Dataset and News Sentiments for NSEI Stock Market Prediction

dataset
posted on 2025-09-17, 17:38 authored by Akshat Ranjan
<p dir="ltr">This dataset contains the financial time series and corresponding news sentiment data used in the paper "Stock Price Prediction with Limited Historical Data: A Model-Driven Exploration". The study's primary objective was to evaluate and compare four forecasting models (ARIMA, Random Forest (RF), LSTM, and a hybrid LSTM + FinBERT) under the constraint of data scarcity.</p>
<h3>Data Sources</h3>
<ul>
<li><b>Primary Stock Data</b>: Historical daily price data for the NIFTY 50 index (Yahoo Finance ticker <code>^NSEI</code>), obtained from Yahoo Finance. Each trading day has 'Open', 'High', 'Low', 'Close', and 'Volume' values. The 'Close' price is the target variable for prediction.</li>
<li><b>News Sentiment Data</b>: Dated financial news headlines from a local <code>news.csv</code> file, used to generate sentiment scores.</li>
</ul>
<h3>Dataset Composition and Feature Engineering</h3>
<p dir="ltr">To give the models a richer feature set, the raw data was augmented as follows:</p>
<ul>
<li><b>Engineered Features</b>: The dataset includes technical indicators (RSI, MACD, Bollinger Bands), lagged price and return features, and rolling window statistics (e.g., moving averages).</li>
<li><b>Sentiment Scores</b>: The FinBERT model was used to process the news headlines and generate a daily sentiment score, which was included as an additional feature.</li>
<li><b>Final Structure</b>: The final dataset is intentionally small to simulate real-world data constraints, consisting of <b>60 rows and 29 columns</b>.</li>
</ul>
<h3>Experimental Setup and Usage</h3>
<p dir="ltr">The data was split chronologically, with the first 80% of rows used for model training and the last 20% for evaluation:</p>
<ul>
<li><b>Training/Validation Set</b>: The first 48 rows.</li>
<li><b>Test Set</b>: The final 12 rows, held out for performance evaluation.</li>
</ul>
<p dir="ltr">The data was formatted specifically for each model:</p>
<ul>
<li><b>ARIMA</b>: Used only the univariate 'Close' price time series from the training data.</li>
<li><b>Random Forest</b>: Used a tabular format in which all 28 engineered features (including indicators, lags, and sentiment) predict the 'Close' price.</li>
<li><b>LSTM &amp; Hybrid LSTM+FinBERT</b>: The data was transformed into 3D sequences (samples × time steps × features) with a lookback window of 16 days to predict the subsequent day's price. The hybrid model additionally included the sentiment score as an input feature within these sequences. For these neural network models, all numerical features were scaled.</li>
</ul>
<p dir="ltr">In the study, this dataset was used to show that, for financial forecasting with limited data, the ensemble-based Random Forest model was the more effective and reliable choice, owing to its robustness against overfitting.</p>
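<p dir="ltr">The chronological split and the 16-day lookback windows described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the helper names <code>chronological_split</code> and <code>make_windows</code> are hypothetical, the arrays hold random placeholder values rather than the real data, and feature scaling is left out because the description does not say which scaler was used. Since the 12-row test set is shorter than the lookback window, the sketch also assumes that test windows borrow the last 16 training rows as context.</p>

```python
import numpy as np

N_ROWS, N_FEATURES, LOOKBACK = 60, 28, 16  # sizes quoted in the description


def chronological_split(data, train_fraction=0.8):
    """Split rows in time order (no shuffling)."""
    cut = round(len(data) * train_fraction)
    return data[:cut], data[cut:]


def make_windows(features, target, lookback):
    """Build (samples, lookback, n_features) windows.

    Window i covers rows i .. i + lookback - 1 and is paired with the
    target of the following row, target[i + lookback].
    """
    n_samples = len(features) - lookback
    X = np.stack([features[i:i + lookback] for i in range(n_samples)])
    y = target[lookback:]
    return X, y


# Placeholder values only; substitute the real feature table and 'Close' column.
rng = np.random.default_rng(0)
features = rng.normal(size=(N_ROWS, N_FEATURES))
close = rng.normal(size=N_ROWS)

train_X, test_X = chronological_split(features)  # 48 and 12 rows
train_y, test_y = chronological_split(close)

X_train, y_train = make_windows(train_X, train_y, LOOKBACK)

# Assumption: each test-day window reuses the last LOOKBACK training rows as context.
test_ctx_X = np.concatenate([train_X[-LOOKBACK:], test_X])
test_ctx_y = np.concatenate([train_y[-LOOKBACK:], test_y])
X_test, y_test = make_windows(test_ctx_X, test_ctx_y, LOOKBACK)

print(X_train.shape, X_test.shape)
```

<p dir="ltr">With these sizes the script prints <code>(32, 16, 28) (12, 16, 28)</code>: the 48 training rows yield only 32 training sequences once the first 16 rows are consumed as context, which shows how little data the sequence models have to learn from.</p>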
