Percentage Improvement in Speed: 7784.80%
PySpark was 78.85 times as fast as Pandas, i.e. far more scalable (a 78.85× speedup corresponds to the (78.85 − 1) × 100 ≈ 7785% improvement quoted above).
Here I have tried 20 models for time series forecasting on a Bitcoin dataset from 17-7-2010 to 9-9-2024; the best performance was given by the Ensemble model.
Scalability was tested by forecasting the Top 50 cryptocurrencies: the forecasting step was wrapped in a Pandas UDF and integrated with PySpark (a minimal sketch is shown below).
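A minimal sketch of that scaling approach, assuming a long-format Spark DataFrame with one row per coin per day; the column names (`symbol`, `ds`, `y`), the Parquet path, and the naive placeholder forecast are illustrative assumptions, not the project's actual code:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("top50-crypto-forecast").getOrCreate()

# Assumed input: one row per coin per day with columns symbol, ds (date), y (close price)
prices = spark.read.parquet("top50_crypto_prices.parquet")  # hypothetical path

def forecast_one_coin(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receives one coin's full history as a plain pandas DataFrame on an executor."""
    pdf = pdf.sort_values("ds")
    # Placeholder: naive forecast (carry the last observed price forward one day);
    # the project's trained model would slot in here instead.
    last_price = float(pdf["y"].iloc[-1])
    return pd.DataFrame({"symbol": [pdf["symbol"].iloc[0]], "yhat": [last_price]})

# applyInPandas runs the pandas function once per coin, in parallel across the cluster
forecasts = prices.groupBy("symbol").applyInPandas(
    forecast_one_coin, schema="symbol string, yhat double"
)
forecasts.show()
```

Because `applyInPandas` hands each group to ordinary pandas code, the same per-series model can be reused unchanged while Spark parallelises it across all 50 coins. The 20 models tried: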
- AR
- MA
- ARMA
- ARIMA
- SARIMA
- Naive Forecast
- AutoARIMA
- ExponentialSmoothing (Requires Smoothing)
- Random Forest (TF-DF)
- Gradient Boosted Trees (TF-DF)
- Prophet (Facebook Kats) (Requires Smoothing)
- Dense Model (Window = 7, Horizon = 1; see the windowing sketch after this list)
- Dense Model (Window = 30, Horizon = 1)
- Dense Model (Window = 30, Horizon = 7)
- Conv1D (Window = 7, Horizon = 1)
- LSTM (Window = 7, Horizon = 1)
- Dense (Multivariate Time series)
- N-BEATS Algorithm
- Ensemble (best performer; a minimal combination sketch appears at the end of this section)
- Simple ANN Model (Future Predictions)
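The Dense, Conv1D and LSTM entries above frame forecasting as supervised learning over sliding windows (e.g. window = 7, horizon = 1 means: use the previous 7 days to predict the next day). A minimal windowing sketch; the helper name and the toy series are illustrative only:

```python
import numpy as np

def make_windows(series, window=7, horizon=1):
    """Turn a 1-D price series into (window, horizon) training pairs:
    X[i] = series[i : i+window], y[i] = series[i+window : i+window+horizon]."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i : i + window])
        y.append(series[i + window : i + window + horizon])
    return np.array(X), np.array(y)

# Toy example: 10 daily closes -> 3 windows of 7 days, each with a 1-day label
X, y = make_windows(np.arange(10.0), window=7, horizon=1)
print(X.shape, y.shape)  # (3, 7) (3, 1)
```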
- Checking whether the series is stationary using the Augmented Dickey-Fuller test (a minimal sketch of these steps follows this list)
- Making the series stationary using differencing
- Plotting the ACF and PACF plots: according to the plots, both were mostly the same; the first lag was negative, and from the second lag onwards many lags were below the threshold
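A minimal sketch of those three stationarity steps using statsmodels; the random-walk series stands in for the BTC close prices and is not the project's data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Stand-in series: a random walk behaves like a non-stationary price series
close = pd.Series(np.cumsum(np.random.randn(500)) + 100.0)

# 1) Augmented Dickey-Fuller test: a p-value above 0.05 means we fail to reject the unit root
adf_stat, p_value, *_ = adfuller(close)
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.3f}")

# 2) First-order differencing to remove the trend, then re-test
diffed = close.diff().dropna()
print(f"p-value after differencing = {adfuller(diffed)[1]:.3f}")

# 3) ACF / PACF plots of the differenced series to judge the AR / MA orders
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(diffed, ax=axes[0])
plot_pacf(diffed, ax=axes[1])
plt.show()
```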
- The test dataset was 362 days (around 1 year); this is the main reason the forecasts from FB-Prophet, ARIMA, ARMA, AR, MA, etc. are not so good.
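For the Ensemble entry, one common way to combine member forecasts is to take the mean or median of the individual models' predictions. The sketch below shows only that combination step, with made-up member forecasts; how the project actually builds its ensemble members is not described in this section:

```python
import numpy as np

def ensemble_forecast(member_preds, how="median"):
    """Combine forecasts from several models; member_preds has shape (n_models, horizon)."""
    member_preds = np.asarray(member_preds, dtype=float)
    return member_preds.mean(axis=0) if how == "mean" else np.median(member_preds, axis=0)

# Made-up forecasts from three hypothetical member models over a 7-day horizon
preds_a = np.array([58000, 58200, 58100, 58500, 58300, 58400, 58600], dtype=float)
preds_b = preds_a * 1.01
preds_c = preds_a * 0.99
print(ensemble_forecast([preds_a, preds_b, preds_c], how="median"))
```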
