Predicting the Price of Bitcoin using Machine Learning in R

How accurately can we predict cryptocurrency prices using time series analysis

Avi Patel
9 min read · Jul 28, 2021

Introduction

With over 5,000 different cryptocurrencies currently in circulation with a combined value of over $2 trillion, chances are the thought of investing in one of these crypto-assets has at the least crossed your mind. And as the most prominent cryptocurrency around, making up around 70% of the market, Bitcoin has become a very popular investment with individual and institutional investors alike. However, the extreme volatility of Bitcoin cannot be overstated, and with the cryptocurrency’s annualized 30-day volatility recently rising to a 14-month high, strong nerves are needed by those who choose to trade it.

With such a variety of factors influencing stock price, from social and economic events to the irrational behaviour of investors, even the most advanced machine learning algorithms haven’t been able to overcome the complexities of the stock market. But despite this, Artificial Intelligence is playing an increasingly significant role in trading and investing, with computer-aided high frequency trading accounting for around 70% of total trade volume, as stated in the Wired article linked below. I therefore thought it would be an interesting challenge to attempt to predict the price of Bitcoin by utilizing R’s extensive machine learning toolkit.

Project Goals

The aim of the project was to try to accurately predict the closing price of Bitcoin over a 10-day period using historical data. This would require investigating existing forecasting methods before implementing the most appropriate model for the job. Rather than implementing and comparing all the possible models one by one, I decided to spend more time on the research side of things, exploring what has worked well for others undertaking similar time-series forecasting projects, and then focusing on fine-tuning that one model.

Libraries Used

Here are the libraries I imported to be used in the notebook:

library("anytime")
library("xts")
library("ggfortify")
library("forecast")
library("quantmod")
  • anytime functions as a general-purpose date/time converter, parsing inputs into date-time objects regardless of the input format. It relies on the Boost date_time library for efficient conversion.
  • xts (Extensible Time Series) provides a flexible time series class with a variety of methods for manipulating time series data.
  • ggfortify is an extension to ggplot2 that lets you plot objects from some popular R packages. In our case, it is used to plot time series objects.
  • forecast takes in a fitted time series model and produces the corresponding forecasts.
  • quantmod is designed to assist with the development, testing and deployment of statistical trading models.
  • tseries provides tools for time series analysis and computational finance. In our case, it supplies the adf.test() function used for the Augmented Dickey-Fuller test.

Exploring and Cleaning the Data

Before getting stuck into the modelling, I spent some time exploring the data and having a look at the quality of it. I used the Bitcoin Historical Dataset found on Kaggle which dates back to January 2012:

# Importing the data
train <- read.csv("data.csv", header = TRUE)
head(train)
summary(train)
# Dropping rows with missing values
train <- na.omit(train)
Figure 1 : Output for head(train)
Figure 2 : Output for summary(train)

As shown by Figure 1, the timestamp is displayed in Unix time and is of no use to us in its current form. For each row, we are given the opening, highest, lowest and closing price of Bitcoin. Volume_.BTC. refers to the number of coins that have changed hands over the interval, i.e. how much Bitcoin has been bought and sold, and Volume_.Currency. is the corresponding trading volume in USD. We are also given the Volume Weighted Average Price (VWAP), a benchmark used by traders that gives the average price an asset has traded at, weighted by volume.
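As a quick, purely illustrative sanity check (assuming Weighted_Price is the per-row VWAP, which is not stated explicitly in the dataset description), the reported value should roughly equal the dollar volume divided by the BTC volume for each interval:

# Illustrative check (assumption): VWAP ≈ dollar volume / BTC volume for each row
vwap_check <- train$Volume_.Currency. / train$Volume_.BTC.
head(cbind(reported = train$Weighted_Price, computed = vwap_check))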

As indicated by the first six rows, and confirmed by the summary, there are a huge number of rows with missing values. Given that our dataset takes market data at 1-minute intervals, this missing information is not of too much concern, and can be omitted whilst still leaving plenty of data to work with.

# Converting data for analysis
train$time <- as.POSIXct(train$Timestamp, origin = "1970-01-01", tz = "GMT") # Unix time -> POSIXct
train$time <- as.Date(train$time)                                            # POSIXct -> Date
train$Weighted_Price <- as.numeric(train$Weighted_Price)
train$Volume_.BTC. <- as.numeric(train$Volume_.BTC.)
train$Volume_.Currency. <- as.numeric(train$Volume_.Currency.)
train$Timestamp <- NULL                       # drop the redundant timestamp column
train <- train[, c(8, 1, 2, 3, 4, 5, 6, 7)]   # move the new time column to the front
# Closing prices kept aside for the price comparison at the end
testdata <- train[, 5]

To convert the timestamp into a useful format, I first used the as.POSIXct method, which converts a date-time string into a POSIXct class. This stores both a date and a time with our associated time zone (GMT) as the number of seconds since the standard origin date of 1 January 1970; storing data this way optimizes use in data frames and speeds up computation and conversion to other formats. Now that I had the time and date in a POSIXct class, I could easily convert them to the Date class. I then converted the numeric fields into the correct data type, dropped the redundant timestamp field and rearranged the order of the columns.
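One thing to note: the plotting and modelling snippets below refer to a train2 data frame whose construction isn't shown. A minimal sketch of how it might have been produced, assuming it is simply the cleaned data restricted to the window used later on (from 27 April 2016, the start date passed to ts() further down):

# Assumption: train2 is the cleaned data restricted to the period from 27 April 2016 onwards
train2 <- train[train$time >= as.Date("2016-04-27"), ]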

# Plotting the Closing Price
cp <- ggplot(train2, aes(time, Close)) + geom_line() + ylab("Closing Price ($)") + xlab("Year") + ylim(0, 55000) +
  scale_x_date(date_breaks = "years", date_labels = "%Y", limits = as.Date(c("2016-01-01", "2021-03-01")))
cp + theme_bw() + labs(title = "Bitcoin Closing Price") + geom_line(size = 1, colour = "red")

Following that I built a line chart using the cleansed data to have a look at what the overall trend for the price of Bitcoin has looked like over the years:

Figure 3 : Graph showing Bitcoin Closing Price over time

Building the Forecasting Model

ARIMA

After researching potential models to use, it became clear that one of the best-performing forecasting model choices for time series analysis was ARIMA (Autoregressive Integrated Moving Average). “Autoregressive” means that the pattern of growth / decline in the data is accounted for, “integrated” means the raw values are differenced so that it is the rate of change of this growth / decline that is modelled, and “moving average” accounts for the noise between consecutive time points by modelling past forecast errors.

ARIMA models are typically expressed as ARIMA (p,d,q), with the terms p, d and q defined as follows:

  • p is the number of lag observations included in the model and is the autoregressive part
  • d is the number of times that the raw observations are differenced to produce a stationary signal
  • q is the size of the moving average window and is the number of lagged forecast error terms in the prediction equation
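To make the (p,d,q) notation a little more concrete, here is a small, purely illustrative sketch (not part of the original analysis) that simulates an ARIMA(1,1,1) process with arbitrary coefficients using R's built-in arima.sim():

# Illustrative only: simulate an ARIMA(1,1,1) process with arbitrary ar/ma coefficients
set.seed(42)
sim <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.6, ma = 0.3), n = 200)
autoplot(sim)  # the d = 1 integration step makes the simulated series wander rather than mean-revert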

Since identifying patterns in the data plays an important role in the implementation of an ARIMA model, I needed to decompose the data to look for such characteristics:

# Creating a time series object for the closing price
train_xts <- xts(train2[, -1], order.by = as.POSIXct(train2$time))
train_ts <- ts(train_xts[, 4], frequency = 700, start = c(2016, 4, 27))
# Checking for trends and seasonality
dects2 <- decompose(train_ts) # obtaining the trend, seasonal and random components
ap <- autoplot(dects2)
ap + theme_bw()

I first needed to create a time-series object for the closing price. Time series data is simply a collection of observations obtained through repeated measurements over time. Because data points in time series are collected at adjacent time periods there is potential for correlation between observations; this is one of the features that distinguishes time series data from cross-sectional data, which observes subjects at a single point in time. Here, I utilized the xts library that I mentioned at the start, as well as R’s own time series function.

Decomposing this object can be easily accomplished thanks to R’s decompose function which breaks down a time series into three components. First, the function determines the trend component using a moving average and removes it from the time series. Next, the seasonal figure is computed by averaging over all periods and then being centred. Finally, the random error component is determined by removing the trend and seasonal figures from the original time series.
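As a quick sketch of what decompose() returns (assuming the dects2 object created above), the additive components should sum back to the original series, apart from the NAs that the moving average leaves at either end:

# Sanity check: for an additive decomposition, trend + seasonal + random reconstructs the series
recon <- dects2$trend + dects2$seasonal + dects2$random
summary(as.numeric(dects2$x - recon))  # differences should be (near) zero, apart from NAs at the ends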

Figure 4 : Decomposition of the Closing Price Time Series

We can see from Figure 4 that the data has both trend and seasonality, meaning the data is not stationary. A time series is said to be stationary if its properties are independent of the time at which it is observed; most time series models require this to be the case.

To confirm this non-stationarity I used the Augmented Dickey-Fuller hypothesis test:

Null Hypothesis : The series is non-stationary

Alternative Hypothesis : The series is stationary

# Augmented Dickey-Fuller testing
closingFigures <- xts(train2[, 5], order.by = as.POSIXct(train2$time))
adf.test(closingFigures, alternative = "stationary")
Figure 5 : Dickey-Fuller Test Results

With the p-value being greater than 0.05, we fail to reject the null hypothesis and the non-stationarity of the series is confirmed. Therefore, the next step was to make the series stationary by using a method called differencing, in which each previous observation is subtracted from the current observation. This is achieved using the diff() function:

# Differencing the closing price series (second-order difference shown here)
closing_diff <- diff(train2[, 5], differences = 2)
closing_diff <- closing_diff[!is.na(closing_diff)]

After testing the first and second orders of difference (and finding that the data becomes stationary at differences = 2), I had obtained the ARIMA parameter: d = 2.
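For completeness, here is a minimal sketch of how that check might look, re-running the same adf.test() on the first- and second-order differenced series (again assuming the train2 data frame from earlier):

# Re-checking stationarity after differencing with the ADF test
adf.test(diff(train2[, 5], differences = 1), alternative = "stationary")  # first difference
adf.test(closing_diff, alternative = "stationary")                        # second difference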

ACF and PACF Plots

Now that I had an ARIMA(p,2,q) model, I needed to find the AR term (p) and the MA term (q). This required looking into an ACF plot as well as a PACF plot. The ACF (autocorrelation function) computes and plots the correlation between observations of a time series separated by k time units, and the PACF (partial autocorrelation function) measures the strength of that relationship with the other terms, in this case the intervening lags, accounted for.

acf(closing_diff)
pacf(closing_diff)
Figure 6 : ACF Plot
Figure 7 : PACF Plot

The patterns of these plots can be used to determine the p and q values. Since the PACF cuts off after 1 lag and the ACF shows a gradual decay, an AR(1) would be an appropriate choice for the series. The drop in the ACF plot after 9 lags indicated that an MA(9) would also be a good candidate for the model, and so with this, I had arrived at a model of ARIMA(1,2,9).
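As a cross-check on a manually chosen order, the forecast package also provides auto.arima(), which searches over candidate (p,d,q) values automatically; a minimal sketch (it can be slow on a long series, and may well settle on a different order):

# Optional cross-check: let auto.arima() search for a (p,d,q) order automatically
auto_fit <- auto.arima(closingFigures)
summary(auto_fit)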

With the parameters found, all that was left to be done was to plug in these figures and plot the results. Figure 8 shows the outcome of the ARIMA forecast and Figure 9 compares the prediction against the actual closing prices.

model <- arima(closingFigures, order = c(1, 2, 9))     # fitting the chosen ARIMA(1,2,9) model
model_frame <- as.data.frame(forecast(model, h = 10))  # 10-step-ahead forecast
# Pairing the forecasts with the last 10 actual closing prices for comparison
comparison <- data.frame(Date = tail(train2$time, 10),
                         Forecast = model_frame[, 1],
                         Actual = tail(testdata, 10))
plot(forecast(model, h = 10))
ggplot(comparison, aes(x = Date)) +
  geom_line(aes(y = Forecast), colour = "red") + geom_line(aes(y = Actual), colour = "blue") +
  labs(title = "Forecast vs Actual Price") + ylab("Closing Price ($)") + xlab("Date") +
  scale_x_date(date_labels = "%d %b %y")
Figure 8 : Results of the ARIMA forecast
Figure 9 : Comparison of forecasted closing price against actual price

Conclusion

As we can see in Figure 9, though the model returned results with a fair level of accuracy, it is far from perfect. It correctly predicted a dip at around February 10th, but overestimated the degree of this fall and as a result sat around $2,000 below the actual price for the next couple of days. The steep rise in both series means that the forecasted price finished around $2,500 below the actual price.
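One way to put a rough number on that gap is to compute a few simple error metrics on the comparison data frame built earlier (a quick sketch, assuming the Forecast and Actual columns from above):

# Simple error metrics for the 10-day forecast
mae  <- mean(abs(comparison$Actual - comparison$Forecast))
rmse <- sqrt(mean((comparison$Actual - comparison$Forecast)^2))
mape <- mean(abs((comparison$Actual - comparison$Forecast) / comparison$Actual)) * 100
c(MAE = mae, RMSE = rmse, MAPE = mape)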

Knowing the difficulties associated with predicting the stock market, and Bitcoin in particular, I must admit that I was pleasantly surprised with the results; a dip followed by a sharp increase was the correct prediction. All things considered, however, the fact that even the most advanced machine learning models are unable to consistently predict share prices means it would be presumptuous to overlook the possibility of luck playing a part in the performance of my model.

Avi Patel

Data analyst with a keen interest in Data Science