Fundamentals of Machine Learning
How does machine learning (ML) differ from statistics and econometrics, and what fundamentals should one know when developing an ML model? These are the main questions I explore in this article.
Classical statistics and econometrics focus on estimating the parameters of a model under given probabilistic assumptions and on assessing how statistically significant those parameters are. ML focuses on the accuracy of the predictions rather than on the parameters. In many cases ML does not assume a particular probability distribution and uses complex, often non-parametric models, which makes it difficult to conduct statistical tests on the parameters of the model.
The goal of supervised ML techniques is to predict outputs as accurately as possible. For that reason, notions such as the bias-variance trade-off, underfitting and overfitting, and techniques such as cross-validation and bootstrapping, are central to ML. They help improve the out-of-sample performance of the model and therefore lead to more accurate predictions.
These notions and techniques arise naturally when we approach the supervised machine-learning problem. Suppose we have a data set of inputs and corresponding outputs, and we would like to find a function that matches the inputs to the outputs as well as possible. At this stage we face several questions: what restrictions should we impose on the function? How do we choose the best function among all the possibilities?
First, we need to consider the limitations of the available data: it is finite and noisy. The function should therefore capture the relationship between inputs and outputs accurately and, at the same time, be general enough to avoid explaining the noise in the data. This is where the notions of overfitting and underfitting appear. The training data is the data on which we train our model, and the test data is the data we use to test the performance of the trained model. The main task of an ML model is to perform well on the test sample. If a model forecasts well only on the data it was trained on but does not work on a new data set, we have overfitting. If the model cannot forecast well even on the training sample, we have underfitting.
These notions relate closely to model complexity. The more complex the model is and the more parameters it has, the better we are able to explain the data set on which we train it. However, too high a degree of complexity leads to poor performance on the test sample because the model starts fitting noise in the training sample. On the other hand, a model can be too simple to capture even the basic relationships in the training data. There should therefore be some optimal level of model complexity.
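The link between complexity and train/test error can be sketched in a few lines. This is only an illustration under assumptions that are not part of the article: the data come from a hypothetical relationship y = sin(x) plus Gaussian noise, the model is a k-nearest-neighbour average, and the number of neighbours k plays the role of complexity (k = 1 is the most flexible model, k equal to the sample size is the simplest). All function names are made up for the example.

```python
import math
import random

def make_data(n, rng, noise_sd=0.5):
    """Draw n noisy observations of the assumed relationship y = sin(x)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def knn_predict(x_train, y_train, x_new, k):
    """Predict by averaging the outputs of the k nearest training inputs."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k

def mse(x_train, y_train, xs, ys, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(knn_predict(x_train, y_train, x, k) - y) ** 2
              for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)
x_train, y_train = make_data(100, rng)
x_test, y_test = make_data(100, rng)

# Small k = complex model, large k = simple model.
results = {}
for k in (1, 5, 20, 100):
    train_err = mse(x_train, y_train, x_train, y_train, k)
    test_err = mse(x_train, y_train, x_test, y_test, k)
    results[k] = (train_err, test_err)
    print(f"k={k:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training error is zero while the test error is not, which is the overfitting pattern described above. With k = 100 the model predicts the training average everywhere and cannot follow the sine shape even in-sample, which corresponds to underfitting.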
The bias-variance trade-off is another way to look at the problems of underfitting and overfitting. In an overly complex model, the estimators (for example, the beta estimates in a linear model or the weights in a neural net) are highly sensitive to the inputs in the training data. This situation is called large variance: small changes in the training data can lead to dramatic changes in the estimator. In an overly simple model, the estimators have low sensitivity to the training data, but we observe large bias: on average, the estimator is not equal to the “true” parameters of the model. The graph accompanying this article illustrates these notions, which represent a fundamental concept for developing a supervised ML model.
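The trade-off can also be checked numerically. The sketch below is a simulation under assumptions that are not in the article: a hypothetical true relationship y = sin(x), Gaussian noise, and a k-nearest-neighbour average as the model. Instead of a model parameter, the quantity being estimated is the prediction at one fixed input x0, which is re-estimated on many independently drawn training sets so that its squared bias and its variance can be measured directly.

```python
import math
import random
from statistics import fmean, pvariance

def true_f(x):
    """Assumed 'true' relationship, used only for this illustration."""
    return math.sin(x)

def knn_at(x0, xs, ys, k):
    """Average the outputs of the k training inputs closest to x0."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return fmean([ys[i] for i in order[:k]])

def simulate(k, x0=math.pi / 2, n=50, trials=300, noise_sd=0.5, seed=0):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        # Each trial draws a fresh training set from the same process.
        xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        preds.append(knn_at(x0, xs, ys, k))
    bias_sq = (fmean(preds) - true_f(x0)) ** 2
    variance = pvariance(preds)
    return bias_sq, variance

flexible = simulate(k=1)   # complex model: follows individual points
rigid = simulate(k=25)     # simple model: averages half of the sample

print(f"k=1   bias^2={flexible[0]:.4f}  variance={flexible[1]:.4f}")
print(f"k=25  bias^2={rigid[0]:.4f}  variance={rigid[1]:.4f}")
```

In this setup the flexible model has the larger variance, because each prediction inherits the noise of a single observation, while the rigid model has the larger bias, because it averages over a wide region in which the sine curve lies below its peak at x0.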
Now that we understand the problem of finding the optimal complexity of the model, we can try to address it. This is where cross-validation and bootstrapping come into play. The idea of cross-validation is to split the data artificially into two or even three sets: training, validation and test sets. The training set is used to train the model, the validation set is used to adjust the hyper-parameters of the model or to find its optimal complexity, and the test set is used for final testing. In bootstrapping we artificially generate additional samples by drawing, with replacement, from the empirical distribution of the initial data sample. Retraining the model on these resamples reduces the dependence of the estimates on the particularities of the initial data set. Using these two techniques, we can find the optimal complexity of the model.
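A minimal sketch of both techniques follows, under assumptions that are not in the article: hypothetical data generated as y = sin(x) plus noise, a k-nearest-neighbour average as the model, and the number of neighbours k as the hyper-parameter to tune. The split shown is a single training/validation/test split as described above; the common k-fold variant, which rotates the validation set, is not shown. The bootstrap part resamples the training set with replacement and retrains on each resample.

```python
import math
import random
from statistics import fmean, pstdev

def knn_predict(train, x_new, k):
    """Average the outputs of the k training points nearest to x_new."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, points, k):
    """Mean squared error on `points` of a k-NN model built on `train`."""
    return fmean([(knn_predict(train, x, k) - y) ** 2 for x, y in points])

# Hypothetical data set of 150 (input, output) pairs.
rng = random.Random(1)
data = []
for _ in range(150):
    x = rng.uniform(0.0, 2.0 * math.pi)
    data.append((x, math.sin(x) + rng.gauss(0.0, 0.5)))

# Split into training, validation and test sets (60% / 20% / 20%).
rng.shuffle(data)
train_set, val_set, test_set = data[:90], data[90:120], data[120:]

# Use the validation set to choose the hyper-parameter k.
candidates = (1, 3, 5, 10, 20, 40)
val_mse = {k: mse(train_set, val_set, k) for k in candidates}
best_k = min(val_mse, key=val_mse.get)

# The test set is touched only once, for the final check.
test_mse = mse(train_set, test_set, best_k)

# Bootstrap: resample the training set with replacement, retrain,
# and record how much the validation error moves between resamples.
boot_mse = []
for _ in range(200):
    resample = rng.choices(train_set, k=len(train_set))
    boot_mse.append(mse(resample, val_set, best_k))
boot_sd = pstdev(boot_mse)

print(f"chosen k={best_k}, test MSE={test_mse:.3f}")
print(f"bootstrap validation MSE: mean={fmean(boot_mse):.3f}, sd={boot_sd:.3f}")
```

The spread of the bootstrap errors gives a rough sense of how much the result depends on the particular training sample that happened to be drawn.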
The final concept I consider fundamental in developing an ML model is comparison with a benchmark model that is more transparent and better understood. In financial applications it is often the case that a simple model based on business judgement and basic statistics gives good results, whereas a complex ML model may give a slight improvement at the cost of transparency.
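Such a comparison might look like the following sketch. Everything in it is an assumption made for illustration: the data are simulated from a mostly linear relationship with a mild curvature, the transparent benchmark is a straight line fitted by ordinary least squares, and the "complex" model is a k-nearest-neighbour average with k = 10.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def knn_predict(train, x_new, k):
    """Average the outputs of the k training points nearest to x_new."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

def draw(n, rng):
    """Hypothetical data: linear trend, mild curvature, Gaussian noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        points.append((x, 2.0 * x + 0.3 * x * x + rng.gauss(0.0, 0.3)))
    return points

rng = random.Random(2)
train_set, test_set = draw(200, rng), draw(200, rng)

# Transparent benchmark: two parameters that can be read and explained.
intercept, slope = fit_line(train_set)
bench_mse = sum((intercept + slope * x - y) ** 2
                for x, y in test_set) / len(test_set)

# More flexible model, with no parameters to interpret.
model_mse = sum((knn_predict(train_set, x, 10) - y) ** 2
                for x, y in test_set) / len(test_set)

# Relative reduction in test error versus the benchmark (can be negative).
improvement = 1.0 - model_mse / bench_mse
print(f"benchmark MSE={bench_mse:.4f}  model MSE={model_mse:.4f}  "
      f"improvement={improvement:+.1%}")
```

Reporting the improvement relative to the benchmark, rather than the model's error alone, makes it easier to judge whether the gain justifies the loss of transparency.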
Ivan Zhdankin is a contributor to QuantNews and a member of The Thalesians.