# What is K-means clustering?

K-means is an iterative refinement algorithm that assigns each data point to one of K groups, or clusters. The algorithm starts with initial estimates for the K centroids (the centers of those clusters) and keeps moving the centroids until the total squared distance between the data points and their nearest centroid stops decreasing. The user generally specifies K, which is the number of centroids (clusters). The algorithm can be thought of as two repeating steps:

1. Data assignment
• Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, and therefore to that centroid’s cluster. Assignment is typically based on Euclidean distance.
2. Centroid update
• Centroids are then recomputed, or moved. Each centroid becomes the mean of all the data points assigned to its cluster.

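This process is repeated (hence “iterative”) until the assignments stop changing, at which point the sum of squared distances can no longer be reduced, or until some maximum number of iterations is reached. The result is a local minimum, so it can depend on the initial centroid estimates.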
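
The two steps above can be sketched in a few lines of NumPy. This is a toy illustration under simple assumptions (random initial centroids drawn from the data, a made-up six-point data set, and a hypothetical function name `kmeans_sketch`), not the scikit-learn implementation used later in this post.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Toy K-means: alternate assignment and centroid update until stable."""
    rng = np.random.default_rng(seed)
    # Initial estimates: k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # 1. Data assignment: index of the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. Centroid update: mean of the points assigned to each cluster
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated groups of made-up points
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

The first three points should end up in one cluster and the last three in the other. Which cluster gets id 0 depends on the random initialization.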

## In Layman’s Terms

We start with the user-specified number of centroids at random positions and assign each data point to the closest centroid. Once we’ve done that, we move each centroid to the mean of the points assigned to it, then repeat. Once we can no longer reduce the distance from data points to their respective centroids, we have found our centroids/clusters.

## Why would this be important in trading and investing?

What if certain trading activity was characteristic of high or low volatility and we could group the day’s activity into a ‘good’ or ‘bad’ group? What if we also knew that if a day qualified as part of a particular cluster then it was generally telling of the next day’s trading activity?

## The data

Let’s analyze SP500 eMini futures data from 1997 to 2018. The SP500 is an index of ~500 stocks often said to represent the US economy. The futures contract is a derivative contract that allows exposure to this index. First, let’s read in our data file and generate two features that I believe will help us ‘cluster’ our data. I want to calculate both today’s volume and today’s trading range (highest traded price minus lowest traded price) and compare each to its respective rolling 20-day average. I also calculate the next day’s return (opening price to opening price).

```
import matplotlib.pyplot as plt
from datetime import datetime
import pandas as pd
import numpy as np
import talib
from sklearn.cluster import KMeans
```
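```
# ------------------------------
#  Get Data and Features
#
def get_data(sym):
    # sym is the path to a CSV file with Date, Open, High, Low, Close and Vol columns
    df = pd.read_csv(sym, delimiter=',', index_col='Date', parse_dates=True)
    # Today's volume relative to its rolling 20-day average
    df['Vol'] = df['Vol'] / df['Vol'].rolling(20).mean()
    # Today's range relative to the 20-day average true range
    df['Rng'] = (df['High'] - df['Low']) / talib.ATR(df.High.values, df.Low.values, df.Close.values, 20)
    # Next day's open-to-open return, in points
    df['Ret'] = df.Open.shift(-2) - df.Open.shift(-1)
    df['Tar'] = 0  # placeholder for the cluster assignment
    df.dropna(inplace=True)
    return df
```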
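
For readers without TA-Lib or the original data file, here is a rough pandas-only sketch of the same three features on synthetic bars. Everything in it is an assumption made for illustration: the function name `add_features` is hypothetical, the prices and volumes are random numbers, and a simple rolling mean of the true range stands in for `talib.ATR`, which uses Wilder smoothing and will therefore give somewhat different values.

```python
import numpy as np
import pandas as pd

def add_features(df, window=20):
    """Relative volume, relative range, and next-day open-to-open return."""
    out = df.copy()
    prev_close = out['Close'].shift(1)
    # True range: the largest of the three candidate ranges for each bar
    true_range = pd.concat([out['High'] - out['Low'],
                            (out['High'] - prev_close).abs(),
                            (out['Low'] - prev_close).abs()], axis=1).max(axis=1)
    # Simple rolling mean as a stand-in for talib.ATR (not Wilder smoothing)
    atr = true_range.rolling(window).mean()
    out['Rng'] = (out['High'] - out['Low']) / atr
    out['Vol'] = out['Vol'] / out['Vol'].rolling(window).mean()
    # Open two days ahead minus open one day ahead, in points
    out['Ret'] = out['Open'].shift(-2) - out['Open'].shift(-1)
    return out.dropna()

# Synthetic daily bars (made-up numbers, not market data)
rng = np.random.default_rng(0)
n = 60
close = 2000 + rng.normal(0, 5, n).cumsum()
bars = pd.DataFrame({
    'Open': close + rng.normal(0, 1, n),
    'High': close + 5 + rng.random(n),
    'Low': close - 5 - rng.random(n),
    'Close': close,
    'Vol': rng.integers(1_000, 5_000, n).astype(float),
}, index=pd.bdate_range('2018-01-01', periods=n))

features = add_features(bars)
print(features[['Vol', 'Rng', 'Ret']].head())
```

The first 19 rows are dropped because the 20-day windows are not yet full, and the last two rows are dropped because the next-day return needs two future opens.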

## Training

Let’s then split our data into training and testing sets. This allows us to fit, or build, our K-means model on one portion of our data and then test how well it generalizes on another portion. There are more sophisticated methods of doing this, but a simple train/test split will suffice for this example. I have arbitrarily elected to withhold data after January 1, 2015 for testing. I then arbitrarily selected three as “K” in our K-means algorithm. My hope is to find periods of low, medium, and high volatility, with the ideal situation being a ‘goldilocks’ amount of volatility (not too little and not too much) that would be most conducive to short-term trading. That is, enough volatility to make money but not so much that the market is too unpredictable. The K-means algorithm returns the cluster assigned to each data point in my data frame, which I store in the variable named y_kmeans.

```
# ------------------------------------
#   Split into Test and Train
#
df = get_data("ES")
# .copy() so that later column assignments do not write to a view of df
df_train = df[df.index <= datetime(2015,1,1)].copy()
df_test  = df[df.index > datetime(2015,1,1)].copy()
```
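```
# ----------------------------
#  K Means - Training
#
X = df_train[['Vol','Rng']]
# random_state fixes the initialization so cluster ids are repeatable between runs
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
y_kmeans = kmeans.predict(X)
df_train['Tar'] = y_kmeans
```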
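
One thing to keep in mind is that K-means numbers its clusters in no particular order, so cluster 0 is not necessarily the low volatility group. A small sketch of one way to name the clusters is to rank the centroids by their range coordinate. The `centers` array below is made up and only stands in for `kmeans.cluster_centers_`; the names ‘Low’, ‘Med’ and ‘High’ are likewise just labels chosen for this illustration.

```python
import numpy as np

# Hypothetical centroids in (Vol, Rng) order, standing in for kmeans.cluster_centers_
centers = np.array([[1.6, 1.7],   # cluster id 0
                    [0.7, 0.6],   # cluster id 1
                    [1.0, 1.0]])  # cluster id 2

# Cluster ids sorted from smallest to largest centroid range
order = np.argsort(centers[:, 1])
names = dict(zip(order.tolist(), ['Low', 'Med', 'High']))

labels = np.array([1, 2, 0, 2])   # example cluster ids for four days
regimes = [names[i] for i in labels.tolist()]
print(regimes)  # ['Low', 'Med', 'High', 'Med']
```

Ranking by the volume coordinate instead would be an equally reasonable choice here, since the two features tend to move together.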

I then plot the clusters, and we can see three distinct groups. We can also imagine a line of best fit showing a clear relationship between higher volume and higher volatility (range). The x-axis is our ‘Vol’ column and the y-axis is our ‘Rng’ column.

```
# --------------------------
#  Plot Training
#
centers = kmeans.cluster_centers_
plt.scatter(df_train['Vol'],df_train['Rng'],c=y_kmeans)
plt.scatter(centers[:,0],centers[:,1],c='red',s=100,marker='x')
plt.show()
```

## Testing

Now that we have fit our K means algorithm we can run it on our out of sample or testing data. The code to do this (and plot the testing clusters) is almost identical except we have used our testing dataframe.

```
# -------------------------------
#  K Means - Testing
#
x = df_test[['Vol','Rng']]
y_kmeans = kmeans.predict(x)
df_test['Tar'] = y_kmeans
```
```
# --------------------------------
#  Plot Testing
#
plt.scatter(df_test['Vol'],df_test['Rng'],c=y_kmeans)
plt.scatter(centers[:,0],centers[:,1],c='red',s=100,marker='x')
plt.show()
```

We can see the three clusters and centroids appear relatively stable in our test data set! This is encouraging and gives some confidence that our algorithm will generalize to new data, which in our case means live trading data.

## Training and Testing Comparison

It is also useful to compare the train/test splits to make sure everything is ‘stable’. Here is a snippet of code that prints the total points earned if we were to trade the next day based on the current day’s cluster. Note: in futures trading, points refer to the change in the contract’s price. A move from 2005 to 2010 would be 5 points. In this specific contract that would translate to \$250 per contract (\$50 multiplier for SP500 eMini futures).

```
# ------------------------------------------
#  Compare Training and Testing
#
print("Total Points Earned by Cluster Prediction")

# Printed cluster numbers are 1-based; the 'Tar' column holds ids 0, 1 and 2
print("Cluster 1 Train: %.2f\tCluster 1 Test: %.2f" % (df_train['Ret'].loc[df_train['Tar'] == 0].sum(),df_test['Ret'].loc[df_test['Tar'] == 0].sum()))

print("Cluster 2 Train: %.2f\tCluster 2 Test: %.2f" % (df_train['Ret'].loc[df_train['Tar'] == 1].sum(),df_test['Ret'].loc[df_test['Tar'] == 1].sum()))

print("Cluster 3 Train: %.2f\tCluster 3 Test: %.2f" % (df_train['Ret'].loc[df_train['Tar'] == 2].sum(),df_test['Ret'].loc[df_test['Tar'] == 2].sum()))
```
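
The point totals printed above can be turned into dollars with the \$50 multiplier mentioned in the note. A minimal sketch, where the helper name `points_to_dollars` is just a name made up for this example:

```python
ES_MULTIPLIER = 50.0  # dollars per point for the SP500 eMini, as stated in the text

def points_to_dollars(points, contracts=1, multiplier=ES_MULTIPLIER):
    """Convert a move in index points into dollars for a number of contracts."""
    return points * multiplier * contracts

move = 2010 - 2005                 # the 5 point move from the example
print(points_to_dollars(move))     # 250.0
print(points_to_dollars(move, 3))  # 750.0 for three contracts
```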

## Equity Curves

The quickest way to see if something is useful is to take a brief look at the equity curve. This is the cumulative sum of the returns a trading strategy would have produced had you followed its rule(s) historically. For example, had we traded only on the day after a day that qualified as our goldilocks volatility environment, we would have outpaced the other two clusters! Furthermore, trading on days following a classification into the low or high volatility cluster would have resulted in negative returns and choppy results for the trading account from 2015 to 2018!

```
# ------------------------------
#  Equity Curves
#
# The id-to-label mapping below reflects the author's run; K-means ids can differ between runs
plt.plot(np.cumsum(df_test['Ret'].loc[df_test['Tar'] == 0]),label='Cluster Low')
plt.plot(np.cumsum(df_test['Ret'].loc[df_test['Tar'] == 1]),label='Cluster High')
plt.plot(np.cumsum(df_test['Ret'].loc[df_test['Tar'] == 2]),label='Cluster Med')
plt.legend()
plt.show()
```

## Going a step further!

K Means, Python, and other Machine Learning with Build Alpha software.

Ok, this K means filter is simple, worked out of sample on our testing data, but is almost too simple. For example, the only thing we do is check the cluster assignment at the end of the day (market’s close) and if it is the middle volatility cluster then we buy the next day’s open and hold for one day. We reevaluate at the close of each day.

What if there was a simple way to test this simple signal in combination with thousands of other signals like price action, volume, volatility, intermarket signals, signals from additional timeframes, technical indicators like Relative Strength Index? What if there was a simple way to add stop loss orders or profit taking orders?

Well, in Build Alpha we can do all this. All we need to do is create a custom Python signal and turn our ‘medium cluster assignment’ into a Python list of binary signals.

First, we create a custom indicator through the file menu. Then we select the Type ‘Python’ and copy and paste the code from this blog (removing the plot and print statements). We then return a list named Signal that contains a 1 if the data point is in our middle volatility cluster or a 0 if it is in either the low or the high cluster. Now, from a point-and-click interface, we can combine our signal with thousands of other pre-built or custom-built signals, add stop losses and profit targets, and change the security our strategy is tested on (gold, oil, bonds, stocks, ETFs, FX pairs, cryptocurrencies, etc.). We can even stress test our strategy with a suite of robustness and validation methods.
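
As a rough sketch of that last step, turning cluster assignments into a binary list might look like the following. The helper `to_signal` and the sample cluster ids are made up for illustration, and which id corresponds to the middle volatility cluster depends on your own K-means run. How Build Alpha consumes the list is not shown here.

```python
def to_signal(cluster_ids, target_cluster):
    """Return 1 where the day's cluster matches the target cluster, else 0."""
    return [1 if c == target_cluster else 0 for c in cluster_ids]

cluster_ids = [2, 0, 2, 1, 2]   # made-up daily cluster assignments
Signal = to_signal(cluster_ids, target_cluster=2)
print(Signal)  # [1, 0, 1, 0, 1]
```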

Here is a simple example of using our signal along with some other random signals and a stop. You can see we have far and away beaten our benchmark, the SP500 index (red). This example uses two extra rules plus our K-means signal, and I have also added a simple stop loss. Please see trading risk disclosures at Buildalpha.com/disclaimers

It has never been simpler to conduct Python research for trading and investing. I hope this example shows how endless the possibilities are for those with even simple Python skills. If you have questions about futures contracts, trading/investing, the Build Alpha software, or machine learning, please feel free to email me at david@buildalpha.com. Thanks for reading.
