A comparison of several machine learning techniques to classify music pieces according to the emotion they evoke.
The task
Our goal was to build a classifier for music pieces that, given automatically extracted frequency features such as Mel Frequency Cepstral Coefficients (MFCCs), labels each piece with the emotions it evokes in a listener.
The possible emotions (target variables) were: amazed/surprised, happy/pleased, relaxing/calm, quiet/still, sad/lonely and angry/aggressive. Each data sample can belong to more than one category, or to none at all. This makes the problem quite challenging: were it treated as a single multiclass task, there would be \(2^6 = 64\) possible (and very unbalanced) classes, too many for the limited dataset we had (~400 training samples); however, if we treated each category as a separate binary classification, we would ignore the implicit correlation between some of the target variables, which in some cases was \(>0.5\) in absolute value (e.g. ‘relaxing’ and ‘aggressive’ rarely go together).
Our work
We tried the following approaches:
- 6 separate binomial Generalized Linear Models (GLMs), to classify each emotion independently. This is not ideal, as mentioned before, because it ignores the correlation between targets, but it is a simple approach that serves as a baseline.
- 6 binomial chained GLMs. Following the Classifier Chains idea of Read et al., we improved the previous setup by feeding the predictions of earlier models to later ones as extra attributes.
- Linear Discriminant Analysis (LDA), to obtain new variables that separate the data better than the original ones. We then use 6 independent binomial GLMs on these new variables.
- Principal Component Analysis (PCA), with the same idea as in LDA, but this time obtaining more variables that don’t necessarily separate the different classes.
- k-Nearest Neighbours (kNN), a very simple classifier that requires no explicit training phase. We chose the value of \(k\) through 10-fold Cross-Validation on the training data.
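To make the chained-GLM idea concrete, here is a minimal sketch of the second approach using a hand-rolled binomial logistic regression (gradient descent) instead of a full GLM fit; the function and variable names are illustrative, not taken from our actual code.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=500):
    # Batch gradient descent for binomial logistic regression (illustrative
    # stand-in for a binomial GLM fit).
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_logistic(w, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def fit_chain(X, Y):
    # Fit one binomial model per label; each later model also receives the
    # earlier models' predicted probabilities as extra attributes.
    models, Xaug = [], X
    for k in range(Y.shape[1]):
        w = fit_logistic(Xaug, Y[:, k])
        models.append(w)
        Xaug = np.hstack([Xaug, predict_logistic(w, Xaug).reshape(-1, 1)])
    return models

def predict_chain(models, X):
    # Predict labels in the same order, augmenting the attributes as we go.
    Xaug, cols = X, []
    for w in models:
        p = predict_logistic(w, Xaug).reshape(-1, 1)
        cols.append(p)
        Xaug = np.hstack([Xaug, p])
    return np.hstack(cols)
```

Dropping the chaining (always passing the original `X` to `fit_logistic`) recovers the 6-independent-GLMs baseline.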
Measure of error
We measure the error using two different metrics, Cross-Entropy and the Mean Square Error.
The Cross-Entropy for a single class \(k\) is defined as:
$$E_k := -\sum_{n = 1}^{N} \left[ t_{n,k} \cdot \ln(\hat{y}_{n,k}) + (1-t_{n,k}) \cdot \ln(1-\hat{y}_{n,k}) \right]$$
where \(t_{n,k} \in \{0, 1\}\) is the binary variable with value 1 if the sample \(x_n\) corresponds to the class \(k\) and 0 otherwise, and \(\hat{y}_{n,k} = y_k(x_n) \in [0, 1]\) is the probability that the model predicted for class \(k\) based on sample \(x_n\). In order to obtain the overall cross-entropy of our model \(E\), we average the cross-entropy for all variables \(E_k\):
$$E := \frac{1}{K} \sum_{k = 1}^K E_k$$
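The averaged cross-entropy above can be computed in a few lines (the clipping constant `eps` is an implementation detail we add here to avoid \(\ln(0)\), not part of the definition):

```python
import numpy as np

def cross_entropy(T, Y, eps=1e-12):
    # T: (N, K) binary targets t_{n,k}; Y: (N, K) predicted probabilities.
    Y = np.clip(Y, eps, 1 - eps)           # guard against log(0)
    per_label = -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y), axis=0)
    return per_label.mean()                # average E_k over the K labels
```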
Note that this is not applicable to the kNN classifier, since it does not output a probability of the sample belonging to each class, but rather gives a hard decision. This is why we also use the MSE as an alternative metric. The Mean Square Error is defined as:
$$MSE := \frac{1}{N} \sum_{n = 1}^{N} (t_n - \hat{y}_n)^2$$
with the same parameters as before. Again, we need to expand the definition to account for the multiple target classes:
$$MSE := \frac{1}{K} \sum_{k = 1}^{K} MSE_k = \frac{1}{KN} \sum_{k = 1}^{K} \sum_{n = 1}^{N} (t_{n,k} - \hat{y}_{n,k})^2$$
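Because averaging per-label MSEs over \(K\) labels is the same as averaging the squared errors over all \(K \cdot N\) entries, the multi-label MSE reduces to a single mean. Note that it also accepts the hard 0/1 predictions produced by kNN:

```python
import numpy as np

def multilabel_mse(T, Y):
    # T: (N, K) binary targets; Y: (N, K) probabilities or hard 0/1 decisions.
    # Equivalent to averaging the per-label MSE_k over the K labels.
    return np.mean((T - Y) ** 2)
```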
Results
LDA proved to be the best approach, followed closely by the independent binomial GLMs and PCA. The Chained Model did not improve on its baseline.
When evaluated on the test partition of the dataset (1/3 of the data, held out from training and Cross-Validation), LDA obtained an MSE of 19.98% (0% being a perfect model) and a cross-entropy of 95.55.