Implementation of Mel-Frequency Cepstral Coefficient As Feature Extraction Method On Speech Audio Data

Authors

  • Andre Julio Marbun, Pembangunan Nasional "Veteran" Yogyakarta University
  • Heriyanto, Pembangunan Nasional "Veteran" Yogyakarta University
  • Frans Richard Kodong, Pembangunan Nasional "Veteran" Yogyakarta University

DOI:

https://doi.org/10.31315/telematika.v21i3.12339

Keywords:

Mel-Frequency Cepstral Coefficient, Support Vector Machine, Emotion Classification, Signal Processing

Abstract

Sound cannot be processed directly by machines; a feature extraction step must be performed first. Many feature extraction methods are currently available, so choosing the right one is not easy. One of the most frequently used feature extraction methods for sound signals is the Mel-Frequency Cepstral Coefficient (MFCC). MFCC works on a principle that resembles the human auditory system, which is why it is widely used in recognition tasks based on sound signals. This research uses MFCC to extract features from voice signals and a Support Vector Machine (SVM) to classify emotions in the RAVDESS dataset. MFCC consists of several stages: pre-emphasis, frame blocking, windowing, Fast Fourier Transform, mel-scaled filterbank, Discrete Cosine Transform, and cepstral liftering.

The test design used in this research is parameter tuning, which aims to find the parameter values that give the machine learning model the best accuracy. The parameters tuned are the α value in the pre-emphasis stage, the frame length and overlap length in the frame blocking stage, the number of mel filters in the mel-scaled filterbank stage, the number of cepstral coefficients in the Discrete Cosine Transform stage, and the C value of the SVM.

The best accuracy for male voices, 85.71%, was obtained with a pre-emphasis coefficient of 0.95, a frame length of 0.023 s, an overlap between adjacent frames of 40%, 24 mel filters in the mel-scaled filterbank, 24 cepstral coefficients, and an SVM C value of 0.01. The best accuracy for female voices, 92.21%, was obtained with a pre-emphasis coefficient of 0.95, a frame length of 0.023 s, an overlap between adjacent frames of 40%, 24 mel filters, 13 cepstral coefficients, and an SVM C value of 0.01. Comparing the two tuning results, the male and female models share the same value for every parameter except the number of cepstral coefficients: 24 for male voices and 13 for female voices.

Based on this research, the combination of the MFCC and SVM methods can be used to classify emotion from voice intonation, with an accuracy of 85.71% for men and 92.21% for women. The difference in accuracy between the male and female models comes from the different data used: the male model is trained on male voice data and the female model on female voice data, because men and women have different voice frequency ranges.
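To make the pipeline concrete, the following is a minimal sketch of the described feature extraction and classification steps, assuming Python with librosa and scikit-learn. The RAVDESS file list, label handling, train/test split, time-averaging of frame-level coefficients, and SVM kernel are illustrative assumptions, not details reported in the paper.

# Minimal sketch of the MFCC + SVM pipeline described above (assumed setup:
# Python, librosa, scikit-learn; not the authors' exact implementation).
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

def extract_mfcc(path, alpha=0.95, frame_len_s=0.023, overlap=0.40,
                 n_mels=24, n_mfcc=13):
    """Return one utterance-level MFCC feature vector for an audio file."""
    y, sr = librosa.load(path, sr=None)
    # Pre-emphasis: y[t] - alpha * y[t-1], with alpha = 0.95 as in the paper
    y = librosa.effects.preemphasis(y, coef=alpha)
    # Frame blocking: 0.023 s frames with 40% overlap between adjacent frames
    n_fft = int(frame_len_s * sr)
    hop_length = int(n_fft * (1.0 - overlap))
    # Windowing, FFT, mel-scaled filterbank, DCT, and cepstral liftering are
    # all handled inside librosa.feature.mfcc
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft,
                                hop_length=hop_length, n_mels=n_mels,
                                window="hamming", lifter=22)
    # Average each coefficient over time to obtain a fixed-length vector
    return mfcc.mean(axis=1)

def train_emotion_svm(files, labels, n_mfcc=13, C=0.01):
    """Train and evaluate an emotion classifier on a list of audio files."""
    X = np.array([extract_mfcc(f, n_mfcc=n_mfcc) for f in files])
    y = np.array(labels)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    clf = SVC(C=C)  # kernel not specified in the abstract; default RBF assumed
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

In this sketch, n_mfcc would be set to 24 for the male-voice model and 13 for the female-voice model, mirroring the tuned values reported above, and the two models would be trained on the male and female subsets of RAVDESS separately.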

Published

2024-10-31