Preprocessing Using SMOTE and K-Means for Classification by Logistic Regression on Pima Indian Diabetes Dataset

Ahmad Taufiq Akbar; Rochmat Husaini; Hari Prapcoyo

doi:10.31315/telematika.v20i2.9676

Authors

Ahmad Taufiq Akbar, Universitas Pembangunan Veteran Yogyakarta
Rochmat Husaini, Department Informatics Engineering, Faculty of Industrial Engineering, University of Pembangunan Nasional Veteran
Hari Prapcoyo

Keywords:

SMOTE, k-means, Logistic Regression

Abstract

Purpose: This study combines pre-processing methods to build a training data model from the Pima Indian Diabetes dataset, with the aim of improving machine learning performance in recognizing diabetes.

Design/methodology/approach: The research proceeded in several stages: collecting the Pima Indian Diabetes dataset; pre-processing, which comprised k-means clustering, oversampling with SMOTE, and then undersampling the samples that fall in the minority cluster of each class; and classifying the resulting dataset with logistic regression, evaluated through 10-fold cross-validation.
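The oversampling step can be illustrated with a minimal sketch of SMOTE-style interpolation. It is not the authors' implementation: the function name smote_like, the neighbour count k, the seed, and the toy data are assumptions made for illustration, and the abstract does not state which SMOTE settings were used. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (illustrative values, not Pima data).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class.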

Findings/result: Classification accuracy reaches 99.5%, which is higher than the methods reported in previous studies.

Originality/value/state of the art: The method uses SMOTE to handle class imbalance and k-means clustering to remove outliers: samples whose cluster does not match the majority cluster of their class are removed. This yields clean data, and validation with logistic regression is more accurate than in previous studies.
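The cluster-based cleaning described here can be sketched as follows, assuming cluster assignments have already been produced by k-means. The function name keep_majority_cluster and the toy labels are hypothetical, and the abstract does not state how many clusters were used or how ties were resolved, so this is one possible reading rather than the authors' procedure.

```python
from collections import Counter


def keep_majority_cluster(labels, clusters):
    """Return indices of samples whose cluster is the most common cluster
    within their own class; the rest are treated as outliers (hypothetical helper)."""
    counts = {}
    for y, c in zip(labels, clusters):
        counts.setdefault(y, Counter())[c] += 1
    # For each class, pick the cluster that holds most of its samples.
    # On a tie, the cluster seen first in the data is chosen (an assumption).
    majority = {y: cnt.most_common(1)[0][0] for y, cnt in counts.items()}
    return [i for i, (y, c) in enumerate(zip(labels, clusters)) if c == majority[y]]


# Toy example: class 0 falls mostly in cluster 0, class 1 mostly in cluster 1.
labels = [0, 0, 0, 0, 1, 1, 1]
clusters = [0, 0, 0, 1, 1, 1, 0]
kept = keep_majority_cluster(labels, clusters)
print(kept)  # [0, 1, 2, 4, 5]
```

In this toy example the samples at indices 3 and 6 are dropped because each sits in a cluster dominated by the other class.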

Author Biographies

Ahmad Taufiq Akbar, Universitas Pembangunan Veteran Yogyakarta

Ahmad Taufiq Akbar, S.Si., M.Cs.
Department Informatics Engineering
Faculty of Industrial Engineering
University of Pembangunan Nasional Veteran

Rochmat Husaini, Department Informatics Engineering, Faculty of Industrial Engineering, University of Pembangunan Nasional Veteran

Department Informatics Engineering
Faculty of Industrial Engineering
University of Pembangunan Nasional Veteran

Hari Prapcoyo

Department Informatics Engineering
Faculty of Industrial Engineering
University of Pembangunan Nasional Veteran
