This study investigated seven machine learning (ML) methods for predicting the severity level of large truck crashes on Wyoming road networks. Four are tree-based classification models, namely Adaptive Boosting (AdaBoost), Random Forest, Gradient Boosting Decision Tree (GBDT) and Extreme Gradient Boosting (XGBoost), and three are non-tree-based models, namely Support Vector Machines (SVM), Multi-layer Perceptron (MLP) and k-Nearest Neighbors (k-NN). The predictive performance of these seven methods was then compared. The optimized Random Forest model achieved the highest final ROC AUC score, 95.296%. It was followed by k-NN with 92.780%, MLP with 87.817%, XGBoost with 86.542%, GBDT with 74.824%, SVM with 72.648% and AdaBoost with 67.232%. The top 10 predictors of severity were obtained from the feature importance plot. They fall into four groups: whether safety equipment was used, whether airbags were deployed, the gender of the driver, and whether alcohol was involved.
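The ROC AUC score used to rank the models can be illustrated with a minimal sketch. It assumes binary severity labels (1 for the severe class, 0 otherwise) and uses the rank-sum (Mann-Whitney U) identity. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the study, whose own tooling is not stated here.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: iterable of predicted scores, higher = more likely positive.
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC is undefined unless both classes are present")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Toy data (hypothetical): four crashes, two of them in the positive class.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # prints 0.75
```

A score of 1.0 means every positive case is ranked above every negative case, and 0.5 corresponds to random ranking, which is how the percentages reported above should be read.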