Research

An Adaptive Failure-Aware Scheduler for Hadoop

 
Mbarka Soualhia, Foutse Khomh and Sofiene Tahar

Contact: soualhia@ece.concordia.ca

Hadoop has become the framework of choice on many off-the-shelf clusters for processing large data sets in the cloud. However, given the complexity and dynamic nature of the cloud, the Hadoop scheduler still makes poor scheduling decisions that lead to task failures caused by unforeseen events. In this project, we present new approaches for modeling and verifying an adaptive failure-aware scheduler for Hadoop that detects these failures early and reschedules tasks according to changes in the cloud. To detect task failures early, we use machine learning algorithms trained on previously executed tasks. To improve Hadoop scheduling decisions, we use reinforcement learning techniques to select an appropriate scheduling action for each scheduled task. Furthermore, we propose an adaptive algorithm that dynamically adjusts the communication between the JobTracker and the TaskTrackers in order to quickly detect failures of these nodes. Finally, we propose a new methodology, based on model checking techniques, to formally identify the impact of Hadoop's scheduling decisions on failure rates and to provide possible strategies to avoid these failures. To show the benefits of our approaches, we have built ATLAS: an AdapTive Failure-Aware Scheduler for Hadoop. ATLAS outperforms existing Hadoop schedulers (FIFO, Fair, and Capacity) in terms of failure rates, execution times, and the amount of resources used in a Hadoop cluster.
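To make the idea of selecting a scheduling action with reinforcement learning more concrete, the following is a minimal Python sketch. It is not ATLAS code: the choice of tabular Q-learning, the state and action names, the reward values, and the `SchedulingPolicy` class are all assumptions made for illustration. The summary above says only that reinforcement learning techniques are used.

```python
import random
from collections import defaultdict

# Hypothetical scheduling actions, named here for illustration only.
ACTIONS = (
    "run_as_planned",
    "reschedule_on_other_node",
    "launch_speculative_copy",
)


class SchedulingPolicy:
    """Tabular Q-learning over (state, action) pairs (an assumed technique)."""

    def __init__(self, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor for future reward
        self.epsilon = epsilon  # probability of exploring a random action
        self.q = defaultdict(float)  # Q-values, defaulting to 0.0
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy choice: explore sometimes, otherwise pick the best known action."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update toward reward + gamma * max_a Q(next_state, a)."""
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


# Usage: epsilon=0.0 disables exploration so the choice is purely greedy.
policy = SchedulingPolicy(epsilon=0.0)

# Assumed feedback: a task predicted to fail succeeded after being rescheduled
# (reward +1) and failed when run as originally planned (reward -1).
policy.update("predicted_to_fail", "reschedule_on_other_node", 1.0, "done")
policy.update("predicted_to_fail", "run_as_planned", -1.0, "done")

best = policy.choose("predicted_to_fail")
print(best)
```

In this sketch the state would come from a failure predictor's output and the reward from the observed outcome of the task; how ATLAS actually encodes states, actions, and rewards is described in the publications listed below.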







Publications:



Journal Papers

[1] M. Soualhia, F. Khomh, and S. Tahar: A Dynamic and Failure-aware Task Scheduling Framework for Hadoop, In: IEEE Transactions on Cloud Computing. Accepted, pp. 1-14, January 2018.

[2] M. Soualhia, F. Khomh, and S. Tahar: Task Scheduling in Big Data Platforms: A Systematic Literature Review, In: Journal of Systems and Software, vol. 134, pp. 170-189, Elsevier, 2017.

Conference Papers

[1] M. Soualhia, F. Khomh, and S. Tahar: ATLAS: An Adaptive Failure-Aware Scheduler for Hadoop, In: IEEE International Performance Computing and Communications Conference (IPCCC'15), Nanjing, China, December 2015, pp. 1-8.

[2] M. Soualhia, F. Khomh, and S. Tahar: Predicting Scheduling Failures in the Cloud: A Case Study with Google Clusters and Hadoop on Amazon EMR, In: IEEE High Performance Computing and Communications (HPCC'15), New York, USA, August 2015, pp. 58-65.

Technical Reports

[1] M. Soualhia, F. Khomh, and S. Tahar: ATLAS: An Adaptive Failure-Aware Scheduler for Hadoop, Technical Report, Department of Electrical and Computer Engineering, Concordia University, November 2015. [24 Pages].

[2] M. Soualhia, F. Khomh, and S. Tahar: Predicting Scheduling Failures in the Cloud, Technical Report, Department of Electrical and Computer Engineering, Concordia University, July 2015. [26 Pages].

 
 

Concordia University