At the Data & Analytics Team, we regularly investigate ways to improve the efficiency of our in-house RETL (Real Time Extract Transform Load) system. Any improvement can have a significant impact on its time, space, and cost complexity. This article gives an overview of our investigation into improving the performance of job scheduling in Apache Storm, the core layer underlying RETL.
Apache Storm provides a stable infrastructure for processing large streams of transaction data on the fly. It indirectly helps us serve customers valuable reports with low latency via the cubical data warehouse into which the transformed data is pushed. In a previous article, we described how ETL jobs are distributed to harness data locality inside Storm.
Round-Robin Task Scheduling
In a Storm cluster, the master node (the Nimbus daemon) is responsible for scheduling tasks among workers/executors. Currently, the Storm platform uses pseudo-random round-robin task scheduling and task placement on physical machines. This means Storm places any remaining tasks randomly among the available workers. For example, if n tasks are distributed among m workers, the probability that a task is scheduled onto any given worker is 1/m, assuming the placements are uniformly random and independent of each other. This default scheduling algorithm is simplistic and not optimal in terms of throughput and resource utilization. Default Storm considers neither the resource availability of the underlying cluster nor the resource requirements of Storm topologies when scheduling. In addition, when tasks are assigned randomly, the odds are high that some particular worker queue gets bogged down.
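To make the 1/m argument concrete, here is a small Python sketch of our own. It is not Storm code: the function names and the fixed seed are assumptions for illustration. It places n tasks uniformly at random on m workers and compares the resulting loads with a strict round robin.

```python
import random


def random_placement(num_tasks, num_workers, seed=0):
    """Place each task on a worker picked uniformly at random (probability 1/m)."""
    rng = random.Random(seed)  # fixed seed only so that runs are repeatable
    loads = [0] * num_workers
    for _ in range(num_tasks):
        loads[rng.randrange(num_workers)] += 1
    return loads


def strict_round_robin(num_tasks, num_workers):
    """Place task i on worker i mod m, so loads differ by at most one task."""
    loads = [0] * num_workers
    for task in range(num_tasks):
        loads[task % num_workers] += 1
    return loads


if __name__ == "__main__":
    n, m = 100, 10
    rand_loads = random_placement(n, m)
    rr_loads = strict_round_robin(n, m)
    print("random placement:", rand_loads, "busiest worker:", max(rand_loads))
    print("strict round robin:", rr_loads, "busiest worker:", max(rr_loads))
```

Both placements put the same total number of tasks on the cluster. Only the random one can pile several tasks onto one worker while leaving another nearly idle, and neither looks at what the tasks or the workers actually need.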
Ignoring resource demand and availability when scheduling tasks among Storm workers can be a problem. The resources of machines in the cluster can easily become over-utilized or under-utilized, which can cause problems ranging from catastrophic failure to execution inefficiency. For example, a Storm cluster can suffer a potentially unrecoverable failure if certain executors attempt to use more memory than is available. Over-utilization of resources other than memory can also cause the execution of applications to grind to a halt. If bandwidth capabilities are not taken into account, a subset of the nodes running a Storm topology might become too busy to respond. That small portion of slow nodes is enough to degrade the throughput of the entire cluster. Under-utilization, in turn, wastes resources and can cause unnecessary expenditure in the operating costs of a cluster. Thus, to maximize performance and resource utilization, an intelligent scheduler must take into account both the resource availability in the cluster and the resource demand of a Storm application in order to compute an efficient schedule.
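The over- and under-utilization described above can be checked for any given placement. The following Python sketch is purely illustrative: the function name, the 25% low watermark, and the use of memory in megabytes as the only resource are our assumptions, not anything Storm provides.

```python
def utilization_report(placement, demand_mb, capacity_mb, low_watermark=0.25):
    """Label each worker by how much of its memory the placed tasks demand.

    placement:   task name -> worker name
    demand_mb:   task name -> memory the task needs
    capacity_mb: worker name -> memory the worker offers
    """
    used = {worker: 0 for worker in capacity_mb}
    for task, worker in placement.items():
        used[worker] += demand_mb[task]

    report = {}
    for worker, capacity in capacity_mb.items():
        ratio = used[worker] / capacity
        if ratio > 1.0:
            report[worker] = "over-utilized"   # tasks want more than exists
        elif ratio < low_watermark:
            report[worker] = "under-utilized"  # capacity is mostly wasted
        else:
            report[worker] = "ok"
    return report


if __name__ == "__main__":
    placement = {"t1": "w1", "t2": "w1", "t3": "w2"}
    demand_mb = {"t1": 768, "t2": 512, "t3": 512}
    capacity_mb = {"w1": 1024, "w2": 1024, "w3": 1024}
    print(utilization_report(placement, demand_mb, capacity_mb))
```

In this made-up example, w1 is asked for 1280 MB out of 1024 MB, w2 sits at half its capacity, and w3 receives nothing at all, which is exactly the kind of imbalance a resource-blind scheduler cannot see.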
Resource Aware Task Scheduling
The problem of optimal resource-aware task scheduling runs deep enough to touch computational complexity theory. Many researchers have framed this problem as a variation of the Knapsack problem, which is NP-hard, meaning that no polynomial-time algorithm is known for solving it optimally. Because Storm needs to schedule tasks on the fly, in near real time, an approximate solution can be sought by taking into account the different types of resources on Storm workers, such as CPU, memory, and bandwidth usage. Satisfying every resource type exactly while scheduling is nearly impossible, so the resources are instead classified as constraints that the scheduler tries to satisfy with high probability. A recent study has come up with a new heuristic task scheduling algorithm called R-Storm, as shown in the figure below.
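One way to picture resources classified as constraints is to make one of them strict and let the others only influence the ranking. The Python sketch below does this under our own assumptions: memory is a hard limit, CPU and bandwidth are soft, and candidates are ranked by Euclidean distance between demand and availability. The function name and the dictionary layout are hypothetical, and this is not taken from the R-Storm source.

```python
import math


def pick_worker(task, workers):
    """Return the name of the most suitable worker, or None if none qualifies.

    task:    {"mem": ..., "cpu": ..., "bw": ...} resource demand
    workers: worker name -> {"mem": ..., "cpu": ..., "bw": ...} free resources
    """
    best_name, best_score = None, math.inf
    for name, free in workers.items():
        # Hard constraint (assumed): never place a task without enough memory.
        if free["mem"] < task["mem"]:
            continue
        # Soft constraints (assumed): the closer the free CPU and bandwidth
        # are to the demand, the better the fit.
        score = math.sqrt(
            (free["cpu"] - task["cpu"]) ** 2 + (free["bw"] - task["bw"]) ** 2
        )
        if score < best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    task = {"mem": 512, "cpu": 50, "bw": 10}
    workers = {
        "w1": {"mem": 256, "cpu": 50, "bw": 10},   # perfect fit, too little memory
        "w2": {"mem": 1024, "cpu": 80, "bw": 20},
        "w3": {"mem": 2048, "cpu": 60, "bw": 10},
    }
    print(pick_worker(task, workers))
```

A single pass over the workers like this runs in linear time, which is the point of a heuristic: it gives up optimality in exchange for a decision that can be made on the fly. In practice the resource values would have to be normalized to a common scale before a distance between them means much.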
The backbone of this scheduling algorithm is to give high priority to placing the tasks of components that communicate with each other in close network proximity. Also, based on the priority given to the constraints, Nimbus can schedule tasks onto the most suitable workers. Because randomized scheduling is avoided, resource waste on under-utilized workers is minimized. The authors' experimental results show that R-Storm achieves 30-47% higher throughput and 69-350% better CPU utilization than default Storm for the micro-benchmark topologies. For the Yahoo! Storm topologies, R-Storm outperforms default Storm by around 50% in overall throughput. R-Storm implementation can be found here. We are about to benchmark this version against RETL on our development servers. In the meantime, according to the Storm JIRA, this critical improvement is set to be plugged into an upcoming version.
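The proximity idea can be illustrated with a simplified greedy scheduler. This Python sketch is not the R-Storm implementation. The breadth-first ordering, the memory-only capacity model, and all names in it are our assumptions, chosen to show how keeping a component on the same worker as its upstream neighbour might look.

```python
from collections import deque


def bfs_order(edges, source):
    """Order components breadth-first so that neighbours are scheduled together."""
    adjacency = {}
    for up, down in edges:
        adjacency.setdefault(up, []).append(down)
    order, seen, queue = [], {source}, deque([source])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def schedule(edges, source, demand_mb, free_mb):
    """Greedy placement: stay on the upstream worker whenever memory allows."""
    free = dict(free_mb)  # work on a copy, leave the caller's data untouched
    # Simplification: remember a single upstream component per component.
    upstream = {down: up for up, down in edges}
    placement = {}
    for comp in bfs_order(edges, source):
        preferred = placement.get(upstream.get(comp))
        candidates = [preferred] if preferred else []
        # Fall back to the workers with the most free memory first.
        candidates += sorted(free, key=free.get, reverse=True)
        for worker in candidates:
            if free[worker] >= demand_mb[comp]:
                placement[comp] = worker
                free[worker] -= demand_mb[comp]
                break
        else:
            raise RuntimeError(f"no worker has enough memory for {comp}")
    return placement


if __name__ == "__main__":
    edges = [("spout", "parse"), ("parse", "count"), ("parse", "store")]
    demand_mb = {"spout": 300, "parse": 300, "count": 300, "store": 300}
    free_mb = {"w1": 700, "w2": 1000}
    print(schedule(edges, "spout", demand_mb, free_mb))
```

In this toy topology, the first three components end up together on w2, and only the last one spills over to w1 once w2 runs out of memory. Communicating components stay close, and no worker is asked for more than it has.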