Google has open sourced TensorFlow RunTime (TFRT), a new TensorFlow runtime that aims to provide a unified, extensible infrastructure layer with best-in-class performance across a wide variety of domain-specific hardware.
According to Google, TFRT is responsible for the efficient execution of kernels – low-level device-specific primitives – on targeted hardware. It has a key role to play in both eager and graph execution.
TFRT will benefit a broad range of users, including:
- researchers looking to experiment with complex new models and add custom operations to TensorFlow;
- application developers looking for improved performance when serving models in production;
- and hardware makers looking to integrate their hardware, including edge and datacenter devices, into TensorFlow.
As part of a study for TensorFlow Dev Summit 2020, Google integrated TFRT with TensorFlow Serving and measured the latency of sending requests to the model and getting prediction results back.
“We picked a common MLPerf model, ResNet-50, and chose a batch size of 1 and a data precision of FP16 to focus our study on runtime related op dispatch overhead. In comparing the performance of GPU inference over TFRT to the current runtime, we saw an improvement of 28% in average inference time. These early results are strong validation for TFRT, and we expect it to provide a big boost to performance,” explained TFRT product manager Eric Johnson and TFRT tech lead Mingsheng Hong in a blog post.
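The measurement described above — timing repeated requests to a served model and averaging — can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration of that methodology, not Google's benchmark code: `infer` stands in for a request to a model served behind TensorFlow Serving (such as ResNet-50 at batch size 1), and the warm-up loop is an assumed step to exclude one-time startup costs.

```python
import statistics
import time

def measure_average_latency(infer, batch, num_warmup=10, num_runs=100):
    """Time repeated calls to an inference function and return the mean latency.

    `infer` is a stand-in for sending a request to a served model;
    here it can be any callable that accepts a batch.
    """
    # Warm-up runs let caches, compilation, and connection setup settle
    # so they do not skew the measured average.
    for _ in range(num_warmup):
        infer(batch)

    latencies = []
    for _ in range(num_runs):
        start = time.perf_counter()
        infer(batch)
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)

if __name__ == "__main__":
    # Stand-in "model": a batch-size-1 inference simulated with a short sleep.
    avg = measure_average_latency(lambda batch: time.sleep(0.001), batch=[0])
    print(f"average inference time: {avg * 1000:.2f} ms")
```

Comparing this average across two runtimes (TFRT versus the current runtime) on the same model and batch size is how a relative improvement like the reported 28% would be computed.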
TFRT is being integrated with TensorFlow, and will be enabled initially through an opt-in flag. This will provide the team with time to fix any bugs and fine-tune performance, the post added. Eventually, it is expected to be TensorFlow’s default runtime.