LinkedIn has open-sourced Kube2Hadoop, a new tool for scalable and secure integration with HDFS Kerberos. The solution enables AI modelers at LinkedIn to use HDFS data in Kubernetes pods, with access control through either a user account or a headless account.
Since Hadoop was introduced to the open source community, HDFS has been widely adopted across the industry as a distributed file system because of its scalability and robustness. As running model training on Kubernetes grows in popularity, many teams naturally want to use the massive amount of data that already exists in HDFS.
LinkedIn believes that Kube2Hadoop will benefit both the Kubernetes and Hadoop communities.
Kube2Hadoop is presented as an improvement over the Kubernetes Secret approach: it offers cleaner access control to HDFS, renews tokens automatically, and makes the token life cycle easier to manage.
Kube2Hadoop comprises three parts:
- Hadoop Token Service, for fetching delegation tokens, deployed as a Kubernetes Deployment;
- Kube2Hadoop Init Container, running in each worker pod as a client that requests a delegation token from the Hadoop Token Service;
- IDDecorator (see further below), deployed as a Kubernetes Admission Controller, for writing an authenticated user ID.
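How the three parts might fit together can be illustrated with a small in-memory simulation. This is only a sketch, not LinkedIn's implementation: the annotation key, the class and function names, the token format, and the assumption that the token service identifies a caller by looking up its pod and reading the decorated user ID are all hypothetical.

```python
import secrets
from dataclasses import dataclass, field

# Hypothetical annotation key; the key Kube2Hadoop actually uses may differ.
USER_ANNOTATION = "example.com/authenticated-user"


@dataclass
class Pod:
    name: str
    annotations: dict = field(default_factory=dict)


def id_decorator(pod: Pod, authenticated_user: str) -> Pod:
    """Stand-in for the admission controller: stamp the authenticated
    submitter's ID on the pod, overwriting any value the user supplied."""
    pod.annotations[USER_ANNOTATION] = authenticated_user
    return pod


class TokenService:
    """Stand-in for the Hadoop Token Service (no real Kerberos or HDFS calls)."""

    def __init__(self, cluster_pods: dict):
        # name -> Pod; stands in for a lookup against the Kubernetes API server.
        self.cluster_pods = cluster_pods
        self.issued = {}  # token -> user it was issued for

    def fetch_delegation_token(self, pod_name: str) -> str:
        pod = self.cluster_pods.get(pod_name)
        if pod is None or USER_ANNOTATION not in pod.annotations:
            raise PermissionError(f"no authenticated user for pod {pod_name!r}")
        user = pod.annotations[USER_ANNOTATION]
        # Placeholder token; a real service would obtain one from HDFS.
        token = f"{user}:{secrets.token_hex(8)}"
        self.issued[token] = user
        return token


def init_container(pod_name: str, service: TokenService) -> str:
    """Stand-in for the init container: request a token before the worker starts."""
    return service.fetch_delegation_token(pod_name)


if __name__ == "__main__":
    pods = {}
    service = TokenService(pods)
    # The user tries to claim another identity; the decorator overrides it.
    worker = id_decorator(Pod("worker-0", {USER_ANNOTATION: "someone-else"}), "alice")
    pods[worker.name] = worker
    token = init_container(worker.name, service)
    print("token issued for:", service.issued[token])
```

In this sketch the decorator always overwrites the user-supplied annotation, so a pod cannot obtain a token on behalf of an identity it did not authenticate as.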
LinkedIn has made the source code available in its GitHub repository.