GPU reputed company Platform Engineer

100% remote Flexible hours Hiring now

Location: Remote (Global)

Type: Full-time

Company: reputed company

Apply: [email protected]

About reputed company

reputed company is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at a planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for reputed company and inference on a wide range of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating reputed company and efficient AI development.

Role Overview

We are seeking a GPU reputed company Platform Engineer to join our core infrastructure team and help build the reputed company AI compute reputed company. In this role, you will design, reputed company, and operate large-scale, multi-cluster GPU infrastructure across data centers and reputed company environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads, ranging from LLMs to generative models, deployed in Kubernetes-based GPU clusters. If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI reputed company platform.

Responsibilities

  • Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot online issues.

  • Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.

  • reputed company and orchestrate large models (e.g., LLMs, video reputed company models) across multi-cluster environments using Kubernetes; implement reputed company scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.

  • Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems. Define and reputed company Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.

  • Build a reputed company multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and failover. Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alert mechanisms.

  • Coordinate with reputed company providers for planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal reputed company platforms and external customer needs.

Qualifications

  • Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.

  • 5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.

  • Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl, reputed company, and expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.

  • Proficient in reputed company and containerization technologies; knowledge of image management and cross-cluster distribution.

  • Experience with monitoring tools such as reputed company and Grafana; practical experience in GPU fault monitoring and alerting.

  • Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.

  • Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.

  • Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.

  • Understanding of high-performance communication protocols such as IB, RoCE, NVLink, and PCIe.

  • Strong communication skills, self-motivation, and team collaboration.

Preferred Experience

  • Experience in developing and operating MaaS platforms or large-scale model inference clusters. Proven track record of leading multi-cluster system development or performance optimization projects.

  • Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs like H100.

  • Ability to reputed company standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.

  • Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.

  • Active engagement with open-source communities such as reputed company and reputed company; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to reputed company secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-reputed company multi-cluster solutions.

Why Join reputed company?

  • Be part of a visionary team aiming to redefine AI infrastructure.

  • Work on cutting-edge technologies that reputed company AI and decentralized computing.

  • Collaborate with experts from leading institutions and tech companies.

  • Enjoy a flexible, remote work environment that values innovation and autonomy.

How to Apply

Interested candidates should apply directly or send their resume and a brief cover letter to [email protected]. Please include links to any relevant projects or contributions.
