Course Outline

EXO Infrastructure as Code

  • Overview of EXO deployment patterns for government use cases: single-node, multi-node, and RDMA clusters.
  • Automating dependency installation (Xcode, uv, Node.js, Rust) with configuration management tools.
  • Using Nix flakes for reproducible EXO builds and developer environments that stay consistent across deployments.
  • Writing Ansible playbooks or shell scripts for unattended cluster provisioning in government data centers and cloud infrastructure.
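
An unattended provisioning script typically starts with a preflight check that the toolchain is present. The following is a minimal sketch of that check. The executable names (`xcodebuild`, `uv`, `node`, `cargo`) are assumptions about how the listed dependencies appear on PATH, not names taken from EXO's documentation.

```python
import shutil

# Assumed executable names for the dependencies named in the outline
# (Xcode, uv, Node.js, Rust). Adjust to match the target hosts.
REQUIRED_TOOLS = ("xcodebuild", "uv", "node", "cargo")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]
```

A playbook or shell wrapper could call `missing_tools()` and abort before any installation step when the returned list is not empty. The `which` parameter exists only so the lookup can be replaced in tests.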

Reproducible Builds and CI Integration

  • Pinning dependencies and building the dashboard within CI pipelines for reliable, traceable builds.
  • Running EXO smoke tests in GitHub Actions or GitLab CI runners.
  • Creating golden images and snapshot-based rollback workflows for macOS and Linux VMs to support rapid recovery to a known state.
  • Versioning custom model cards alongside application code so that deployments remain auditable.
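
One way to make versioned model cards auditable in CI is to record a content hash per file and compare manifests between commits. The sketch below assumes the cards live as ordinary files in a directory of the repository. The function names are hypothetical, and nothing here depends on EXO's actual card format.

```python
import hashlib
from pathlib import Path


def card_manifest(card_dir):
    """Map each file under card_dir (relative POSIX path) to its SHA-256 hex digest."""
    root = Path(card_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_cards(old, new):
    """Return the sorted paths that were added, removed, or modified between two manifests."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

A CI job could store the manifest as a build artifact and fail the pipeline when `changed_cards` reports a change that has no matching version bump.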

Cluster Discovery and Networking Automation

  • Configuring mDNS and static DNS for reliable libp2p node discovery.
  • Automating network profile creation and Thunderbolt bridge management on macOS.
  • Using custom namespaces (EXO_LIBP2P_NAMESPACE) to separate development, staging, and production clusters.
  • Implementing firewall rules and network segmentation for multi-tenant environments in line with government security policies.
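
Namespace separation can be automated by deriving the process environment from the deployment stage. In this sketch only the variable name `EXO_LIBP2P_NAMESPACE` comes from the outline. The namespace values and the `exo_env` helper are hypothetical.

```python
import os

# Hypothetical naming scheme, one namespace per deployment stage.
NAMESPACES = {"dev": "exo-dev", "staging": "exo-staging", "prod": "exo-prod"}


def exo_env(environment, base_env=None):
    """Return a copy of the environment with the namespace for one deployment stage set."""
    if environment not in NAMESPACES:
        raise ValueError(f"unknown environment: {environment!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["EXO_LIBP2P_NAMESPACE"] = NAMESPACES[environment]
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when a provisioning script launches a node. Rejecting unknown stages keeps a typo from placing a node in an unintended cluster.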

Storage and Model Lifecycle Management

  • Designing EXO_MODELS_DIRS and EXO_MODELS_READ_ONLY_DIRS strategies for model storage and access.
  • Mounting NFS or SAN shares as read-only model repositories for fast provisioning.
  • Implementing garbage collection of stale caches and retention policies for versioned weights.
  • Automating model pre-downloads and health checks before rolling updates to maintain availability.
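
A retention policy for versioned weights often reduces to "keep the N newest versions of each model". The sketch below shows only that selection step. It assumes the caller has already collected `(name, modification_time)` pairs for one model from a writable cache directory, and it makes no assumption about how EXO lays out its cache.

```python
def select_stale(versions, keep=2):
    """Return the names to delete, keeping the `keep` most recently modified versions.

    `versions` is an iterable of (name, mtime) pairs for a single model.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    newest_first = sorted(versions, key=lambda v: v[1], reverse=True)
    return [name for name, _ in newest_first[keep:]]
```

Keeping selection separate from deletion makes a dry-run mode straightforward: log the returned names first, and delete only after review.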

Monitoring and Alerting

  • Shipping EXO logs to centralized logging solutions (ELK, Loki, or Splunk) for visibility and auditing.
  • Building Grafana dashboards from EXO_TRACING_ENABLED output to monitor performance.
  • Setting up alerts for cluster membership changes, out-of-memory events, and inference latency spikes.
  • Correlating macmon hardware telemetry with model performance regressions.
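
A latency-spike alert needs a baseline to compare against. One simple option, shown here as an illustration and not as anything built into EXO, is to flag a sample when it exceeds a multiple of the median of the preceding window. The window size and factor are arbitrary starting values.

```python
from statistics import median


def latency_spikes(samples_ms, window=5, factor=3.0):
    """Return indices whose latency exceeds `factor` times the median of the previous `window` samples."""
    spikes = []
    for i in range(window, len(samples_ms)):
        baseline = median(samples_ms[i - window:i])
        if samples_ms[i] > factor * baseline:
            spikes.append(i)
    return spikes
```

The median is used instead of the mean so that a single earlier spike does not inflate the baseline and mask the next one.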

Update, Rollback, and Disaster Recovery

  • Staging EXO binary updates on a canary node before fleet-wide rollout.
  • Implementing model-level rollback: switching between quantized versions without re-downloading.
  • Backing up and restoring cluster state, custom namespaces, and cached weights.
  • Documenting recovery runbooks for total cluster rebuild scenarios.
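
The canary pattern can be expressed as an ordered rollout that halts at the first failed health check. In this sketch `update` and `healthy` are hypothetical callables supplied by the operator (for example, wrappers around SSH commands), and the first entry in `nodes` acts as the canary.

```python
def staged_rollout(nodes, update, healthy):
    """Update nodes in order and stop at the first one that fails its health check.

    Returns (updated, failed): the nodes that passed, and the node that failed or None.
    """
    updated = []
    for node in nodes:
        update(node)
        if not healthy(node):
            return updated, node
        updated.append(node)
    return updated, None
```

If the canary fails, the function returns before any other node is touched, so only that one node needs to be rolled back.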

Security Hardening and Compliance

  • Applying TLS at the reverse proxy layer (nginx, Traefik) for the dashboard and API.
  • Implementing API rate limiting and IP allowlisting for EXO endpoints.
  • Isolating clusters with VLANs and zero-trust network policies to meet government regulations and standards.
  • Auditing access and maintaining an inventory of deployed models and versions.
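
Rate limiting is usually enforced at the reverse proxy, but the underlying idea is easy to show in isolation. The following is a minimal token-bucket sketch, one common rate-limiting technique. It does not describe how EXO or any particular proxy implements limits, and the class name is illustrative.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Consume one token and return True, or return False if the bucket is empty."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A gateway would keep one bucket per client address and reject a request when `allow()` returns False. The injectable `clock` makes the refill behavior testable without sleeping.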

Requirements

  • Experience with DevOps practices, including continuous integration and deployment (CI/CD), infrastructure as code (IaC), and container orchestration.
  • Familiarity with system administration and package management in macOS or Linux environments.
  • Understanding of networking principles, Domain Name System (DNS) operations, and storage solutions.

Audience

  • DevOps engineers
  • Infrastructure architects
  • SREs responsible for managing on-premises AI workloads

Duration: 21 hours