Course Outline
EXO Infrastructure as Code
- Overview of EXO deployment patterns for government use cases: single-node, multi-node, and RDMA clusters.
- Automating dependency installation (Xcode, uv, Node.js, Rust) with configuration management tools.
- Using Nix flakes for reproducible EXO builds and developer environments that stay consistent across deployments.
- Writing Ansible playbooks or shell scripts for unattended cluster provisioning in data centers and cloud infrastructure.
Reproducible Builds and CI Integration
- Pinning dependencies and building the dashboard within CI pipelines for reliable, traceable builds.
- Running EXO smoke tests in GitHub Actions or GitLab CI runners.
- Creating golden images and snapshot-based rollback workflows for macOS and Linux VMs to enable rapid recovery to a known state.
- Versioning custom model cards alongside application code so deployments remain auditable.
Cluster Discovery and Networking Automation
- Configuring mDNS and static DNS for reliable libp2p node discovery.
- Automating network profile creation and Thunderbolt bridge management on macOS.
- Using custom namespaces (EXO_LIBP2P_NAMESPACE) to separate development, staging, and production clusters.
- Implementing firewall rules and network segmentation for multi-tenant environments in line with government security policies.
Storage and Model Lifecycle Management
- Designing EXO_MODELS_DIRS and EXO_MODELS_READ_ONLY_DIRS strategies for model storage and access.
- Mounting NFS or SAN shares as read-only model repositories for fast provisioning.
- Implementing garbage collection of stale caches and retention policies for versioned weights.
- Automating model pre-downloads and health checks before rolling updates to keep services available.
Monitoring and Alerting
- Shipping EXO logs to centralized logging solutions (ELK, Loki, or Splunk) for visibility and auditing.
- Building Grafana dashboards from EXO_TRACING_ENABLED output to monitor performance.
- Setting up alerts for cluster membership changes, out-of-memory events, and inference latency spikes.
- Correlating macmon hardware telemetry with model performance regressions.
Update, Rollback, and Disaster Recovery
- Staging EXO binary updates on a canary node before fleet-wide rollout.
- Implementing model-level rollback: switching between quantized versions without re-downloading.
- Backing up and restoring cluster state, custom namespaces, and cached weights for recovery from system failure or data loss.
- Documenting recovery runbooks for total cluster rebuild scenarios.
Security Hardening and Compliance
- Applying TLS at the reverse proxy layer (nginx, Traefik) for the dashboard and API.
- Implementing API rate limiting and IP whitelisting for EXO endpoints to prevent abuse and restrict access to authorized clients.
- Isolating clusters with VLANs and zero-trust network policies.
- Auditing access and maintaining an inventory of deployed models and versions.
Requirements
- Experience with DevOps practices, including continuous integration and deployment (CI/CD), infrastructure as code (IaC), and container orchestration.
- Familiarity with system administration and package management in macOS or Linux environments.
- Understanding of networking principles, Domain Name System (DNS) operations, and storage solutions.
Audience
- DevOps engineers
- Infrastructure architects
- SREs responsible for managing on-premise AI workloads
21 Hours