Online or onsite, instructor-led live GPU (Graphics Processing Unit) training courses demonstrate the fundamentals of GPU programming through interactive discussion and hands-on practice.
GPU training is available as "online live training" or "onsite live training." Online live training (also known as "remote live training") is conducted via an interactive remote desktop. Onsite live training can be arranged at customer premises in Virginia or at Govtra corporate training centers in Virginia.
Govtra — Your Trusted Training Provider for Government
VA, Stafford - Quantico Corporate
800 Corporate Drive, Suite 301, Stafford, United States, 22554
The venue is located between Interstate 95 and the Jefferson Davis Highway, near the Courtyard by Marriott Stafford Quantico and the UMUC Quantico Corporate Center.
VA, Fredericksburg - Central Park Corporate Center
1320 Central Park Blvd., Suite 200, Fredericksburg, United States, 22401
The venue is located behind a complex of commercial buildings, with the Bank of America on the corner just before the turn leading to the office.
VA, Richmond - Two Paragon Place
Two Paragon Place, 6802 Paragon Place Suite 410, Richmond, United States, 23230
The venue is located in bustling Richmond with Hampton Inn, Embassy Suites and Westin Hotel less than a mile away.
VA, Reston - Sunrise Valley
12020 Sunrise Valley Dr #100, Reston, United States, 20191
The venue is located just behind the NCRA and Reston Plaza Cafe building and just next door to the United Healthcare building.
VA, Reston - Reston Town Center I
11921 Freedom Dr #550, Reston, United States, 20190
The venue is located in the Reston Town Center, near Chico's and the Artinsights Gallery of Film and Contemporary Art.
VA, Richmond - Sun Trust Center Downtown
919 E Main St, Richmond, United States, 23219
The venue is located in the Sun Trust Center at the intersection of E Main Street and 10th Street, directly opposite the 7-Eleven.
Richmond, VA – Regus at Two Paragon Place
6802 Paragon Place, Suite 410, Richmond, United States, 23230
The venue is located within the Two Paragon Place business campus off I‑295 and near Parham Road in North Richmond, offering convenient access by car with free on-site parking. Visitors arriving from Richmond International Airport (RIC), roughly 16 miles away, can expect a taxi or rideshare ride of around 20–25 minutes via I‑64 West and I‑295 North. Public transit is available via GRTC buses, with routes stopping along Parham Road and Quioccasin Road, a short walk from the campus.
Virginia Beach, VA – Regus at Windwood Center
780 Lynnhaven Parkway, Suite 400, Virginia Beach, United States, 23452
The venue is situated within the Windwood Center along Lynnhaven Parkway, featuring modern concrete-and-glass architecture and ample on-site parking. Easily accessible by car via Interstate 264 and the Virginia Beach Expressway, the facility offers a hassle-free commute. From Norfolk International Airport (ORF), located about 12 miles northwest, a taxi or rideshare typically takes 20–25 minutes via VA‑168 South and Edenvale Road. For those using public transit, the HRT bus system includes stops at Lynnhaven Parkway and surrounding streets, providing convenient access by bus.
The Huawei Ascend family of AI processors is designed for high-performance inference and training applications.
This instructor-led, live training (online or onsite) is aimed at intermediate-level AI engineers and data scientists who wish to develop and optimize neural network models using Huawei’s Ascend platform and the CANN toolkit. The course is tailored to align with public sector workflows, governance, and accountability requirements.
By the end of this training, participants will be able to:
- Set up and configure the CANN development environment.
- Develop AI applications using MindSpore and CloudMatrix workflows.
- Optimize performance on Ascend NPUs using custom operators and tiling techniques.
- Deploy models to edge or cloud environments, ensuring compliance with government standards.
Format of the Course
- Interactive lecture and discussion sessions.
- Hands-on use of Huawei Ascend and the CANN toolkit in sample applications relevant to government operations.
- Guided exercises focused on model building, training, and deployment within a governmental context.
Course Customization Options
To request a customized training for this course based on your specific infrastructure or datasets, please contact us to arrange. We can tailor the content to meet the unique needs of government agencies.
Huawei’s AI stack — from the low-level CANN SDK to the high-level MindSpore framework — provides a tightly integrated environment optimized for Ascend hardware, designed to support efficient AI development and deployment.
This instructor-led, live training (online or onsite) is aimed at technical professionals at beginner to intermediate levels who wish to gain a comprehensive understanding of how the CANN and MindSpore components work together to facilitate AI lifecycle management and infrastructure decisions.
By the end of this training, participants will be able to:
- Understand the layered architecture of Huawei’s AI compute stack for government.
- Identify how CANN supports model optimization and hardware-level deployment in various environments.
- Evaluate the MindSpore framework and toolchain in comparison to industry alternatives.
- Position Huawei's AI stack within enterprise or cloud/on-premises environments, ensuring alignment with public sector workflows and governance.
**Format of the Course**
- Interactive lecture and discussion.
- Live system demonstrations and case-based walkthroughs.
- Optional guided labs on model flow from MindSpore to CANN.
**Course Customization Options**
- To request a customized training for this course, please contact us to arrange.
This instructor-led, live training in Virginia (online or onsite) is aimed at beginner to intermediate developers who wish to use OpenACC to program heterogeneous devices and leverage their parallelism.
By the end of this training, participants will be able to:
- Set up an OpenACC development environment.
- Write and run a basic OpenACC program.
- Annotate code with OpenACC directives and clauses.
- Utilize OpenACC API and libraries.
- Profile, debug, and optimize OpenACC programs for government applications.
The CANN SDK (Compute Architecture for Neural Networks) provides robust deployment and optimization tools for real-time AI applications in computer vision and natural language processing, particularly on Huawei Ascend hardware.
This instructor-led, live training (online or onsite) is designed for intermediate-level AI practitioners who aim to build, deploy, and optimize vision and language models using the CANN SDK for government production use cases.
By the end of this training, participants will be able to:
- Deploy and optimize computer vision and natural language processing models using CANN and AscendCL.
- Utilize CANN tools to convert models and integrate them into live pipelines.
- Enhance inference performance for tasks such as detection, classification, and sentiment analysis.
- Construct real-time computer vision and natural language processing pipelines suitable for edge or cloud-based deployment scenarios.
**Format of the Course**
- Interactive lecture and demonstration.
- Hands-on laboratory sessions with model deployment and performance profiling.
- Live pipeline design using real-world computer vision and natural language processing use cases.
**Course Customization Options**
- To request a customized training for this course, please contact us to arrange.
This instructor-led, live training in Virginia (online or onsite) is aimed at beginner to intermediate-level developers who wish to learn the basics of GPU programming and the main frameworks and tools for developing GPU applications for government.
By the end of this training, participants will be able to:
- Understand the difference between CPU and GPU computing, including the benefits and challenges of GPU programming in a public sector context.
- Select the appropriate framework and tool for their GPU application development needs.
- Create a basic GPU program that performs vector addition using one or more of the selected frameworks and tools.
- Utilize the respective APIs, languages, and libraries to query device information, manage device memory allocation and deallocation, transfer data between host and device, launch kernels, and synchronize threads for efficient application performance.
- Leverage various memory spaces, such as global, local, constant, and private, to optimize data transfers and memory access patterns.
- Control parallelism using execution models, including work-items, work-groups, threads, blocks, and grids, to enhance computational efficiency.
- Debug and test GPU programs using tools such as CodeXL, CUDA-GDB, CUDA-MEMCHECK, and NVIDIA Nsight to ensure robustness and reliability in government applications.
- Optimize GPU programs using techniques such as coalescing, caching, prefetching, and profiling to achieve maximum performance and efficiency for government use cases.
CANN TIK (Tensor Instruction Kernel) and Apache TVM facilitate advanced optimization and customization of AI model operators for Huawei Ascend hardware.
This instructor-led, live training (online or onsite) is designed for advanced-level system developers who aim to build, deploy, and tune custom operators for AI models using CANN’s TIK programming model and TVM compiler integration.
By the end of this training, participants will be able to:
- Write and test custom AI operators using the TIK DSL for Ascend processors.
- Integrate custom operators into the CANN runtime and execution graph.
- Use TVM for operator scheduling, auto-tuning, and benchmarking.
- Debug and optimize instruction-level performance for custom computation patterns.
**Format of the Course**
- Interactive lecture and demonstration.
- Hands-on coding of operators using TIK and TVM pipelines.
- Testing and tuning on Ascend hardware or simulators.
**Course Customization Options**
- To request a customized training for government, please contact us to arrange.
This instructor-led, live training in [location] (online or onsite) is designed for government developers at beginner to intermediate levels who wish to explore various frameworks for GPU programming and compare their features, performance, and compatibility.
By the end of this training, participants will be able to:
- Set up a development environment that includes the OpenCL SDK, CUDA Toolkit, ROCm Platform, a device that supports OpenCL, CUDA, or ROCm, and Visual Studio Code.
- Develop a basic GPU program that performs vector addition using OpenCL, CUDA, and ROCm, and compare the syntax, structure, and execution of each framework.
- Utilize the respective APIs to query device information, manage device memory allocation and deallocation, transfer data between host and device, launch kernels, and synchronize threads.
- Write kernels in the respective languages that execute on the device and manipulate data.
- Employ built-in functions, variables, and libraries to perform common tasks and operations.
- Optimize data transfers and memory accesses using the respective memory spaces, such as global, local, constant, and private.
- Control parallelism through the use of threads, blocks, and grids in the respective execution models.
- Debug and test GPU programs using tools like CodeXL, CUDA-GDB, CUDA-MEMCHECK, and NVIDIA Nsight.
- Enhance performance with techniques such as coalescing, caching, prefetching, and profiling.
This training is tailored to meet the specific needs of developers working for government agencies, ensuring they have the skills necessary to leverage GPU programming effectively in their projects.
CloudMatrix is Huawei’s unified artificial intelligence (AI) development and deployment platform, designed to support scalable, production-grade inference pipelines.
This instructor-led, live training (online or onsite) is aimed at beginner-level to intermediate-level AI professionals who wish to deploy and monitor AI models using the CloudMatrix platform with CANN and MindSpore integration for government applications.
By the end of this training, participants will be able to:
- Utilize CloudMatrix for model packaging, deployment, and serving.
- Convert and optimize models for Ascend chipsets.
- Establish pipelines for real-time and batch inference tasks.
- Monitor deployments and adjust performance in production settings.
Format of the Course
- Interactive lectures and discussions.
- Hands-on use of CloudMatrix with practical deployment scenarios.
- Guided exercises focused on conversion, optimization, and scaling.
Course Customization Options
To request a customized training for this course based on your specific AI infrastructure or cloud environment, please contact us to arrange.
The Huawei Ascend CANN toolkit facilitates robust AI inference on edge devices such as the Ascend 310. CANN provides critical tools for compiling, optimizing, and deploying models in environments with limited compute and memory resources.
This instructor-led, live training (online or onsite) is designed for intermediate-level AI developers and integrators who aim to deploy and optimize models on Ascend edge devices using the CANN toolchain.
By the end of this training, participants will be able to:
- Prepare and convert AI models for deployment on the Ascend 310 using CANN tools.
- Construct lightweight inference pipelines utilizing MindSpore Lite and AscendCL.
- Enhance model performance in compute- and memory-constrained environments.
- Deploy and monitor AI applications in practical edge scenarios.
Format of the Course
- Interactive lecture and demonstration.
- Hands-on lab work with edge-specific models and scenarios.
- Live deployment examples on virtual or physical edge hardware.
Course Customization Options for Government
To request a customized training for this course, please contact us to arrange.
This instructor-led, live training (online or onsite) is designed for government developers at the beginner to intermediate level who wish to install and use ROCm on Windows to program AMD GPUs and leverage their parallel processing capabilities.
By the end of this training, participants will be able to:
- Set up a development environment that includes the ROCm Platform, an AMD GPU, and Visual Studio Code on Windows.
- Create a basic ROCm program that performs vector addition on the GPU and retrieves results from GPU memory.
- Utilize the ROCm API to query device information, manage device memory allocation and deallocation, transfer data between host and device, launch kernels, and synchronize threads.
- Write kernels using HIP language that execute on the GPU and manipulate data.
- Employ HIP built-in functions, variables, and libraries to perform common tasks and operations.
- Optimize data transfers and memory accesses by leveraging ROCm and HIP memory spaces such as global, shared, constant, and local.
- Control parallelism through ROCm and HIP execution models, defining threads, blocks, and grids.
- Debug and test ROCm and HIP programs using tools like the ROCm Debugger and ROCm Profiler.
- Optimize ROCm and HIP programs with techniques including coalescing, caching, prefetching, and profiling.
This training is tailored to enhance the skills of developers for government projects that require efficient and scalable GPU programming.
This instructor-led, live training in [location] (online or onsite) is designed for beginner to intermediate developers who wish to use ROCm and HIP to program AMD GPUs and leverage their parallel processing capabilities.
By the end of this training, participants will be able to:
- Set up a development environment that includes the ROCm Platform, an AMD GPU, and Visual Studio Code.
- Create a basic ROCm program that performs vector addition on the GPU and retrieves the results from GPU memory.
- Utilize the ROCm API to query device information, manage device memory allocation and deallocation, transfer data between host and device, launch kernels, and synchronize threads.
- Write HIP language kernels that execute on the GPU and manipulate data.
- Employ HIP built-in functions, variables, and libraries to perform common tasks and operations.
- Optimize data transfers and memory accesses using ROCm and HIP memory spaces, such as global, shared, constant, and local.
- Control parallelism through ROCm and HIP execution models by defining threads, blocks, and grids.
- Debug and test ROCm and HIP programs using tools like the ROCm Debugger and ROCm Profiler.
- Enhance ROCm and HIP program performance using techniques such as coalescing, caching, prefetching, and profiling.
This training is tailored to align with public sector workflows, governance, and accountability standards for government.
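The "basic ROCm program" objective above can be pictured with a minimal HIP vector-addition sketch. This is an illustrative example, not course material: it requires the ROCm toolchain (`hipcc`) and an AMD GPU to build and run, and error checking on the API calls is omitted for brevity.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device memory and copy inputs host -> device.
    float *da, *db, *dc;
    hipMalloc(&da, bytes); hipMalloc(&db, bytes); hipMalloc(&dc, bytes);
    hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);
    hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);

    // Launch the kernel over a 1-D grid of thread blocks.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    hipLaunchKernelGGL(vecAdd, dim3(blocks), dim3(threads), 0, 0, da, db, dc, n);
    hipDeviceSynchronize();

    // Copy the result back and free resources.
    hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]); // 1.0 + 2.0

    hipFree(da); hipFree(db); hipFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}
```

The allocation, transfer, launch, and synchronization steps here map directly onto the ROCm API objectives listed above.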
The Compute Architecture for Neural Networks (CANN) is Huawei’s AI computing toolkit designed for compiling, optimizing, and deploying AI models on Ascend AI processors.
This instructor-led, live training (available online or onsite) is aimed at beginner-level AI developers who wish to understand how CANN fits into the model lifecycle from training to deployment, and how it integrates with frameworks such as MindSpore, TensorFlow, and PyTorch.
By the end of this training, participants will be able to:
- Understand the purpose and architecture of the CANN toolkit.
- Set up a development environment with CANN and MindSpore.
- Convert and deploy a simple AI model to Ascend hardware.
- Gain foundational knowledge for future CANN optimization or integration projects, including those for government applications.
Format of the Course
- Interactive lecture and discussion.
- Hands-on labs with simple model deployment.
- Step-by-step walkthrough of the CANN toolchain and integration points.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
Ascend, Biren, and Cambricon are leading AI hardware platforms in China, each providing specialized acceleration and profiling tools for production-scale AI workloads.
This instructor-led, live training (available online or onsite) is designed for advanced-level AI infrastructure and performance engineers who seek to optimize model inference and training workflows across multiple Chinese AI chip platforms.
By the end of this training, participants will be able to:
- Evaluate models on Ascend, Biren, and Cambricon platforms.
- Identify system bottlenecks and inefficiencies in memory and compute performance.
- Implement graph-level, kernel-level, and operator-level optimizations.
- Refine deployment pipelines to enhance throughput and reduce latency.
Format of the Course
- Interactive lecture and discussion sessions.
- Hands-on use of profiling and optimization tools for each platform.
- Guided exercises focusing on practical tuning scenarios.
Course Customization Options
To request a customized training for government or other specific environments based on your performance needs or model type, please contact us to arrange.
The CANN SDK (Compute Architecture for Neural Networks) is Huawei’s AI computing foundation designed to enable developers to fine-tune and optimize the performance of deployed neural networks on Ascend AI processors.
This instructor-led, live training (available online or onsite) is targeted at advanced-level AI developers and system engineers who seek to enhance inference performance using CANN’s comprehensive toolset, which includes the Graph Engine, TIK, and custom operator development.
By the end of this training, participants will be able to:
- Understand the runtime architecture and performance lifecycle of CANN.
- Utilize profiling tools and the Graph Engine for performance analysis and optimization.
- Develop and optimize custom operators using TIK and TVM.
- Address memory bottlenecks and improve model throughput.
**Format of the Course:**
- Interactive lecture and discussion.
- Hands-on labs with real-time profiling and operator tuning.
- Optimization exercises using edge-case deployment examples.
**Course Customization Options:**
- To request a customized training for government or other specific needs, please contact us to arrange.
Chinese GPU architectures, including Huawei Ascend, Biren, and Cambricon MLUs, provide viable alternatives to CUDA, specifically designed for the local AI and HPC markets.
This instructor-led, live training (online or onsite) is targeted at advanced-level GPU programmers and infrastructure specialists who are looking to migrate and optimize existing CUDA applications for deployment on Chinese hardware platforms.
By the end of this training, participants will be able to:
- Evaluate the compatibility of current CUDA workloads with Chinese chip alternatives.
- Port CUDA codebases to Huawei CANN, Biren SDK, and Cambricon BANGPy environments.
- Compare performance metrics and identify optimization opportunities across different platforms.
- Address practical challenges in supporting and deploying applications across multiple architectures.
Format of the Course
- Interactive lectures and discussions.
- Hands-on code translation and performance comparison labs.
- Guided exercises focusing on multi-GPU adaptation strategies.
Course Customization Options for Government
To request a customized training for this course based on your specific platform or CUDA project, please contact us to arrange.
This instructor-led, live training in [location] (online or onsite) is aimed at beginner to intermediate level developers who wish to use CUDA to program NVIDIA GPUs and leverage their parallel processing capabilities for government applications.
By the end of this training, participants will be able to:
- Set up a development environment that includes the CUDA Toolkit, an NVIDIA GPU, and Visual Studio Code.
- Create a basic CUDA program that performs vector addition on the GPU and retrieves the results from GPU memory.
- Use the CUDA API to query device information, allocate and deallocate device memory, transfer data between host and device, launch kernels, and synchronize threads.
- Write CUDA C/C++ language kernels that execute on the GPU and manipulate data.
- Utilize CUDA built-in functions, variables, and libraries to perform common tasks and operations.
- Employ CUDA memory spaces, such as global, shared, constant, and local, to optimize data transfers and memory accesses.
- Control the threads, blocks, and grids that define the parallelism using the CUDA execution model.
- Debug and test CUDA programs using tools such as CUDA-GDB, CUDA-MEMCHECK, and NVIDIA Nsight.
- Optimize CUDA programs using techniques such as coalescing, caching, prefetching, and profiling.
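The vector-addition objective above can be sketched in CUDA C++ as follows. This is an illustrative example, not course material; it requires the CUDA Toolkit (`nvcc`) and an NVIDIA GPU, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one element of the sum.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device memory and transfer inputs host -> device.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch: enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);
    cudaDeviceSynchronize();

    // Retrieve the result from GPU memory.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]); // 1.0 + 2.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

The allocate/copy/launch/synchronize/copy-back pattern shown here is the skeleton that the memory-space and optimization objectives in the list refine.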
97% of clients report satisfaction with this training.
The Compute Architecture for Neural Networks (CANN) is Huawei’s AI computing stack designed for deploying and optimizing AI models on Ascend AI processors.
This instructor-led, live training (online or onsite) is aimed at intermediate-level AI developers and engineers who wish to deploy trained AI models efficiently to Huawei Ascend hardware using the CANN toolkit and tools such as MindSpore, TensorFlow, or PyTorch.
By the end of this training, participants will be able to:
- Understand the CANN architecture and its role in the AI deployment pipeline.
- Convert and adapt models from popular frameworks to Ascend-compatible formats.
- Utilize tools like ATC, OM model conversion, and MindSpore for edge and cloud inference.
- Diagnose deployment issues and optimize performance on Ascend hardware.
**Format of the Course**
- Interactive lecture and demonstration.
- Hands-on lab work using CANN tools and Ascend simulators or devices.
- Practical deployment scenarios based on real-world AI models.
**Course Customization Options for Government**
To request a customized training for this course, please contact us to arrange.
Biren AI Accelerators are high-performance GPUs designed for artificial intelligence and high-performance computing (HPC) workloads, with robust support for large-scale training and inference.
This instructor-led, live training (available online or onsite) is aimed at intermediate to advanced developers who wish to program and optimize applications using Biren’s proprietary GPU stack. The course also includes practical comparisons to CUDA-based environments.
By the end of this training, participants will be able to:
- Understand Biren GPU architecture and memory hierarchy.
- Set up the development environment and use Biren’s programming model.
- Translate and optimize CUDA-style code for Biren platforms.
- Apply performance tuning and debugging techniques.
**Format of the Course**
- Interactive lecture and discussion.
- Hands-on use of the Biren SDK in sample GPU workloads.
- Guided exercises focused on porting and performance tuning.
**Course Customization Options**
To request a customized training for government or based on your specific application stack or integration needs, please contact us to arrange.
Cambricon MLUs (Machine Learning Units) are specialized AI chips designed for optimizing inference and training in both edge and data center environments.
This instructor-led, live training (available online or onsite) is tailored for intermediate-level developers who aim to construct and deploy AI models using the BANGPy framework and Neuware SDK on Cambricon MLU hardware.
By the end of this training, participants will be able to:
- Set up and configure the BANGPy and Neuware development environments for government applications.
- Develop and optimize Python- and C++-based models for deployment on Cambricon MLUs.
- Deploy models to edge and data center devices running the Neuware runtime.
- Integrate machine learning workflows with MLU-specific acceleration features to enhance performance.
Format of the Course
- Interactive lecture and discussion sessions.
- Hands-on practice using BANGPy and Neuware for development and deployment tasks.
- Guided exercises focused on optimization, integration, and testing to ensure robust model performance.
Course Customization Options
To request a customized training for this course based on specific Cambricon device models or use cases, please contact us to arrange.
This instructor-led, live training in [location] (online or onsite) is designed for beginner-level system administrators and IT professionals who wish to install, configure, manage, and troubleshoot CUDA environments for government use.
By the end of this training, participants will be able to:
- Understand the architecture, components, and capabilities of CUDA.
- Install and configure CUDA environments.
- Manage and optimize CUDA resources.
- Debug and troubleshoot common CUDA issues.
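The install-and-verify objectives above typically start with a few standard commands. This is a sketch of common checks on a Linux system; exact paths and versions vary by installation.

```shell
# Check that the driver sees the GPU and report driver/CUDA versions
nvidia-smi

# Confirm the CUDA compiler is installed and on the PATH
nvcc --version

# Typical environment setup (the /usr/local/cuda symlink usually
# points at the installed version, e.g. /usr/local/cuda-12.x)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

If `nvidia-smi` fails while `nvcc` works (or vice versa), that points to a driver versus toolkit problem, which is the kind of troubleshooting the course covers.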
This instructor-led, live training in Virginia (online or onsite) is designed for beginner to intermediate developers who wish to use OpenCL to program heterogeneous devices and leverage their parallel processing capabilities.
By the end of this training, participants will be able to:
- Set up a development environment that includes the OpenCL SDK, a device compatible with OpenCL, and Visual Studio Code.
- Develop a basic OpenCL program that performs vector addition on the device and retrieves results from device memory.
- Utilize the OpenCL API to query device information, create contexts, command queues, buffers, kernels, and events.
- Write kernels using the OpenCL C language to execute tasks on the device and manipulate data.
- Employ OpenCL built-in functions, extensions, and libraries to perform common operations and tasks.
- Optimize data transfers and memory accesses using OpenCL host and device memory models.
- Control work-items, work-groups, and ND-ranges using the OpenCL execution model.
- Debug and test OpenCL programs with tools such as CodeXL, Intel VTune, and NVIDIA Nsight.
- Enhance OpenCL program performance using techniques like vectorization, loop unrolling, local memory usage, and profiling.
This training is tailored to support developers in enhancing their skills for government projects that require efficient parallel processing.
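The device-side half of the vector-addition objective is an OpenCL C kernel such as the illustrative sketch below. On its own it does nothing: host code must load it as a string, compile it at run time with `clBuildProgram`, and enqueue it over an ND-range.

```opencl
// OpenCL C kernel: each work-item adds one pair of elements.
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c,
                      const int n) {
    int i = get_global_id(0);   // this work-item's index in the ND-range
    if (i < n) c[i] = a[i] + b[i];
}
```

The `__global` address-space qualifiers and `get_global_id` call here correspond directly to the memory-model and execution-model objectives in the list above.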
This instructor-led, live training in Virginia (online or onsite) is aimed at intermediate-level developers who wish to use CUDA to build Python applications that run in parallel on NVIDIA GPUs for government projects.
By the end of this training, participants will be able to:
- Leverage the Numba compiler to accelerate Python applications running on NVIDIA GPUs.
- Develop, compile, and deploy custom CUDA kernels.
- Effectively manage GPU memory.
- Transform a CPU-based application into a GPU-accelerated application suitable for government operations.
This instructor-led, live training course in [location] is designed to provide government participants with comprehensive knowledge of GPU programming for parallel computing. The curriculum covers the utilization of various platforms, including an in-depth exploration of the CUDA platform and its features. Participants will learn how to implement optimization techniques using CUDA. Practical applications discussed during the course include deep learning, analytics, image processing, and engineering solutions tailored for government workflows.
Testimonials (2)
Very interactive with various examples, with a good progression in complexity between the start and the end of the training.
Jenny - Andheo
Course - GPU Programming with CUDA and Python
Trainer's energy and humor.
Tadeusz Kaluba - Nokia Solutions and Networks Sp. z o.o.