Course Outline
Introduction
- How Site Reliability Engineering (SRE) integrates traditional IT practices with modern software development methodologies.
- The critical importance of automation and observability in ensuring system reliability.
- The distinct roles and responsibilities of software engineers and system administrators within the SRE framework.
- The differences between Site Reliability Engineers and DevOps engineers, highlighting their complementary roles.
Overview of an IT System
- Architectural considerations for both on-premises and cloud-based systems.
Overview of SRE Principles and Practices
- The concept of Infrastructure as Code (IaC) and its role in maintaining system reliability.
- The significance of containerization and orchestration technologies, such as Docker and Kubernetes, in modern IT environments.
- Continuous Integration, Continuous Deployment, and Continuous Delivery (CI/CD) practices to streamline development and operations.
- The importance of observability in monitoring and maintaining system health.
Evaluating an IT System
- Assessing the current team and organizational resources for implementing SRE practices.
- Mapping out existing systems and processes to identify areas for improvement.
- Estimating the potential impact of adopting SRE methodologies on system reliability and efficiency.
- Defining the role of the software engineering team in supporting SRE initiatives.
- Understanding the responsibilities of the operational team in maintaining system reliability.
- The role of management in driving and supporting SRE adoption within the organization.
Maintaining the Reliability of a System
- Defining and measuring the desired reliability levels for services.
- Understanding Service Level Objectives (SLOs) and their importance in setting reliability targets.
- Exploring Service Level Indicators (SLIs) and Service Level Agreements (SLAs) to ensure service performance.
- Utilizing error budgets to manage risk and maintain system reliability.
- Developing effective SLOs to guide system improvement efforts.
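The relationship between SLIs, SLOs, and error budgets listed above can be illustrated with a small calculation. The following is a minimal sketch rather than course material: it assumes a request-based availability SLI (successful requests divided by total requests), and the function name `error_budget` and the sample numbers are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget consumption for a request-based SLO.

    Assumption: the SLI is the fraction of successful requests in the
    measurement window, and the SLO is a target for that fraction.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the measured reliability for the window.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests

    # Fraction of the budget already spent (values above 1.0 mean it is exhausted).
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
    report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    for key, value in report.items():
        print(f"{key}: {value}")
```

In this hypothetical window the service has spent about 40% of its error budget, so the team could still take on some release risk before the SLO is threatened.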
Optimizing System Administration
- Establishing a robust development environment for efficient system administration.
- Evaluating and selecting appropriate SRE tools to enhance operational capabilities.
- Prioritizing tasks for automation to improve efficiency and reduce manual errors.
- Writing high-quality software to support system reliability and performance.
Deploying "Infrastructure as Code"
- Testing and iteratively refining code to ensure robustness and reliability.
- Designing systems to be anti-fragile, capable of improving under stress.
- Learning from system failures to continuously improve practices and processes.
Monitoring a System
- Observing and analyzing system performance to identify issues and optimize operations.
- Utilizing SRE tools and techniques for effective monitoring and troubleshooting.
The Future of SRE
Summary and Conclusion
Requirements
- A fundamental understanding of IT infrastructure for government.
- An overview of the software development process.
- Experience in programming or scripting in any language.
Audience
- Developers
- System Administrators
- Software Architects
- DevOps Engineers
- IT Managers
Testimonials (7)
How detailed subjects are explained with real world examples
Brian Hlabane - African Bank
Course - Site Reliability Engineering (SRE) Fundamentals
She is an expert in the area and provides really nice training. The material and training were really a mix of examples, discussion and
Peter Tutka - Deutsche Telekom IT & Telecommunications Slovakia s.r.o.
Course - Site Reliability Engineering (SRE) Fundamentals
A view of SRE/DevOps from a more business/theoretical point of view. Most helpful for people who already have the practical view.
Michael Varhol - Deutsche Telekom IT & Telecommunications Slovakia s.r.o.
Course - Site Reliability Engineering (SRE) Fundamentals
The approach of sending a questionnaire before the training, so that the training was planned according to expectations. It makes the participants more active.
Stefan Girman - Deutsche Telekom IT & Telecommunications Slovakia s.r.o.
Course - Site Reliability Engineering (SRE) Fundamentals
Sticking to the initial survey from attendees about what should be the focus of training.
Denis Majorsky - Deutsche Telekom IT & Telecommunications Slovakia s.r.o.
Course - Site Reliability Engineering (SRE) Fundamentals
Discussions, SRE definition
Daniel Horvath - Deutsche Telekom IT & Telecommunications Slovakia s.r.o.
Course - Site Reliability Engineering (SRE) Fundamentals
The concept of the training, keeping people focused by asking them questions and triggering discussions. Also, the group breakout sessions were great for thinking about things in groups and seeing different outcomes from other groups.