- Paperback: 552 pages
- Publisher: O'Reilly Media; 1 edition (April 16, 2016)
- Language: English
- ISBN-10: 149192912X
- ISBN-13: 978-1491929124
- Product Dimensions: 7 x 1.1 x 9.2 inches
- Shipping Weight: 2.2 pounds
- Amazon Best Sellers Rank: #28,052 in Books
Site Reliability Engineering: How Google Runs Production Systems 1st Edition
From the Publisher
This book is divided into four sections:
- Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
- Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
- Practices—Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
- Management—Explore Google's best practices for training, communication, and meetings that your organization can use
How to Read This Book
This book is a series of essays written by members and alumni of Google’s Site Reliability Engineering organization. It’s much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you. (If there are other articles that support or inform the text, we reference them so you can follow up accordingly.)
You don’t need to read in any particular order, though we’d suggest at least starting with Chapters 2 and 3, which describe Google’s production environment and outline how SRE approaches risk, respectively. (Risk is, in many ways, the key quality of our profession.) Reading cover-to-cover is, of course, also useful and possible; our chapters are grouped thematically, into Principles (Part II), Practices (Part III), and Management (Part IV). Each has a small introduction that highlights what the individual pieces are about, and references other articles published by Google SREs, covering specific topics in more detail. Additionally, there’s a companion website mentioned in the book that has a number of helpful resources.
We hope this will be at least as useful and interesting to you as putting it together was for us.
— The Editors.
About the Author
Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland’s peering hub. He is the author or coauthor of a number of technical papers and/or books, including "IPv6 Network Administration" for O’Reilly, and a number of RFCs. He is currently cowriting a history of the Internet in Ireland, and is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.
Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google’s advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He’s also a licensed professional engineer.
Jennifer Petoff is a Program Manager for Google’s Site Reliability Engineering team and based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.
I bought the Kindle version anyways because I spend enough time in front of a backlit screen that it seemed worth it to read something this large using a device that's better on your eyes. Unfortunately the Kindle version is formatted terribly and I wish I'd bought the print version instead. The book is broken up into Parts which are broken up into Chapters which are further broken up into headlined sections. The Kindle version identifies those headlined sections as chapters which is somewhat useless.
Anyways, the first few chapters aren't especially useful unless you work at Google. They mostly discuss what's unique about Google's computing infrastructure. Despite this, they were EASILY my favorite part of the book because the material is so interesting and their approach is so unique. After that, each chapter is written in a way that it can stand on its own if you aren't reading the entire book, or are reading it out of order. This is convenient for people who want to pick and choose what parts they want to read, but means that people who are reading the entire thing wind up getting a lot of the same information multiple times. It's all written by different people too, which on the one hand makes it not quite as repetitive, but on the other hand makes it hard to just skim over the sections with info you already have because you don't recognize it as information you already know until you've processed it.
Overall this is a fantastic book on DevOps, SRE, and current trends in the industry. It's a great read for anyone who wants to apply some "best practices" to their role. I would however say that reading the entire thing is overkill for most people and not necessarily the best use of your time if you have other things you'd like to be learning as well.
Part 1 - Fascinating read. I imagine this would be a good overview if you're about to start at Google and want a sneak peek at how things are done, but I'm only speculating this as an outsider.
Part 2 - Interesting and useful concepts for modern cloud computing.
Part 3 - Some useful info and a lot of stuff that's not really unique to Google in my experience. Read the parts that you think you could use some improvement on, skip the rest.
Part 4 - A condensed view from a managerial perspective of things you already read in Part 3.
Part 5 - Some case studies, comparisons from other businesses, a useless recap, and examples that could be useful to share using the website version of the book if you're trying to explain to your team what new concepts are being implemented.
The other thing that annoyed me was that EVERY tool they talked about was written in-house and has almost no relevance in the real world. Someone needs to write a (shorter) 'real world' SRE book.
This book has a lot of great information, which I have found invaluable over the years. One of the harder things for growing organizations is to keep teams focused, and I've seen that DevOps and SRE practices help to zero in on what is essential.
A lot of automation-related work feels like 'yak shaving,' a term for entirely unrelated tasks that don't add value to our product. For development teams, this feels very frustrating. Why would I want to make a script to automate this? We only use it once a year!
SRE helps to solve these frustrations, to some extent, with practices that help organizations understand why they should communicate, why they should talk about issues, and why we measure some things at some level and not others.
There are also large sections that go into specific internal Google software tools, which are essentially not valuable to anyone who doesn't plan to work at Google.
Otherwise, it's a good book. :)
A large portion of the book is Google-centric, but that is required to understand their path to this construct.
I felt a large injustice was done by not addressing the hit or miss mentality of custom engineering. What to buy vs build. At the scale of Google build was almost always better than buy, however, that is rarely true in the real world (or at least is rarely perceived as true).
Top international reviews
This is the direction that infrastructure teams should be heading in terms of skill levels too.
This book is a solid description of Site Reliability Engineering at Google. It is full of good ideas. However, most would be difficult to implement in many organisations without revolutionary change in the culture.
Need for Revolutionary Cultural Change
The revolutionary cultural change needed is to treat operational work as our first job. Operational work is not something that is done on the side.
The change that organisations need to make is to recognise that operational work is a vital component of a product. A product is more than features shovelled out the door—it is about the experience of using that product. This is where operational work is critical: we find ways to make the product stable and reliable.
Good Ideas From SRE
The good ideas I got from this book are:
- Continual incident management training
- Continual improvement in alerting
Incident Management Training
In all service organisations I have worked at, incident management training has been limited to a few professionals in the Service Delivery/Operations. All operational personnel should have regular incident management training to keep their skills current.
The practice of having only a few people trained means that there is confusion about roles and expectations in a real incident. And there is usually just one person trying to juggle being an incident commander, customer liaison, incident recorder, etc. In the end, they become less effective in these critical roles.
Google ensures that all SRE personnel are able to do those roles, and holds regular drills to practice them. These drills are based upon post-mortems of production issues.
Google’s policy is that pages should only be sent if a human has to do something. Google aims for a maximum of two (2) pages per 12-hour shift. All other alerts should either have an automated response or just be logged for future reference.
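As a rough illustration of that paging policy (not Google's actual tooling), the routing rule could be sketched as below. The `Alert` fields, the `Action` names, and the `over_budget` helper are all hypothetical names chosen for this example; only the "page if a human must act" rule and the two-pages-per-shift figure come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE = "page"          # interrupt a human
    AUTOMATE = "automate"  # run an automated response
    LOG = "log"            # record for future reference


@dataclass
class Alert:
    name: str
    needs_human: bool                # hypothetical flag: only a person can resolve this
    has_auto_response: bool = False  # hypothetical flag: a scripted remedy exists


def route(alert: Alert) -> Action:
    """Page only when a human has to do something; otherwise automate or log."""
    if alert.needs_human:
        return Action.PAGE
    if alert.has_auto_response:
        return Action.AUTOMATE
    return Action.LOG


# Figure quoted in the review: at most two pages per 12-hour shift.
PAGE_BUDGET_PER_SHIFT = 2


def over_budget(shift_alerts: list[Alert]) -> bool:
    """True when one shift's alerts would produce more pages than the budget."""
    pages = sum(1 for a in shift_alerts if route(a) is Action.PAGE)
    return pages > PAGE_BUDGET_PER_SHIFT
```

A shift that trips `over_budget` would be a signal to revisit which alerts really need a person, rather than a hard limit enforced at runtime.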
In many organisations, alert management is seen as unwelcome toil. I have been to sites where there are thousands of critical database alerts that no one was investigating. (One site had over 6,000, and another had about 2,000.) In both cases, management was wondering why the systems were so unstable.
Alerts need to tell the SRE about a potential problem before a customer notices. Too often, operational personnel are only reacting to customer complaints.
To help people look at alerts, the alerts should be tuned for relevance (not all threshold violations will impact service delivery), and frequency (alert storms should be curtailed or throttled).
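A minimal sketch of those two kinds of tuning, assuming a simple per-alert suppression window for frequency and a boolean service-impact check for relevance. The class and function names, the window length, and the `impacts_service` flag are assumptions made for illustration, not something the book or any particular monitoring tool prescribes.

```python
class AlertThrottle:
    """Suppress repeats of the same alert inside a fixed time window."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._last_sent: dict[str, float] = {}  # alert key -> time last sent

    def should_send(self, alert_key: str, now: float) -> bool:
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.window_seconds:
            return False  # still inside the window: part of an alert storm
        self._last_sent[alert_key] = now
        return True


def is_relevant(value: float, threshold: float, impacts_service: bool) -> bool:
    """A threshold violation matters only if it can affect service delivery."""
    return value > threshold and impacts_service
```

For example, with `AlertThrottle(window_seconds=300)`, a database alert that fires every few seconds would be sent once and then held back until five minutes have passed, while alerts with a different key are unaffected.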
Automation is key to a successful SRE team. The more work that can be done by computers, the better. The book does have a salutary lesson about an automated task wiping all data in a data centre. And with automation, there comes the issue of deskilling of SRE personnel.
SRE automation should be treated as production changes. The same care and attention that is taken for customer-facing applications should be applied to critical automation scripts. This is where software development experience and knowledge become vital for SRE personnel.
Deskilling can be counteracted through live drills for incident management training. However, this means systems should be set aside for such a purpose.
This book shows that Google's culture is "blameless" and that there is no line between devs and ops; there is the concept of SRE, which could be said to be similar to today's DevOps, though with more responsibilities.
A book that everyone who works in IT should read, and also all the
This book is very interesting because it shows different tips & tricks to resolve and manage communication problems between departments and, of course, reliability problems.
I suggest it to every IT professional, ITIL expert, DevOps wannabe and, of course, CTO.