dl.google.com: Powered by Go
10:00 26 Jul 2013
Tags: download, oscon, port, c++, google, groupcache, caching

Brad Fitzpatrick
Gopher, Google
@bradfitz
bradfitz@golang.org
http://bradfitz.com/
https://go.dev/
https://github.com/golang/groupcache/

* Overview / tl;dw:

- dl.google.com serves Google downloads
- Was written in C++
- Now in Go
- Now much better
- Extensive, idiomatic use of Go's standard library
- ... which is all open source
- composition of interfaces is fun
- _groupcache_, now Open Source, handles group-aware caching and cache-filling

* too long...

* me

- Brad Fitzpatrick
- bradfitz.com
- @bradfitz
- past: LiveJournal, memcached, OpenID, Perl stuff...
- nowadays: Go, Go, Camlistore, Go, anything & everything written in Go ...

* I love Go

- this isn't a talk about Go, sorry.
- but check it out.
- simple, powerful, fast, liberating, refreshing
- great mix of low- and high-level
- light on the page
- static binaries, easy to deploy
- not perfect, but my favorite language yet

* dl.google.com

* dl.google.com

- HTTP download server
- serves Chrome, Android SDK, Earth, much more
- Some huge, some tiny (e.g. WebGL white/blacklist JSON)
- behind an edge cache; still high traffic
- lots of datacenters, lots of bandwidth

* Why port?

* reason 0

$ apt-get update

.image oscon-dl/slow.png

- embarrassing
- Google can't serve a 1,238 byte file?
- Hanging?
- 207 B/s?!

* Yeah, embarrassing, for years...

.image oscon-dl/crbug.png

* ... which led to:

- complaining on corp G+. Me: "We suck. This sucks."
- primary SRE owning it: "Yup, it sucks. And is unmaintained."
- "I'll rewrite it for you!"
- "Hah."
- "No, serious. That's kinda our job. But I get to do it in Go."
- (Go team's loan-out-a-Gopher program...)

* How hard can this be?

* dl.google.com: few tricks

each "payload" (~URL) described by a protobuf:

- paths/patterns for its URL(s)
- go-live reveal date
- ACLs (geo, network, user, user type, ...)
- dynamic zip files
- custom HTTP headers
- custom caching

* dl.google.com: how it was

.image oscon-dl/before.png

* Aside: Why good code goes bad

* Why good code goes bad

- Premise: people don't suck
- Premise: code was once beautiful
- code tends towards complexity (gets worse)
- environment changes
- scale changes

* code complexity

- without regular love, code grows warts over time
- localized fixes and additions are easy & quick, but globally crappy
- features, hacks and workarounds added without docs or tests
- maintainers come & go,
- ... or just go.

* changing environment

- Google's infrastructure (hardware & software), like anybody's, is always changing
- properties of networks, storage
- design assumptions no longer make sense
- scale changes (design for 10x growth, rethink at 100x)
- new internal services (beta or non-existent then, dependable now)
- once-modern home-grown invented wheels might now look archaic 

* so why did it suck?

.image oscon-dl/slow.png

- stalling its single-threaded event loop, blocking when it shouldn't
- capped at one CPU, and couldn't even use a fraction of it

* but why?

- code was too complicated
- future maintainers slowly violated unwritten rules
- or knowingly violated them, assuming it couldn't be too bad?
- C++ single-threaded event-based callback spaghetti
- hard to know when/where code was running, or what "blocking" meant

* Old code

- served from local disk
- single-threaded event loop
- used sendfile(2) "for performance"
- tried to be clever and steal the fd from the "SelectServer" sometimes to manually call sendfile
- while also trying to do HTTP chunking,
- ... and HTTP range requests,
- ... and dynamic zip files,
- lots of duplicated copy/paste code paths
- many wrong/incomplete in different ways

* Mitigation solution?

- more complexity!
- ad hoc addition of more threads
- ... not really defined which threads did what,
- ... or what the ownership or locking rules were,
- no surprise: random crashes

* Summary of 5-year-old code in 2012

- incomplete docs, tests
- stalling event loop
- ad-hoc threads...
- ... stalling event loops
- ... races
- ... crashes
- copy/paste code
- ... incomplete code
- two processes in the container
- ... different languages

* Environment changes

- Remember: on start, we had to copy all payloads to local disk
- in 2007, using local disk wasn't restricted
- in 2007, sum(payload size) was much smaller
- in 2012, containers get tiny % of local disk spindle time
- ... why aren't you using the cluster file systems like everybody else?
- ... cluster file systems own disk time on your machine, not you.
- in 2007, it started up quickly.
- in 2012, it started in 12-24 hours (!!!)
- ... hope we don't crash! (oh, whoops)

* Copying N bytes from A to B in event loop environments (node.js, this C++, etc)

- Can *A* read?
- Read up to _n_ bytes from A.
- What'd we get? _rn_
- _n_ -= _rn_
- Store those.
- Note that we want to write to *B* now.
- Can *B* write?
- Try to write _rn_ bytes to *B*. Got _wn_.
- buffered -= _wn_
- while (blah blah blah) { ... blah blah blah ... }

* Thought that sucked? Try to mix in other state / logic, and then write it in C++.

*  

.image oscon-dl/cpp-write.png

*  

.image oscon-dl/cpp-writeerr.png

*   

.image oscon-dl/cpp-toggle.png

* Or in JavaScript...

- [[https://github.com/nodejitsu/node-http-proxy/blob/master/lib/node-http-proxy/http-proxy.js]]
- Or Python gevent, Twisted, ...
- Or Perl AnyEvent, etc.
- Unreadable, discontiguous code.

* Copying N bytes from A to B in Go:

.code oscon-dl/copy.go /START OMIT/,/END OMIT/

- dst is an _io.Writer_ (an interface type)
- src is an _io.Reader_ (an interface type)
- synchronous (blocks)
- Go runtime deals with making blocking efficient
- goroutines, epoll, user-space scheduler, ...
- easier to reason about
- fewer, easier, compatible APIs
- concurrency is a _language_ (not _library_) feature
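A minimal, self-contained sketch of the same idea, assuming in-memory endpoints for illustration. The `copyString` helper is a hypothetical name, not part of dl.google.com; the only real API here is `io.Copy`, which blocks until EOF or an error.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// copyString is a hypothetical helper for illustration: it pushes s
// through io.Copy from an io.Reader into an io.Writer and reports what
// arrived. The source and destination could just as well be a file, a
// socket, or an http.ResponseWriter.
func copyString(s string) (string, int64, error) {
	src := strings.NewReader(s) // an io.Reader
	var dst bytes.Buffer        // *bytes.Buffer is an io.Writer
	n, err := io.Copy(&dst, src)
	return dst.String(), n, err
}

func main() {
	out, n, err := copyString("hello, gophers")
	if err != nil {
		fmt.Println("copy failed:", err)
		return
	}
	fmt.Printf("copied %d bytes: %q\n", n, out)
}
```

There is no readiness polling and no buffer bookkeeping in the caller; the loop from the event-loop slides lives inside `io.Copy`.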

* Where to start?

- baby steps, not changing everything at once
- only port the `payload_server`, not the `payload_fetcher`
- read lots of old design docs
- read lots of C++ code
- port all command-line flags
- serve from local disk
- try to run integration tests
- while (fail) { debug, port, swear, ...}

* Notable stages

- pass integration tests
- run in a lightly-loaded datacenter
- audit mode
- ... mirror traffic to old & new servers; compare responses.
- drop all SWIG dependencies on C++ libraries
- ... use IP-to-geo lookup service, not static file + library

* Notable stages

- fetch blobs directly from blobstore, falling back to local disk on any errors,
- relying entirely on blobstore, but `payload_fetcher` still running
- disable `payload_fetcher` entirely; fast start-up time.

* Using Go's Standard Library

* Using Go's Standard Library

- dl.google.com mostly just uses the standard library

* Go's Standard Library

- net/http
- io
- [[/pkg/net/http/#ServeContent][http.ServeContent]]

* Hello World

.play oscon-dl/server-hello.go

* File Server

.play oscon-dl/server-fs.go

* http.ServeContent

.image oscon-dl/servecontent.png

* io.Reader, io.Seeker

.image oscon-dl/readseeker.png
.image oscon-dl/reader.png
.image oscon-dl/seeker.png

* http.ServeContent

$ curl -H "Range: bytes=5-" http://localhost:8080

.play oscon-dl/server-content.go
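For readers without the slide's source file at hand, here is a sketch of the Range behavior that runs in-process instead of listening on port 8080. The `fetchRange` helper and the sample body are assumptions for illustration; `http.ServeContent` and `net/http/httptest` are standard library.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// contentHandler serves a fixed string. http.ServeContent needs only an
// io.ReadSeeker and takes care of Range and conditional requests itself.
func contentHandler(w http.ResponseWriter, r *http.Request) {
	body := strings.NewReader("I am some content.\n")
	// A zero modtime means no Last-Modified header is sent.
	http.ServeContent(w, r, "foo.txt", time.Time{}, body)
}

// fetchRange is a hypothetical helper: it calls the handler directly and
// returns the status code and body for the given Range header value.
func fetchRange(rangeHeader string) (int, string) {
	req := httptest.NewRequest("GET", "/", nil)
	if rangeHeader != "" {
		req.Header.Set("Range", rangeHeader)
	}
	rec := httptest.NewRecorder()
	contentHandler(rec, req)
	res := rec.Result()
	defer res.Body.Close()
	b, _ := io.ReadAll(res.Body)
	return res.StatusCode, string(b)
}

func main() {
	code, body := fetchRange("bytes=5-")
	fmt.Printf("%d %q\n", code, body)
}
```

With the Range header set, the handler answers 206 Partial Content and skips the first five bytes; without it, the full body comes back with 200.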

* groupcache

* groupcache

- memcached alternative / replacement
- [[http://github.com/golang/groupcache]]
- _library_ that is both a client & server
- connects to its peers
- coordinated cache filling (no thundering herds on miss)
- replication of hot items
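To make "coordinated cache filling" concrete, here is a single-process sketch using only the standard library. It is not groupcache's implementation or API: the `fillCache` type and `fillCount` helper are hypothetical, there is no eviction, and nothing crosses the network. It only shows the core idea that concurrent misses on one key trigger a single fill.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// entry holds one cached value; ready is closed once val is set.
type entry struct {
	ready chan struct{}
	val   string
}

// fillCache is a hypothetical in-process cache with coordinated fills.
type fillCache struct {
	mu   sync.Mutex
	m    map[string]*entry
	fill func(key string) string // the expensive computation
}

// Get returns the value for key. The first caller to miss runs fill;
// everyone else asking for the same key waits for that one result.
func (c *fillCache) Get(key string) string {
	c.mu.Lock()
	if c.m == nil {
		c.m = make(map[string]*entry)
	}
	e, ok := c.m[key]
	if ok {
		c.mu.Unlock()
		<-e.ready // wait for the in-flight (or finished) fill
		return e.val
	}
	e = &entry{ready: make(chan struct{})}
	c.m[key] = e
	c.mu.Unlock()

	e.val = c.fill(key) // runs outside the lock, once per key
	close(e.ready)
	return e.val
}

// fillCount starts the given number of concurrent callers for one key
// and reports how many times the fill function actually ran.
func fillCount(callers int) int64 {
	var fills int64
	c := &fillCache{fill: func(key string) string {
		atomic.AddInt64(&fills, 1)
		return "value-for-" + key
	}}
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Get("hot-key")
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&fills)
}

func main() {
	fmt.Println("fills:", fillCount(100))
}
```

A real cache would also need eviction and a way to recover if the fill fails; groupcache additionally picks an owning peer for each key so the same deduplication works across machines.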

* Using groupcache

Declare who you are and who your peers are.

.code oscon-dl/groupcache.go /STARTINIT/,/ENDINIT/

This peer interface is pluggable. (e.g. inside Google it's automatic.)

* Using groupcache

Declare a group. (group of keys, shared between group of peers)

.code oscon-dl/groupcache.go /STARTGROUP/,/ENDGROUP/

- group name "thumbnail" must be globally unique
- 64 MB max per-node memory usage
- Sink is an interface with SetString, SetBytes, SetProto

* Using groupcache

Request keys

.code oscon-dl/groupcache.go /STARTUSE/,/ENDUSE/

- might come from local memory cache
- might come from peer's memory cache
- might be computed locally
- might be computed remotely
- across all threads on all machines, each thumbnail is made only once, then fanned out in-process and across the network to all waiters

* dl.google.com and groupcache

- Keys are "<blobref>-<chunk_offset>"
- Chunks are 2MB
- Chunks come from the local memory cache (for self-owned and hot items),
- Chunks come from a remote peer's cache, or
- Chunks are fetched from Google storage systems
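A small sketch of how such a key could be derived from a byte offset. The decimal offset encoding and the `chunkKey` name are assumptions for illustration; the slide only states the general "<blobref>-<chunk_offset>" shape and the 2MB chunk size.

```go
package main

import "fmt"

// chunkSize is the 2MB chunk size mentioned above.
const chunkSize = 2 << 20

// chunkKey is a hypothetical helper: it rounds off down to the start of
// its chunk and formats a key as "<blobref>-<chunk_offset>".
func chunkKey(blobref string, off int64) string {
	aligned := off - off%chunkSize
	return fmt.Sprintf("%s-%d", blobref, aligned)
}

func main() {
	// "example-blob" is a made-up blobref.
	fmt.Println(chunkKey("example-blob", 0))
	fmt.Println(chunkKey("example-blob", 5<<20)) // 5MB falls in the chunk starting at 4MB
}
```

Because every offset inside a chunk maps to the same key, any two requests touching the same 2MB region share one cache entry.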

* dl.google.com interface composition

.code oscon-dl/sizereaderat.go /START_1/,/END_1/

* io.SectionReader

.image oscon-dl/sectionreader.png

* chunk-aligned ReaderAt

.code oscon-dl/chunkaligned.go /START_DOC/,/END_DOC/

- Caller can do ReadAt calls of any size and any offset
- `r` only sees ReadAt calls on 2MB offset boundaries, of size 2MB (unless final chunk)
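One way such a wrapper could be written, as a sketch and not the code from the slide's source file. The `chunkAligned`, `recorder`, and `readThrough` names are hypothetical, and this version allocates a fresh buffer per chunk and does no caching of its own.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// SizeReaderAt is an io.ReaderAt that also knows its total size.
type SizeReaderAt interface {
	Size() int64
	io.ReaderAt
}

// chunkAligned wraps r so that r only sees aligned, chunk-sized reads.
type chunkAligned struct {
	r         SizeReaderAt
	chunkSize int64
}

func (c *chunkAligned) Size() int64 { return c.r.Size() }

// ReadAt accepts any offset and length, and satisfies it by reading
// whole chunks (shorter only for the final chunk) from the wrapped reader.
func (c *chunkAligned) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, fmt.Errorf("negative offset %d", off)
	}
	size := c.r.Size()
	n := 0
	for n < len(p) {
		pos := off + int64(n)
		if pos >= size {
			return n, io.EOF
		}
		start := pos - pos%c.chunkSize
		length := c.chunkSize
		if start+length > size {
			length = size - start // final, shorter chunk
		}
		chunk := make([]byte, length)
		m, err := c.r.ReadAt(chunk, start)
		if m < len(chunk) {
			if err == nil {
				err = io.ErrUnexpectedEOF
			}
			return n, err
		}
		n += copy(p[n:], chunk[pos-start:])
	}
	return n, nil
}

// recorder logs the "offset+length" of every read the wrapped reader sees.
type recorder struct {
	SizeReaderAt
	calls []string
}

func (r *recorder) ReadAt(p []byte, off int64) (int, error) {
	r.calls = append(r.calls, fmt.Sprintf("%d+%d", off, len(p)))
	return r.SizeReaderAt.ReadAt(p, off)
}

// readThrough reads n bytes at off via the wrapper and returns the data
// plus the reads that reached the underlying reader.
func readThrough(data string, chunkSize, off int64, n int) (string, string) {
	rec := &recorder{SizeReaderAt: strings.NewReader(data)}
	ra := &chunkAligned{r: rec, chunkSize: chunkSize}
	p := make([]byte, n)
	m, _ := ra.ReadAt(p, off)
	return string(p[:m]), strings.Join(rec.calls, " ")
}

func main() {
	// A 4-byte chunk size keeps the demo readable; dl.google.com uses 2MB.
	got, calls := readThrough("abcdefghij", 4, 3, 5)
	fmt.Printf("%q via reads %s\n", got, calls)
}
```

The caller's unaligned 5-byte read at offset 3 becomes two aligned 4-byte reads underneath, which is exactly the access pattern a chunk-keyed cache wants.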

* Composing all this

- http.ServeContent wants a ReadSeeker
- io.SectionReader(ReaderAt + size) -> ReadSeeker
- Download server payloads are a type "content" with Size and ReadAt, implemented with calls to groupcache.
- Wrapped in a chunk-aligned ReaderAt
- ... concatenate parts with MultiReaderAt

.play oscon-dl/server-compose.go /START/,/END/
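The first two bullets can be sketched with the standard library alone. Here the hypothetical `content` type is backed by a byte slice rather than groupcache, and the `get` helper calls the handler in-process; `MultiReaderAt` and the chunk-aligned wrapper are left out.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// content is a hypothetical stand-in for a payload. It offers only Size
// and ReadAt, the two operations the download server's payloads provide.
type content []byte

func (c content) Size() int64 { return int64(len(c)) }

func (c content) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 || off > int64(len(c)) {
		return 0, io.EOF
	}
	n := copy(p, c[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

// serve adapts the payload for http.ServeContent:
// ReaderAt + size -> io.SectionReader -> io.ReadSeeker.
func serve(w http.ResponseWriter, r *http.Request) {
	c := content("Hello, composed world!\n")
	rs := io.NewSectionReader(c, 0, c.Size())
	http.ServeContent(w, r, "hello.txt", time.Time{}, rs)
}

// get is a hypothetical helper that runs one request against serve.
func get(rangeHeader string) (int, string) {
	req := httptest.NewRequest("GET", "/", nil)
	if rangeHeader != "" {
		req.Header.Set("Range", rangeHeader)
	}
	rec := httptest.NewRecorder()
	serve(rec, req)
	res := rec.Result()
	defer res.Body.Close()
	b, _ := io.ReadAll(res.Body)
	return res.StatusCode, string(b)
}

func main() {
	code, body := get("bytes=7-14")
	fmt.Printf("%d %q\n", code, body)
}
```

The payload type never implements Seek or any HTTP logic; `io.SectionReader` supplies the seeking and `http.ServeContent` supplies the protocol handling.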

* Things we get for free from net/http

- Last-Modified
- ETag
- Range requests (w/ its paranoia)
- HTTP/1.1 chunking, etc.
- ... old server tried to do all this itself
- ... incorrectly
- ... incompletely
- ... in a dozen different copies

* Overall simplification

- deleted C++ payload_server & Python payload_fetcher
- 39 files (14,032 lines) deleted
- one binary now (just Go `payload_server`, no `payload_fetcher`)
- starts immediately, no huge start-up delay
- server is just "business logic" now, not HTTP logic

* From this...

.image oscon-dl/before.png

* ... to this.

.image oscon-dl/after.png

* And from pages and pages of this...

.image oscon-dl/cpp-writeerr.png

* ... to this

.image oscon-dl/after-code.png

* So how does it compare to C++?

- less than half the code
- more testable, more tests
- same CPU usage for same bandwidth
- ... but can do much more bandwidth
- ... and more than one CPU
- less memory (!)
- no disk
- starts up instantly (not 24 hours)
- doesn't crash
- handles hot download spikes

* Could we have just rewritten it in new C++?

- Sure.
- But why?

* Could I have just fixed the bugs in the C++ version?

- Sure, if I could find them.
- Then have to own it ("You touched it last...")
- And I already maintain an HTTP server library. Don't want to maintain a bad one too.
- The Go version is much more maintainable. (and 3+ other people now maintain it)

* How much of dl.google.com is closed-source?

- Very little.
- ... ACL policies
- ... RPCs to Google storage services.
- Most is open source:
- ... code.google.com/p/google-api-go-client/storage/v1beta1
- ... net/http and rest of Go standard library
- ... `groupcache`, now open source ([[https://github.com/golang/groupcache][github.com/golang/groupcache]])