Software Engineer at @isogeo and @madameaime. Active on multiple other projects. Former @epitech student. Also film director.
8 stories
·
1 follower

Highlighting

14 Comments and 28 Shares
And if clicking on any word pops up a site-search for articles about that word, I will close all windows in a panic and never come back.
12 public comments
jimwise
4054 days ago
reply
(Thank you)
kazriko
4055 days ago
reply
I tend to do this as well...
Colorado Plateau
Michdevilish
4055 days ago
reply
the highlight of my day
Canada
mrak
4054 days ago
I see what you did there
ashtonbt1
4055 days ago
reply
Holy carp, I thought I was the only one!
elzapp
4055 days ago
reply
The best is still when you highlight text, and the application automatically jumps into edit-mode, losing the highlight. Yes, JIRA, I'm looking at you!
smadin
4055 days ago
reply
I know several people with this habit. I've always assumed they do it specifically to drive me bonkers. HOW can you read like that?!?
Boston
RedSonja
4055 days ago
YES. My husband does that, and I refuse to try to read anything over his shoulder because of it.
denismm
4055 days ago
It gives you a visual mark of where you were last, so your eye doesn't get lost as you scroll.
smadin
4055 days ago
I guess I just do that by...knowing where I am in the text? I wonder if there's a correlation between this behavior and moving a finger down the page of a physical book as you read.
denismm
4055 days ago
For me it depends on the site design. If it's small text in big blocks I'm more likely to do it, etc.
tdarby
4055 days ago
Well, I do it because it helps me read. That it drives other people bonkers is simply an added bonus.
RedSonja
4055 days ago
For my husband, it seems to be more of a fidgety thing. I've seen him select different bits of text 3,4,5 times without doing any scrolling. It seems like the mouse user's equivalent of running their thumbnail over the pages of the book you're reading.
Lilyheart
4055 days ago
I do it to keep my place depending on how the block of text is set up. I am a fidgety person, but I've never used my finger to keep my place on physical media.
rtreborb
4055 days ago
reply
Ugh, I hate how Chrome handles this. It does the last one, not #2
San Antonio, TX
jepler
4055 days ago
reply
(this habit is the real reason gnome wants to remove middle-click quick paste)
Earth, Sol system, Western spiral arm
JayM
4055 days ago
reply
Guilty. Though others reading over my shoulder HATE it!
Atlanta, GA
maxdibe
4055 days ago
reply
I know that feel, bro.
on a bike
shrodes
4055 days ago
reply
Title text:

"And if clicking on any word pops up a site-search for articles about that word, I will close all windows in a panic and never come back."
Melbourne, Australia

Google's Datacenters on Punch Cards

9 Comments and 25 Shares

If all digital data were stored on punch cards, how big would Google's data warehouse be?

James Zetlen

Google almost certainly has more data storage capacity than any other organization on Earth.

Google is very secretive about its operations, so it's hard to say for sure. There are only a handful of organizations who might plausibly have more storage capacity or a larger server infrastructure. Here's my short list of the top contenders:

Honorable mentions:

  • Amazon (They're huge, but probably not as big as Google.)
  • Facebook (They're on the right scale and growing fast, but still playing catch-up.)
  • Microsoft (They have a million servers, although no one seems sure why.) [1]Data Center Knowledge: Ballmer: Microsoft has 1 Million Servers

Let's take a closer look at Google's computing platform.

Follow the money

We'll start by following the money. Google's aggregate capital expenditures (spending on building stuff) add up to somewhere over $12 billion.[2]I'm excluding the cost of an extremely expensive building they bought in New York. [3]Data Center Knowledge: Google’s Data Center Building Boom Continues: $1.6 Billion Investment in 3 Months Their biggest data centers cost half a billion to a billion dollars, so they can't have more than 20 or so of those.

On their website,[4]Data center locations Google acknowledges that they have datacenters in the following locations:

  1. Berkeley County, South Carolina
  2. Council Bluffs, Iowa
  3. Atlanta, Georgia
  4. Mayes County, Oklahoma
  5. Lenoir, North Carolina
  6. The Dalles, Oregon
  7. Hong Kong
  8. Singapore
  9. Taiwan
  10. Hamina, Finland
  11. St Ghislain, Belgium
  12. Dublin, Ireland
  13. Quilicura, Chile

In addition, they appear to operate a number of other large datacenters (sometimes through subsidiary corporations), including:

  1. Eemshaven, Netherlands
  2. Groningen, Netherlands
  3. Budapest, Hungary
  4. Wrocław, Poland
  5. Reston, Virginia
  6. Additional sites near Atlanta, Georgia

They also operate equipment at dozens to hundreds of smaller locations around the world.

Follow the power

To figure out how many servers Google is running, we can look at their electricity consumption. Unfortunately, we can't just sneak up to a datacenter and read the meter.[5]Actually, wait, can we? Somebody should try that. Instead, we have to do some digging.

The company disclosed that in 2010 they consumed an average of 258 megawatts of power.[6]Google used 2,259,998 MWh of electricity in 2010, which translates to an average of 258 megawatts. How many computers can they run with that?

We know that their datacenters are quite efficient, only spending 10-20% of their power on cooling and other overhead.[7]Google: Efficiency: How we do it To get an idea of how much power each server uses, we can look at their "container data center" concept from 2005. It's not clear whether they actually use these containers in practice—it may just have been a now-outdated experiment—but it gives an idea of what they consider(ed) reasonable power consumption. The answer: 215 watts per server.

Judging from that number, in 2010, they were operating around a million servers.
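The arithmetic behind that estimate can be sketched in a few lines. This is only a rough check using the figures quoted above; treating the 10-20% overhead as a fraction of total power is an assumption about how the numbers combine, not something Google has published.

```python
# Figures quoted in the text
annual_mwh = 2_259_998        # electricity used in 2010, in MWh
hours_per_year = 365 * 24     # 2010 was not a leap year
watts_per_server = 215        # from the "container data center" concept

# Average draw over the year, in megawatts
average_mw = annual_mwh / hours_per_year


def servers(overhead_fraction):
    """Servers the average draw could power once overhead is subtracted.

    Assumption: overhead is a fraction of total power consumption.
    """
    usable_watts = average_mw * 1e6 * (1 - overhead_fraction)
    return usable_watts / watts_per_server


low = servers(0.20)   # pessimistic: 20% overhead
high = servers(0.10)  # optimistic: 10% overhead

print(f"average draw: {average_mw:.0f} MW")
print(f"servers: {low / 1e6:.2f} to {high / 1e6:.2f} million")
```

Both ends of the range land close to one million, which is consistent with the figure in the text.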

They've grown a lot since then. By the end of 2013, the total amount of money they've pumped into their datacenters will be three or four times what it was as of 2010. They've contracted to buy over three hundred megawatts of power at just three sites,[8]Google: Purchasing clean energy which is more than they used for all their operations in 2010.

Based on datacenter power usage and spending estimates, my guess would be that Google is currently running—or will soon be running—between 1.8 and 2.4 million servers.

But what do these "servers" actually represent? Google could be experimenting in all kinds of wild ways, running boards with 100 cores or 100 attached disks. If we assume that each server has a couple[9]Anywhere from 2 to 5 of 2 TB disks attached, we come up with close to 10 exabytes [10]As a refresher, the order is: kilo, mega, giga, tera, peta, exa, zetta, yotta. An exabyte is a million terabytes. of active storage attached to running clusters.
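As a rough check on that figure, the sketch below multiplies out the ranges given in the text: 1.8 to 2.4 million servers, 2 to 5 disks each, 2 TB per disk. It uses decimal units (1 TB = 10^12 bytes), matching the footnote's "an exabyte is a million terabytes".

```python
TB = 10**12
EB = 10**18
DISK_BYTES = 2 * TB   # 2 TB disks, as assumed in the text


def storage_eb(servers, disks_per_server):
    """Total attached storage in exabytes."""
    return servers * disks_per_server * DISK_BYTES / EB


for server_count in (1_800_000, 2_400_000):
    for disks in (2, 5):
        total = storage_eb(server_count, disks)
        print(f"{server_count:>9,} servers x {disks} disks = {total:.1f} EB")
```

The "close to 10 exabytes" estimate sits near the low-disk end of this range; assuming five disks per server would roughly double it.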

10 Exabytes

The commercial hard disk industry ships about 8 exabytes worth of drives annually. [12]IDC: Worldwide External Disk Storage Systems Factory Revenue Declines for the Second Consecutive Quarter Those numbers don't necessarily include companies like Google, but in any case, it seems likely that Google is a large piece of the global hard drive market.

To make things worse, given the huge number of drives they manage, Google has a hard drive die every few minutes.[11]Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso, Failure Trends in a Large Disk Drive Population This isn't actually all that expensive a problem, in the grand scheme of things (they just get good at replacing drives), but it's weird to think that when a Googler runs a piece of code, they know that by the time it finishes executing, one of the machines it was running on will probably have suffered a drive failure.

Google tape storage

Of course, that only covers storage attached to running servers. What about "cold" storage? Who knows how much data Google—or anyone else—has stored in basement archives?

In a 2011 phone interview with Paul Mah of SMB Tech, Simon Anderson of Tandberg Data let slip [13]SMB Tech: Is Tape Still Relevant for SMBs? that Google is the world's biggest single consumer of magnetic tape cartridges, purchasing 200,000 per year. Assuming they've stepped up their purchasing since then as they've expanded, this could add up to another few exabytes of tape archives.

Putting it all together

Let's assume Google has a storage capacity of 15 exabytes, or 15,000,000,000,000,000,000 bytes.

A punch card can hold about 80 characters, and a box of cards holds 2000 cards:

15 exabytes of punch cards would be enough to cover my home region, New England, to a depth of about 4.5 kilometers. That's three times deeper than the ice sheets that covered the region during the last advance of the glaciers:
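The card and box counts behind that picture follow directly from the two figures above. The sketch below assumes one punched character stores one byte. It does not try to reproduce the depth figure, which also depends on the dimensions of a box and the area of New England, neither of which is given in the text.

```python
# Figures quoted in the text
total_bytes = 15 * 10**18   # 15 exabytes
chars_per_card = 80
cards_per_box = 2000

# Assumption: one punched character stores one byte.
cards = total_bytes // chars_per_card
boxes = cards // cards_per_box

print(f"cards: {cards:.3e}")
print(f"boxes: {boxes:.3e}")
```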

That seems like a lot.

However, it's nothing compared to the ridiculous claims by some news reports about the NSA datacenter in Utah.

NSA datacenter

The NSA is building a datacenter in Utah. Media reports claimed that it could hold up to a yottabyte of data, [14]CNET: NSA to store yottabytes in Utah data centre which is patently absurd.

Later reports changed their minds, suggesting that the facility could only hold on the order of 3-12 exabytes. [15]Forbes: Blueprints Of NSA's Ridiculously Expensive Data Center In Utah Suggest It Holds Less Info Than Thought We also know the facility uses about 65 megawatts of power, [16]Salt-Lake City Tribune: NSA Bluffdale Center won’t gobble up Utah’s power supply which is about what a large Google datacenter consumes.

A few headlines, rather than going with one estimate or the other, announced that the facility could hold "between an exabyte and a yottabyte" of data [17]Dailykos: Utah Data Center stores data between 1 exabyte and 1 yottabyte ... which is a little like saying "eyewitnesses report that the snake was between 1 millimeter and 1 kilometer long."

Uncovering further Google secrets

There are a lot of tricks for digging up information about Google's operations. Ironically, many of them involve using Google itself—from Googling for job postings in strange cities to using image search to find leaked cell camera photos of datacenter visits.

However, the best trick for locating secret Google facilities might be the one revealed by ex-Googler talentlessclown on reddit: [18]reddit: Can r/Australia help find Google's Sydney data center? Seems like a bit of a mystery...

The easiest way to find manned Google data centres is to ask taxi drivers and pizza delivery people.

There's something pleasingly poetic about that. Google has created what might be the most sophisticated information-gathering apparatus in the history of the Earth ... yet the people with the most information about them are the pizza delivery drivers.

Who watches the watchers?

Apparently, Domino's.

7 public comments
chrisamico
4063 days ago
reply
There's a surprising amount of investigative work in this one.
Boston, MA
strugk
4068 days ago
reply
ogrom googla
Cambridge, London, Warsaw, Gdynia
schmod
4068 days ago
reply
Real life imitates Snow Crash. Again.
glenniebun
4068 days ago
reply
So, essentially, you find Google the same way you find Torchwood?
CT USA
rclatterbuck
4068 days ago
reply
.
ksteimle
4061 days ago
Hee. "Illustration courtesy xkcd.com, used with permission."
chusk3
4068 days ago
reply
Hah! I know the guy that asked this one!
Austin, Texas
mcmc
4068 days ago
reply
http://what-if.xkcd.com/63/ Utah Data Center stores data between 1 exabyte and 1 yottabyte ... which is a little like saying "eyewitnesses report that the snake was between 1 millimeter and 1 kilometer long."
North of Boston

Fiona 1.0

1 Share

At last, 1.0: https://pypi.python.org/pypi/Fiona/1.0.

Fiona is OGR's neat, nimble, no tears API. It is a Python library, not a GIS library, and is designed for Python programmers who appreciate:

  • Simplicity and less code.
  • Familiar Python types and protocols like files, dicts, and iterators instead of classes specific to GIS.
  • GeoJSON style feature records.
  • Reading and writing single and multi-layer files.
  • Reading zipped data, too.
  • A handy command line tool that upgrades "ogr2ogr -f GeoJSON".
  • Comprehensive tests.
  • 15 pages of narrative documentation.

I've had lots of help getting to this stage. Thanks, everybody!

The name? At first it was a Shrek reference, but now it's just a probably too cute but hopefully not too annoying recursive bacronym.

Share and enjoy this Fiona Deluxe Professional Home Enterprise Edition 1.0.


Caching: Your Worst Best Friend

1 Share

A cache is a mixed blessing. On the one hand, it helps make things faster. On the other, it can become a crutch, or worse, an excuse. Furthermore, caches can deceive us in two ways. First, they can easily hide a system's true performance characteristics, masking poor design until things start to fail. Second, while a cache might expose a simple interface, getting the most out of it requires that you pay close attention to details and have a thorough understanding of how your system works.

Defining a Cache

To me, what defines something as a cache is the possibility of a miss. If you write code assuming that data is in the cache, then it isn't a cache, it's a database. The most tragic abuse of this I've seen are applications that won't even start unless they can connect to the cache.

We can rephrase this by defining what a cache isn't. A cache is not the reason your system can handle its current load. If you unplug your cache and system resources become exhausted, it isn't a cache. A cache isn't how you make your site fast, or handle concurrency. All it does is help.

Of course, that's a generalization. In fact, the goal of this post is to help you take a small step away from this generalization; to help you use caching as an integral part of a good design instead of just a bandaid.

Stats Or It Didn't Happen

To get the most out of a cache, you need to collect statistics. If you can't answer the question "What's your cache hit ratio?", you aren't caching, you're guessing. But a cache hit ratio is just the start of what you need to know.

For the rest of this post, I'm going to assume that you're using a cache to store a variety of objects with different characteristics. I'm talking about things like object size, cache duration, cost of a miss and so on.

When we look at our cache, there are 4 high-level statistics that we want to track, per object type:

  • cache hit ratio,
  • number of gets,
  • cost of a miss, and
  • size of the data

All of these help paint a clear picture of the health and effectiveness of your cache.

Since our cache sits as a proxy on top of a RESTful API, we can use the URL to figure out the object type. For example, /users/50u.json and /videos/99v/recommendations map to the show user and list video recommendations object types. Normally you should be able to use the object's cache key to figure out its type.

We collect our statistics via a custom nginx log format. Something like:

  # defined in the http section
  # our caching proxy sets the X-Cache header to hit or miss
  log_format cache '$request_method $uri $sent_http_x_cache $bytes_sent $upstream_response_time';
  ....

  # defined in the specific location that proxies to our cache
  access_log /var/log/apiproxy/cache.log cache buffer=128K;

This gives us a file that looks like:

GET /v4/videos/75028v.json hit 2384 0.000
GET /v4/videos/176660v.json miss 2287 0.002
GET /v4/episodes.json hit 372 0.001
GET /v4/videos/222741v/timed_comments/en.json hit 36747 0.001
GET /v4/roles.json miss 511 0.012
GET /v4/containers/20186c.json hit 1561 0.000
GET /v4/containers/20186c/episodes.json miss 426 0.002
GET /v4/containers/20186c/people.json miss 425 0.09
GET /v4/containers/20186c/recommendations.json miss 5376 0.002
GET /v4/containers/6784c/covers/en.json hit 9653 0.001
GET /v4/containers/20186c/contributions.json miss 441 0.016
GET /v4/users/51351u/subscriptions.json hit 360 0.001

We don't need a huge amount of data. Just enough to get a good representation. It doesn't take much to parse this. Here's a snapshot of the first time we analyzed our cache:

route                       |  total hits  |  ratio  |  size  |  miss time  |
show videos                 |    758112    |  42.4   |  2114  |      4      |
list videos/recommendations |    406792    |  38     |  4529  |      6      |
show containers             |    393084    |  53.1   |  4577  |      4      |
show sessions               |    390095    |  72.9   |   988  |     11      |
show videos/subtitles       |    288032    |  89.6   | 19233  |     20      |
list containers/people      |    266975    |  91.1   |  1232  |     17      |
list videos/streams         |    228886    |  18.9   |   716  |      5      |
show users                  |    221952    |  0      |   997  |     11      |

Pretty horrible, eh? Our global cache hit ratio, hovering around a not-great 50%, wasn't very accurate. As an average, the routes with a good hit ratio masked just how bad things really were (not to mention the fact that something was obviously broken with the users endpoint).
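For illustration, here is a minimal sketch of how a log in the format shown earlier could be turned into per-route statistics. It is not the script we actually used. The route_for rule (drop the version prefix, drop ID-like segments, "show" when the path ends in an ID and "list" otherwise) is a hypothetical simplification, and summarize is a made-up helper name.

```python
import re
from collections import defaultdict

# Assumption: IDs look like digits followed by one letter (75028v, 20186c, 51351u).
ID_SEGMENT = re.compile(r"^\d+[a-z]$")


def route_for(uri):
    """Hypothetical mapping from a request path to an object type."""
    if uri.endswith(".json"):
        uri = uri[: -len(".json")]
    segments = [s for s in uri.split("/") if s]
    if segments and re.fullmatch(r"v\d+", segments[0]):
        segments = segments[1:]  # drop the API version prefix, e.g. v4
    names = [s for s in segments if not ID_SEGMENT.match(s)]
    is_show = bool(segments) and bool(ID_SEGMENT.match(segments[-1]))
    return ("show " if is_show else "list ") + "/".join(names)


def summarize(lines):
    """Aggregate gets, hit ratio, average size and average miss time per route."""
    raw = defaultdict(lambda: {"gets": 0, "hits": 0, "bytes": 0, "miss_time": 0.0})
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # skip malformed lines
        _method, uri, status, size, upstream = parts
        entry = raw[route_for(uri)]
        entry["gets"] += 1
        entry["bytes"] += int(size)
        if status == "hit":
            entry["hits"] += 1
        else:
            try:
                entry["miss_time"] += float(upstream)
            except ValueError:
                pass  # nginx logs "-" when there was no upstream response
    report = {}
    for route, entry in raw.items():
        misses = entry["gets"] - entry["hits"]
        report[route] = {
            "gets": entry["gets"],
            "hit_ratio": 100.0 * entry["hits"] / entry["gets"],
            "avg_size": entry["bytes"] / entry["gets"],
            "avg_miss_time": entry["miss_time"] / misses if misses else 0.0,
        }
    return report


sample = [
    "GET /v4/videos/75028v.json hit 2384 0.000",
    "GET /v4/videos/176660v.json miss 2287 0.002",
    "GET /v4/episodes.json hit 372 0.001",
]
for route, stats in sorted(summarize(sample).items()):
    print(route, stats)
```

The point is less the parsing than the grouping: once every request is tagged with an object type, the per-type hit ratio falls out of a single pass over the log.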

From the above, there's 1 core statistic that we aren't measuring: reason for evictions. Specifically, it would be helpful to know if items are being evicted because they are stale, or because the system is under pressure. Our solution for now, which is far from ideal due to the lack of granularity, is to measure the number of times we need to prune our cache due to memory pressure. Since we aren't currently under memory pressure (more on why later), we aren't seeing many prunes, and thus haven't looked very deeply into it.

If your cache isn't a proxy sitting near the top of your stack, you'll likely have to jump through more hoops to extract this data. You could also opt to log some or all of it in real-time using Statsd (or something similar). Finally, although it's always better to measure than assume, there's a good chance that you'll already have an idea about the size and load time of these objects relative to each other (still, measure and expect and be delighted at surprises!).

Master Your Keys

Few of our objects use a simple cache key, such as video:4349v. At best, some types have a handful of permutations. At worst, thousands. Many of our objects vary based on the user's country, platform (web, mobile, tv, ...), user's role and query string parameters (to name a few). This is the reason we had to move beyond Varnish: we needed to generate cache keys that were application-aware. Many object types, for example, treat our iOS and Android applications as the same platform, but a few don't.

This ends up being application specific. The important thing to know is that there'll come a point where a single cache key format just won't scale (because you'll have to pick the one with the most permutations for all routes). It might seem like you're violating some type of design rule, but coupling your keys to your data and system behavior becomes critical if you want to keep things under control.

I will say that we did spend some time normalizing our cache keys. For example, a page=1 is the same as no page parameter, and a per_page=10 is the same as no per_page parameter. It's trivial to do and it can cover some rather common cases (these two are particularly good examples of common cases).
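A minimal sketch of that kind of normalization, assuming the key is built from the request path and query string. The normalize_key name and the DEFAULTS table are illustrative rather than our actual code, and sorting the remaining parameters is an extra step not described above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Parameters whose default value is equivalent to omitting them (from the text).
DEFAULTS = {"page": "1", "per_page": "10"}


def normalize_key(url):
    """Build a cache key that ignores default-valued paging parameters."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if DEFAULTS.get(k) != v]
    params.sort()  # extra assumption: parameter order should not matter
    query = urlencode(params)
    return parts.path + ("?" + query if query else "")


print(normalize_key("/v4/videos.json?page=1&per_page=10"))
print(normalize_key("/v4/videos.json?per_page=25&page=2"))
```

With this, a request carrying only default paging values collapses onto the same key as a request with no paging parameters at all.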

Speaking of paging, early on, we had the capability to serve small per_page requests from a larger one. For example, if we had a per_page=25 cached, we'd be able to serve any requests where per_page was <= 25. We'd actually over-fetch and overlap pages to minimize the impact of various paging inputs. Ultimately this proved incompatible with other design goals, but it's a decent example of how your cache keys don't have to be dumb.

Key permutations are one of those things that can easily cause your caching to fall apart. It's pretty easy to miss the impact on your cache that new features will have. A seemingly isolated feature might force you to add new permutations across the system (we're currently facing this), which can easily cause a cascade. It isn't just that more permutations lead to more misses, but also that more variation -> more memory usage -> more evictions -> more misses.

Dump and examine your cache keys. Look for patterns and duplications. Most importantly, make sure you're using the most granular key possible.

Own Your Values

We've taken two steps to minimize the size of our cache, thus minimizing the number of misses due to premature eviction. The first and simplest is to store compressed values. As a proxy cache, this makes a lot of sense since decompression is distributed to clients. For an internal application cache, you'll likely get much less value from it. We use Nginx's relatively new gunzip module to take care of decompressing the content should the client not support compression.

The other feature that we've recently added is deduplication. The number of possible values is far smaller than the total number of possible keys. For example, even though we need to query them separately, a Canadian user is likely to get the same video details as an American user. This was one of the many projects our amazing intern accomplished. The first thing he did was analyze the cache for duplicates. At 45% duplication after only a day's worth of uptime, we felt the feature was worth adding.
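One way to picture deduplication is a two-level map: keys point at a content hash, and each distinct value is stored once under that hash. The sketch below is a toy in-memory version under that assumption. The DedupCache class is hypothetical and is not how our cache is implemented.

```python
import hashlib


class DedupCache:
    """Toy cache that stores each distinct value once, keyed by content hash."""

    def __init__(self):
        self._key_to_digest = {}  # cache key -> content digest
        self._values = {}         # content digest -> [value, reference count]

    def set(self, key, value):
        self.delete(key)  # release whatever the key pointed at before
        digest = hashlib.sha256(value).hexdigest()
        entry = self._values.setdefault(digest, [value, 0])
        entry[1] += 1
        self._key_to_digest[key] = digest

    def get(self, key):
        digest = self._key_to_digest.get(key)
        return None if digest is None else self._values[digest][0]

    def delete(self, key):
        digest = self._key_to_digest.pop(key, None)
        if digest is None:
            return
        entry = self._values[digest]
        entry[1] -= 1
        if entry[1] == 0:
            del self._values[digest]  # last reference gone, free the value

    def stored_values(self):
        return len(self._values)


cache = DedupCache()
cache.set("video:4349v:us", b'{"title": "example"}')
cache.set("video:4349v:ca", b'{"title": "example"}')
print(cache.stored_values())
```

Two keys, one stored value: the Canadian and American lookups stay separate, but the payload they share occupies memory only once.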

Freshness

Despite everything we've talked about so far, ultimately the best way to improve the effectiveness of your cache is to increase the duration items can stay in the cache. The downside to this is that users will get stale data.

I'm aware of two solutions to this problem. The first is to use object versions to bust the cache. 37signals has a good explanation of this approach.

The approach that we're using is to purge the cache whenever there's an update. This worked well for us because we were already using a queue to communicate changes between systems (as a side note, I strongly recommend that you do this, you'll end up using it more than you can imagine). So it was really just a matter of creating a new listener and sending an HTTP purge to each data center. Most items are purged within 2 seconds of an update.

This approach did require that we design our cache so that we could purge an object, say video 10 and purge all variations (json, xml, country us, country ca, ....). If you're using Varnish, it already supports purging all variations.
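A minimal sketch of one way to make "purge every variation" cheap: keep a secondary index from the object to all the keys that were cached for it. The VariationCache class and its method names are hypothetical, meant only to illustrate the idea rather than describe our implementation or Varnish's.

```python
from collections import defaultdict


class VariationCache:
    """Toy cache that can purge every cached variation of an object at once."""

    def __init__(self):
        self._items = {}                    # (object_id, variation) -> value
        self._variations = defaultdict(set)  # object_id -> set of full keys

    def set(self, object_id, variation, value):
        key = (object_id, variation)
        self._items[key] = value
        self._variations[object_id].add(key)

    def get(self, object_id, variation):
        return self._items.get((object_id, variation))

    def purge(self, object_id):
        """Remove all variations of object_id; return how many were dropped."""
        keys = self._variations.pop(object_id, set())
        for key in keys:
            self._items.pop(key, None)
        return len(keys)


cache = VariationCache()
cache.set("video:10", ("json", "us"), b"...")
cache.set("video:10", ("xml", "ca"), b"...")
cache.set("video:11", ("json", "us"), b"...")
print(cache.purge("video:10"))
```

The update listener then only needs to know the object that changed, not every format and country combination it may have been cached under.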

But Wait, There's More

Varnish has two neat features that we adopted into our own cache which can be quite useful: grace mode and saint mode. Grace mode is about serving up a slightly stale value while a fresh value is fetched in the background. This solves two problems. First, everyone gets a fast cache response. Secondly, when a lot of people ask for the same object at the same time (called a thundering herd), only a single fetch is made.

Saint mode is about serving up stale values when a fresh value isn't available.

Here's what the flow looks like:

cached, is_cached := cache.Get(key)
if is_cached {
  // fresh items are served straight from the cache
  if cached.Fresh() {
    return cached
  }

  // grace mode: it's ok to serve something that's aged, but not stale,
  // while a fresh copy is fetched in the background
  if cached.Aged() {
    go grace(req)
    return cached
  }
}
// not cached, or too stale to serve as-is
response := getResponse(req)
// saint mode: we got an error, might as well serve the cached object
// no matter its age
if response.Status() >= 500 && is_cached {
   return cached
}
return response

Conclusion

Caching isn't a trivial problem to solve. For us, the solution was to collect data and increase our coupling between the cache and the rest of the systems. We've introduced some cases where changes to the business logic will need to be reflected in the cache. We also spent time removing a lot of inefficiencies that come from trying to use it as a generic get/set mechanism. Deduplication was a huge win, which for the time being has resolved our memory constraint. Cleaning up our keys is an ongoing effort as new features are introduced which add new permutations.

Today, our cache hit ratio is looking much better. Within the next couple days we hope to bring the top 10 resources up above 80% (we're almost there). The most important thing though is that all the systems which sit behind our caching layer can survive without it. Some considerably better than others, true. But we aren't driving our cache hit ratio up because our system won't work otherwise. We're driving it up because it results in a better user experience.


Calling All Music Lovers: Send Us Your "Never Stop Jamming" Video

1 Comment

Calling All Music Lovers - Whyd from Tony Hymes on Vimeo.

We are well on our way to becoming the definitive filter for the best of the newest music coming online every single day. Whyd has been growing rapidly, adding music lovers from nearly every country in the entire world. You give us incredible feedback, we listen, and we have always designed the product to do exactly what you want. Thank you for helping us to get where we are. 

We’ve been receiving some very nice feedback too, with music lovers telling us “I can’t even remember what I was doing before I found Whyd;" “this is exactly what I needed;" “It’s my home page;" and “it’s official, I can’t live without Whyd." Nothing makes working on this project more enjoyable than hearing things like this. 

One thing we’ve learned, the more people that join us on Whyd, the better the experience is for everyone. So we need your help today to spread the call. Please share this video to your social networks and directly to your friends who are music lovers. 

We need music lovers to share the music they find, to interact with us, to propel the community forward, and they need Whyd to get the most out of the incredible amount of new music coming online. It’s a win-win. 

By now you surely know our tag line: “Never Stop Jamming." We want to get this buzzing online, so we are asking our music lovers to film themselves saying their best “Never Stop Jamming" and posting it to Twitter, Instagram Video, Vine, Tumblr, and Facebook with the hashtag #neverstopjamming. The maker of the best “Never Stop Jamming" will be featured in Whyd’s brand new introduction video slated to come out later this summer! 

1 public comment
tusbar
4123 days ago
reply
Never stop jamming!
Paris, France

Screensaver

1 Comment and 10 Shares
I'm entering my 24th year of spending eight hours a day firing the Duck Hunt gun at the flying toasters. I'm sure I'll hit one soon.
1 public comment
slivergun
4153 days ago
reply
Damn!