MapReduce usage at Google

Via the High Scalability blog (a great addition to any RSS reader out there) there’s a link to Jeffrey Dean’s presentation on MapReduce usage at Google. His presentation actually touches upon a few aspects of Google infrastructure, such as GFS and BigTable, so there’s more in the video. What caught my eye is the relative growth of MapReduce inside Google - 2.2 mln jobs run in September 2007.

image

In the table above, note the drastic growth of input data analyzed and output data generated. The number of actual MapReduce jobs has also grown significantly and reached 10,000 in September 2007.

image

Dean also presented an interesting graph about the frequency of commits of new MapReduce jobs into the repository - as you can see, there are months when the number of new projects goes through the roof and then falls off again.

image

The reason? Summer interns.

image

The complete set of slides is available from Yahoo! Research, which organized the Data-Intensive Computing Symposium.

24 Web site performance tips

The Yahoo! Developer Network blog has an entry by Stoyan Stefanov with his presentation from the PHP Quebec conference. A few points to take away, in case you don’t feel like going through a 76-slide presentation:

  1. An extra 100 ms of page load time costs Amazon 1% in sales. An extra 500 ms leads to 20% less traffic to Google.
  2. Make fewer HTTP requests - combine CSS and JS files into single downloads. Minify both JS and CSS.
  3. Combine images into CSS sprites.
  4. Bring static content closer to the users. That usually means CDNs like Akamai or Limelight, but sometimes a co-location facility or data center in a foreign country is the only option.
  5. Static content should have an Expires: header set way into the future, so that it’s not re-requested on every visit.
  6. Dynamic content should have a Cache-Control: header.
  7. Offer content gzip’ed.
  8. Stoyan claims nothing will be rendered in the browser till the last piece of CSS has been served, and therefore it’s critical to send CSS as early in the process as possible. I happen to have a document with CSS declared at the very end, and disagree with this statement - at least the content seems to render OK without CSS, and then self-corrects when CSS finally loads.
  9. Move the scripts all the way to the bottom to avoid the download block - Stoyan’s example shows placing the JavaScript includes right before </body> and </html>, although it’s possible to place them even further down (well, you’d break XHTML purity, I suppose, if you declare your documents to be XHTML).
  10. Avoid CSS expressions.
  11. Consider placing the minified CSS and JS files on separate servers to work around the browser’s default pipelining settings - not everybody has FasterFox or tweaked pipeline settings.
  12. For super-popular pages consider inlining JS for fewer HTTP requests.
  13. Even though placing content on external servers with different domains will help you with HTTP pipelining, don’t go crazy with various domains - they all require DNS lookups.
  14. Every 301 redirect is a wasted HTTP request.
  15. For busy backend servers consider PHP’s flush().
  16. Use GET over POST any time you have a choice.
  17. Analyze your cookies - a large number of them could substantially increase the number of TCP packets.
  18. For faster JavaScript and DOM parsing, reduce the number of DOM elements.
  19. document.getElementsByTagName('*').length will give you the total number of elements. Look at those abusive <div>s.
  20. Any missing JS file is a significant performance penalty - the browser will parse the 404 page you generate, trying to see if it has valid <script>s.
  21. Optimize your PNGs - check out pngcrush, pngoptimizer
  22. Optimize JPEGs - jpegtran
  23. Make sure you have favicon.ico - generating those 404s will be expensive, plus once you have it, it’s cache-able.
  24. Toolkits for measuring page loads: AOL PageTest, the Fiddler HTTP debugging proxy, the IBM Page Detailer instrumentation tool, YSlow, and Firebug are suggested in the presentation. My personal addition to the list is Charles, which a colleague recommended.
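
Tips 5 and 6 are easy to try out from PHP. Below is a minimal sketch, not taken from the presentation: the caching_headers() helper, its parameters, the ten-year lifetime, and the exact Cache-Control values are all my assumptions, so adjust them to your own caching policy.

```php
<?php
// Hypothetical helper (not from the presentation): returns the caching
// header lines for a response. $now and $ttlSeconds are illustrative.
function caching_headers(bool $isStatic, int $now, int $ttlSeconds = 315360000): array
{
    if ($isStatic) {
        // Far-future Expires for static files (default: roughly ten years).
        $expires = gmdate('D, d M Y H:i:s', $now + $ttlSeconds) . ' GMT';
        return [
            'Expires: ' . $expires,
            'Cache-Control: public, max-age=' . $ttlSeconds,
        ];
    }
    // Dynamic pages: assumed policy, keep a private copy but revalidate it.
    return ['Cache-Control: private, max-age=0, must-revalidate'];
}

// In a web request, send the lines before any output is produced:
//   foreach (caching_headers(true, time()) as $line) { header($line); }
```

The helper only builds strings, so it can be checked from the command line before wiring it into a page. With far-future headers on static files you will probably want to rename the file whenever its content changes.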

And here’s the whole presentation, although it’s not possible to follow links from Slideshare slides.

SlideShare

Giving an old PSP a new life

Over the past few months my PSP started to show signs of old age. Maybe it’s my addiction to the World Tour Soccer series, which completely wore out the analog stick, or the frequent use of the device on trains, beaches, and planes, which left the screen dirty. Cleaning out your PSP is actually pretty easy, and is roughly a 20-30 minute project. Things you’ll need (and I got them all from one place, your shopping experience may differ):

Unscrew the three screws on the back of the PSP to separate the front and the back of the device. The last screw might be tough to find: it is under the battery, which you need to take out. There’s a protective seal that warns you that removing it voids the warranty. If you have any kind of warranty left on the device, you should probably have it replaced instead of cleaning it out yourself.

CIMG1772

Separate the front panel of PSP, and remove it. It contains many small buttons such as Start and Select, which could fall out, and get in the way. The front panel frame hardly needs any cleaning, so it’s safe to just put it away for the time being.

CIMG1773

The back of the panel, however, contains the analog stick, and 4-directional button. The analog stick can be replaced, if you have a replacement handy, or just cleaned out, if it’s just the matter of dust and a few odds and ends getting in. If you remove it, that’s the only electronic part on the front panel. The rest of it can have a date with Mr. Windex for brighter shine.

Replace the analog stick or the 4-directional button, and you’re pretty much done. Check out the top buttons, R and L, which are not used in all the games and therefore might be in a different state of wear and tear. Those can be washed and cleaned out as well, nothing but white plastic there.

Time to put the PSP back together, and remember - no spare parts.

CIMG1776

For a pretty small budget, and 20-30 minutes of work you have a good-looking shiny gadget back in shape.

CIMG1777

Stress testing Web services

Pylot is a new stress testing tool for Web services. As its creators describe it:

You begin by defining your test cases in an XML file. Test cases are where you specify the requests (url, method, body/payload, etc) and verifications. Server responses can be verified by matching content to regular expressions and with HTTP status codes. You can adjust the load settings in the workload controls on the GUI before you start a test run (number of agents, request intervals, rampup time, test duration). These settings enable you to model tests based on various load scenarios. At runtime, the cases are loaded and passed to the load generating engine. Agents are dispatched and run concurrently to send HTTP requests to your web service. Real-time stats and error reporting are displayed for monitoring the test as it executes.

It’s a Python script with command-line and graphical interfaces.

register_shutdown_function possible use cases

Eirik Hoem on his blog provides an overview of PHP’s register_shutdown_function, and suggests using it for the cases when for whatever reason your Web page ran out of memory, fatal’ed, and you don’t want to display a blank page to the users.

register_shutdown_function is also useful for command-line PHP scripts. Pretty frequently your script has to do some task like parsing a large XML file, and the test examples used when it was originally written did not account for the XML file possibly being huge. Therefore your script dies at something like 23% completion, and you’re left with 23% of the XML file parsed. Not ideal, but a quick duct-tape-style fix would be to register a shutdown function that calls system(), to which you pass the script itself.

If you happen to keep track of which line you’re on while parsing, you can pass the line number as the first parameter to your own script, and make it start off after that 23% mark, or wherever it died. The script then needs to be launched with 0 passed as the first parameter. It will run out of memory, die, and trigger the registered shutdown function, which will launch another copy of the script with a new line number (the original process finishes shutting down once that copy returns), which will repeat the process.

Again, this is a duct tape approach to PHP memory consumption issues while working with large data sets.
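
Here is a minimal sketch of that restart trick. The function names are mine, and it assumes the script tracks its position in a variable and takes the starting line as its first argument; treat it as an illustration rather than a hardened implementation.

```php
<?php
// Sketch only: function names and the line-number argument are illustrative.

// True when the last recorded error is fatal (for example, memory exhaustion).
function died_fatally(?array $lastError): bool
{
    $fatal = E_ERROR | E_CORE_ERROR | E_COMPILE_ERROR;
    return $lastError !== null && ($lastError['type'] & $fatal) !== 0;
}

// Builds the command that relaunches $script, resuming at $line.
function restart_command(string $script, int $line): string
{
    return escapeshellarg(PHP_BINARY) . ' ' . escapeshellarg($script) . ' ' . $line;
}

// Registers the restart hook; $currentLine is read at shutdown time,
// so it reflects how far the parser got before dying.
function restart_on_fatal(string $script, int &$currentLine): void
{
    register_shutdown_function(function () use ($script, &$currentLine) {
        if (died_fatally(error_get_last())) {
            system(restart_command($script, $currentLine));
        }
    });
}

// Typical use inside the parser script:
//   $line = isset($argv[1]) ? (int) $argv[1] : 0;
//   restart_on_fatal(__FILE__, $line);
//   ... skip ahead to $line, then increment $line after each parsed line ...
```

Two caveats: system() waits for the child to finish, so the processes end up nested rather than replaced, and a single record that always exhausts memory will make the script restart forever unless you cap the number of retries.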

A perfect push-up

There was a lengthy article earlier this month in The New York Times on the importance of doing push-ups, and how it’s an all-around exercise that works quite a few muscles in the human body.

The push-up is the ultimate barometer of fitness. It tests the whole body, engaging muscle groups in the arms, chest, abdomen, hips and legs. It requires the body to be taut like a plank with toes and palms on the floor. The act of lifting and lowering one’s entire weight is taxing even for the very fit.

Then there’s a blog post on the NYT site as well, featuring 93-year-old Jack LaLanne, who still incorporates push-ups into his daily workout. They link to a push-up calculator, which says that an average 27-year-old should be able to do 37 push-ups, and more than 50 in a single session to be considered excellent. Google Video has a few videos on what’s considered a proper push-up.

The most expensive query

What’s the most expensive query you can think of? How about this one - USPS money orders are sold throughout the United States at numerous post office locations. Each money order has a unique ID number, and while there’s no data on how many money orders are sold annually, you’d assume that finding out about the status of the money order is running a SELECT query on some large table that has that money_order_id as unique index.

How long would that query take?

Well, for one, a trip to the post office (20 minutes sounds reasonable, but your mileage may vary). You have to physically request Form 6401, as there’s no option to pre-fill it online. So make it another 10 minutes at the postal window.

Then it costs $5, as specified in the USPS money order rules. After that the completed Form 6401 travels to some place in Iowa, and two weeks later you get an official letter back from some kind of USPS database query execution department.

Total cost: 2 weeks + 20 minutes + 10 minutes + $5.00

TVTrip - videos of hotels worldwide

Pretty cool idea - instead of exploring officially approved photos on the travel agent’s Web site, see what the hotel looks like in a short video. TVTrip was founded by Expedia alumni and has videos of hotels from around the world. They claim 3,825 videos so far, covering a variety of destinations, from some motel in Palo Alto to the Radisson SAS in Paris, France.

PHP contest from PHParchitect.com

The guys at PHParchitect are running a PHP contest for the smallest, fastest, most efficient command-line PHP script. The seemingly simple link parser task is probably very tricky, and the task itself is somewhat poorly specced out, as several things are not clear:

  1. Their example lists the href enclosed in <link rel="stylesheet" type="text/css" href="/css/c7y.css" id="Main C7Y CSS"></link>. So is that a valid link? Does anything in an href qualify as a link?
  2. Does a JavaScript window.open qualify as a link?
  3. What about <a href="http://www.yahoo.com" onclick="window.location='http://www.google.com'">link</a> or any similar shenanigans? What qualifies as a link there?
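
To show how much the answer depends on the reading of the spec, here is a rough sketch of an extractor that can be switched between "any href" and "only <a> tags". It is not the contest's reference solution; the extract_links() name and the regular expression are my own, and a regex is only a crude stand-in for a real HTML parser.

```php
<?php
// Sketch only: extract_links() and its $anchorsOnly switch are illustrative.
// Returns the href values found in $html, in document order.
function extract_links(string $html, bool $anchorsOnly): array
{
    $links = [];
    // Capture the tag name (1), the opening quote (2), and the quoted href value (3).
    $pattern = '/<([a-z][a-z0-9]*)\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\2/is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if ($anchorsOnly && strtolower($m[1]) !== 'a') {
            continue; // strict reading: only <a href="..."> counts as a link
        }
        $links[] = $m[3];
    }
    return $links;
}

$html = '<link rel="stylesheet" type="text/css" href="/css/c7y.css" id="Main C7Y CSS"></link>'
      . '<a href="http://www.yahoo.com" onclick="window.location=\'http://www.google.com\'">link</a>';

$loose  = extract_links($html, false); // both the stylesheet and the anchor
$strict = extract_links($html, true);  // the anchor only
```

Neither reading picks up the onclick target or a window.open call, which is exactly the part the contest rules leave open.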

__DIR__ in PHP 5.3

Lars Strojny says that a new magic constant, __DIR__, is coming in PHP 5.3. __DIR__ will refer to the directory of the current script file. It’s useful for those include and include_once directives where it’s preferable to use absolute paths to avoid searching the include path.
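
A quick sketch of the difference, assuming a hypothetical lib/helpers.php file that sits next to the script (the file name is made up for the example):

```php
<?php
// Hypothetical layout: a lib/helpers.php file sitting next to this script.
$helperPath = __DIR__ . '/lib/helpers.php';

// The pre-5.3 spelling of the same path.
$legacyPath = dirname(__FILE__) . '/lib/helpers.php';

// Both paths are built from this file's own location, so the include path
// is never searched. The is_file() guard is here only because the helper
// file is imaginary.
if (is_file($helperPath)) {
    require_once $helperPath;
}
```

The two variables hold the same string; __DIR__ just saves a function call and some typing.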