Archives for the Programming category

MapReduce usage at Google

Via the High Scalability blog (a great addition to any RSS reader out there) there’s a link to Jeffrey Dean’s presentation on MapReduce usage at Google. Actually, his presentation touches upon a few other aspects of Google infrastructure, such as GFS and BigTable, so there’s more in this video. What caught my eye is the growth of MapReduce inside Google - 2.2 million jobs were run in September 2007.

image

In the table above, note the drastic growth of input data analyzed and output data generated. The number of distinct MapReduce programs has also grown significantly and reached 10,000 in September 2007.

image

Dean also presented an interesting graph about the frequency of commits of new MapReduce jobs into the repository - as you can see, there are months when the number of new projects goes through the roof, followed by a sharp drop.

image

The reason? Summer interns.

image

The complete set of slides is available from Yahoo! Research, which organized the Data-Intensive Computing Symposium.

24 Web site performance tips

The Yahoo! Developer Network blog has an entry by Stoyan Stefanov with his presentation from the PHP Quebec conference. A few points to take away, in case you don’t feel like going through the 76-slide presentation:

  1. An extra 100 ms of page load time costs Amazon 1% in sales. An extra 500 ms costs Google 20% of its traffic.
  2. Make fewer HTTP requests - combine CSS and JS files into single downloads. Minify both JS and CSS.
  3. Combine images into CSS sprites.
  4. Bring static content closer to the users. That usually means CDNs like Akamai or Limelight, but sometimes a co-location facility or data center in a foreign country is the only option.
  5. Static content should have Expires: headers set far into the future, so that it’s never re-requested.
  6. Dynamic content should have a Cache-Control: header.
  7. Offer content gzip’ed.
  8. Stoyan claims nothing will be rendered in the browser till the last piece of CSS has been served, and therefore it’s critical to send CSS as early in the process as possible. I happen to have a document with CSS declared at the very end, and disagree with this statement - at least the content seems to render OK without CSS, and then self-corrects when CSS finally loads.
  9. Move the scripts all the way to the bottom to avoid blocking other downloads - Stoyan’s example shows placing the JavaScript includes right before </body> and </html>, although it’s possible to place them even further down (well, you’d break XHTML purity, I suppose, if you declare your documents to be XHTML).
  10. Avoid CSS expressions.
  11. Consider placing the minified CSS and JS files on separate servers to get around the browser’s default limit on simultaneous connections per host - not everybody has FasterFox or tweaked pipelining settings.
  12. For super-popular pages consider inlining JS for fewer HTTP requests.
  13. Even though placing content on external servers with different domains will help with parallel downloads, don’t go crazy with the number of domains - they all require DNS lookups.
  14. Every 301 redirect is a wasted HTTP request.
  15. For busy backend servers consider PHP’s flush().
  16. Use GET over POST any time you have a choice.
  17. Analyze your cookies - a large number of them could substantially increase the number of TCP packets.
  18. For faster JavaScript and DOM parsing, reduce the number of DOM elements.
  19. document.getElementsByTagName('*').length will give you the total number of elements. Look at those abusive <div>s.
  20. Any missing JS file is a significant performance penalty - the browser will parse the 404 page you generate, trying to see if it is valid JavaScript.
  21. Optimize your PNGs - check out pngcrush, pngoptimizer
  22. Optimize JPEGs - jpegtran
  23. Make sure you have favicon.ico - generating those 404s will be expensive, plus once you have it, it’s cache-able.
  24. Toolkits for measuring page loads: AOL PageTest, the Fiddler HTTP debugging proxy, the IBM Page Detailer instrumentation tool, YSlow, and Firebug are suggested in the presentation. My personal addition to the list is Charles, which has been recommended by a colleague.
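
Tips 5 through 7 can be sketched in a few lines. This is a minimal Python sketch, not code from the presentation: the function names static_headers and gzip_body, the ten-year default lifetime, and the exact header values are assumptions made for illustration.

```python
import gzip
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def static_headers(now=None, days=3650):
    """Far-future caching headers for static files (lifetime is an assumption)."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(days=days)
    return {
        # HTTP dates are always expressed in GMT
        "Expires": format_datetime(expires, usegmt=True),
        "Cache-Control": f"public, max-age={days * 86400}",
    }


def gzip_body(body, accept_encoding):
    """Compress the response only if the client advertises gzip support."""
    if "gzip" in accept_encoding.lower():
        extra = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return gzip.compress(body), extra
    return body, {}
```

A real server would merge these dictionaries into the response headers. Once a file carries a far-future Expires: header, its URL has to change (for example, with a version number in the file name) whenever the content changes.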

And here’s the whole presentation, although it’s not possible to follow links from SlideShare slides.

SlideShare

Stress testing Web services

Pylot is a new tool for stress testing Web services. As its creators describe it:

You begin by defining your test cases in an XML file. Test cases are where you specify the requests (url, method, body/payload, etc) and verifications. Server responses can be verified by matching content to regular expressions and with HTTP status codes. You can adjust the load settings in the workload controls on the GUI before you start a test run (number of agents, request intervals, rampup time, test duration). These settings enable you to model tests based on various load scenarios. At runtime, the cases are loaded and passed to the load generating engine. Agents are dispatched and run concurrently to send HTTP requests to your web service. Real-time stats and error reporting are displayed for monitoring the test as it executes.

It’s a Python script with command-line and graphical interfaces.
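
To make the description above concrete, here is a rough Python sketch of the same idea: concurrent agents that loop over test cases and verify each response by status code and regular expression. It is not Pylot’s code or API. The run_agents function, the dictionary-based test cases, and the injected send callable are all hypothetical and exist only for illustration.

```python
import re
import threading
import time


def run_agents(send, cases, agents=3, duration=0.2, interval=0.01):
    """Run `agents` threads that replay `cases` until `duration` seconds pass.

    `send(method, url)` must return a (status, body) tuple; it is injected
    so the sketch can be tried without touching a real service.
    """
    stats = {"requests": 0, "errors": 0}
    lock = threading.Lock()
    deadline = time.monotonic() + duration

    def agent():
        while time.monotonic() < deadline:
            for case in cases:
                status, body = send(case["method"], case["url"])
                # Verify by HTTP status code and by regex match on the body
                ok = (status == case["status"]
                      and re.search(case["pattern"], body) is not None)
                with lock:
                    stats["requests"] += 1
                    if not ok:
                        stats["errors"] += 1
                time.sleep(interval)

    threads = [threading.Thread(target=agent) for _ in range(agents)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return stats
```

Passing a function that wraps urllib.request.urlopen as send would turn this into a very crude load generator; ramp-up time and real-time reporting are left out.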

register_shutdown_function possible use cases

Eirik Hoem on his blog provides an overview of PHP’s register_shutdown_function, and suggests using it for the cases when for whatever reason your Web page ran out of memory, fatal’ed, and you don’t want to display a blank page to the users.

register_shutdown_function is also useful for command-line PHP scripts. Pretty frequently your script has to do some task like parsing a large XML file, and the test examples used when it was originally written did not account for the XML file possibly being huge. Therefore your script dies at around 23% completion, and you’re left with 23% of the XML file parsed. Not ideal, but a quick duct-tape-style fix would be to register a shutdown function that calls system() to relaunch the script itself.

If you happen to keep track of which line you’re on while parsing, you can pass the line number as the first parameter to your own script, and make it start off after that 23% mark, or wherever it died. The script then needs to be launched with 0 passed as the first parameter. It will run out of memory, die, launch register_shutdown_function, which will launch another copy of the script (while successfully shutting down the original process) with a new line number, which will repeat the process.

Again, this is a duct tape approach to PHP memory consumption issues while working with large data sets.
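
The control flow of that restart loop, sketched in Python rather than PHP so that it can be tried in a single process. Everything here is a stand-in: OutOfBudget plays the role of the fatal out-of-memory error, run_with_restarts plays the role of the shutdown function calling system(), and the names and the budget parameter are hypothetical.

```python
class OutOfBudget(Exception):
    """Stands in for the fatal error that kills the PHP script."""


def parse_from(lines, start, budget, out):
    """Process lines beginning at index `start`, dying after `budget` lines."""
    for i in range(start, len(lines)):
        if i - start >= budget:
            raise OutOfBudget(i)  # remember which line we died on
        out.append(lines[i].upper())
    return len(lines)


def run_with_restarts(lines, budget):
    """Relaunch the parser from the last line reached until the input is done.

    `budget` must be at least 1, otherwise no launch ever makes progress.
    """
    out, start, launches = [], 0, 0
    while start < len(lines):
        launches += 1
        try:
            start = parse_from(lines, start, budget, out)
        except OutOfBudget as died:
            start = died.args[0]  # the next launch gets the new line number
    return out, launches
```

With five input lines and a budget of two lines per launch, the parser is launched three times and every line is processed exactly once.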

The most expensive query

What’s the most expensive query you can think of? How about this one - USPS money orders are sold throughout the United States at numerous post office locations. Each money order has a unique ID number, and while there’s no data on how many money orders are sold annually, you’d assume that finding out the status of a money order means running a SELECT query on some large table that has money_order_id as a unique index.

How long would that query take?

Well, for one, it takes a trip to the post office (20 minutes sounds reasonable, but your mileage may vary). You have to physically request Form 6401, as there’s no option to pre-fill it online. So make it another 10 minutes at the postal window.

Then it takes $5, as specified in the USPS money order rules. After that, the completed Form 6401 travels to some place in Iowa, which gets back to you two weeks later in an official letter from some kind of USPS database query execution department.

Total cost: 2 weeks + 20 minutes + 10 minutes + $5.00

PHP contest from PHParchitect.com

The guys at PHParchitect are running a PHP contest for the smallest, fastest, most efficient command-line PHP script. The seemingly simple link parser task is probably very tricky, and the task itself is somewhat poorly specced out, as several things are not clear:

  1. Their example lists the href enclosed in <link rel="stylesheet" type="text/css" href="/css/c7y.css" id="Main C7Y CSS"></link>. So is that a valid link? Does anything in an href qualify as a link?
  2. Does a JavaScript window.open qualify as a link?
  3. What about <a href="http://www.yahoo.com" onclick="window.location='http://www.google.com'">link</a> or any similar shenanigans? What qualifies as a link there?
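
The ambiguity is easy to see in code. The following is a small Python sketch (the contest itself is about PHP, and this is not a contest entry) in which the caller has to decide which tags count as links; HrefCollector and extract_links are names made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Records (tag, href) pairs for every start tag carrying an href."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.found.append((tag, value))


def extract_links(html, tags=("a",)):
    """Return href values, keeping only the tags the caller treats as links."""
    collector = HrefCollector()
    collector.feed(html)
    return [url for tag, url in collector.found if tag in tags]
```

Calling extract_links(page) returns only the targets of <a> tags, while extract_links(page, tags=("a", "link")) also returns the stylesheet. URLs hidden in onclick handlers or window.open calls are missed either way, which is exactly the open question.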

__DIR__ in PHP 5.3

Lars Strojny says that a new magic constant, __DIR__, is coming to PHP 5.3. __DIR__ will refer to the directory containing the current script file, the equivalent of dirname(__FILE__). It’s useful for those include and include_once directives where it’s preferable to use absolute paths to avoid searching the include path.

Emotiv publishes neuro SDK

Slashdot had a story on brain control headsets coming out soon from Emotiv. The company seems to have done a fair bit of research in linking various neural activity to explicit emotions. They’re targeting the gaming market, hoping to introduce games that analyze your emotions as well as the kinetic signals that the brain sends to the rest of the body. What’s also cool is that they’re launching an SDK:

Additionally, Emotiv has announced the commercial availability of its full SDK. The SDK has been upgraded significantly since it was first announced in March 2007 at last year’s GDC. The commercially available version of the kit now includes:

  • 2 beta-version neuroheadsets
  • Software toolkit that exposes the APIs
  • Full access to detection libraries
  • Suite of development tools for effective creation and integration of applications with content

The Emotiv EPOC is the world’s first consumer neuroheadset. It detects and processes human conscious thoughts and expressions and non-conscious emotions. By integrating the Emotiv EPOC into their games or other applications, developers can dramatically enhance interactivity, gameplay and player enjoyment by, for example, enabling characters to respond to a player’s smile, laugh or frown; by adjusting the game dynamically in response to player emotions such as frustration or excitement; and enabling players to manipulate objects in a game or even make them disappear using the power of their thoughts.

The headset itself will cost $299 once released into commercial production; the SDK details are available here.

Better graph algorithms

There are a few interesting postings I came across lately on graph parsing optimizations.

Google’s Mark Chu-Carroll discusses parsing strongly-connected digraphs on ScienceBlogs. His approach includes breaking up a complex digraph into an array of strongly-connected subgraphs and then throwing the power of parallelization at the problem. Each machine is responsible for processing its own subgraph, and when it’s done, it reports back, making this algorithm a special case of MapReduce. Such algorithms are useful for microforecasting, i.e. predicting weather patterns at very precise locations, or, if you choose, figuring out giant link farms established by large blog networks all inter-connecting to their sister blogs in an attempt to drive PageRank.
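
The first step, splitting a digraph into strongly-connected subgraphs, can be sketched as follows. This uses Kosaraju’s two-pass algorithm, a standard textbook technique that is not necessarily the one described in the post; the function name and the edge-list input format are assumptions.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Split a digraph, given as (u, v) edge pairs, into its SCCs."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # Pass 1: iterative DFS, recording nodes in order of completion.
    order, seen = [], set()
    for root in sorted(nodes):
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph[root]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if child not in seen:
                    seen.add(child)
                    stack.append((child, iter(graph[child])))
                    break
            else:
                # All children explored: this node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph, latest-finished nodes first.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = [], [root]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(sorted(component))
    return components
```

Each returned component could then be handed to a separate worker, for example via concurrent.futures, which would be the "map" half of the approach.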

Caltech’s Yuri Lifshits discusses similarity-based searches. Similarity searches are employed by visual shopping engines, such as Like.com, recommendation engines and their varieties like personalized news recommendation engines, ad targeting engines, and algorithms that do any kind of fuzzy matching (a job site matching your resume with potential employers, a dating site, etc.). There’s also a paper attached to that presentation for heavier reading.
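
For a feel of what a similarity search does, here is the naive baseline in Python: score every item against the query with cosine similarity and keep the best matches. This brute-force scan is not one of the techniques from the presentation, and the function names and the feature-vector representation are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, items, k=1):
    """Return the names of the k items closest to `query` (linear scan)."""
    ranked = sorted(items, key=lambda name: cosine(query, items[name]),
                    reverse=True)
    return ranked[:k]
```

The scan costs one comparison per stored item for every query, which is why large catalogs need smarter indexing than this.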

15 MySQL tools

The Unhandled Perception blog has a good list of 15 MySQL tools. Most of them are Windows tools that allow quicker access to database schemas and tables.