Archives for the Optimization category

MapReduce usage at Google

Via High Scalability blog (a great addition to any RSS reader out there) there’s a link to Jefrrey Dean’s presentation on MapReduce usage in Google. Actually, his presentation touches upon a few aspects of Google infrastructure, such as GFS, and BigTable, so there’s more on this video. What caught my eye is the relative growth of MapReduce inside Google - 2.2 mln jobs run in September 2007.

image

In the table above, note the drastic growth of input data analyzed and output data generated. The number of actual MapReduce jobs has also grown significantly and reached 10,000 in September 2007.

image

Dean also presented an interesting graph about the frequency of commits of new MapReduce jobs into the repository - as you can see there are months when the number of new projects goes through the roof, followed by a spike.

image

The reason? Summer interns.

image

Complete set of slides is available from Yahoo! Research, which organized the Data-Intensive Computing Symposium.

The most expensive query

What’s the most expensive query you can think of? How about this one - USPS money orders are sold throughout the United States at numerous post office locations. Each money order has a unique ID number, and while there’s no data on how many money orders are sold annually, you’d assume that finding out about the status of the money order is running a SELECT query on some large table that has that money_order_id as unique index.

How long would that query take?

Well, for one, a trip to the post office (20 minutes sound reasonable, but your mileage may vary). You have to physically request Form 6401, as there’s no option to pre-fill it online. So make it another 10 minutes at postal window.

Then it takes $5, as specified in the USPS money order rules. After that the filled 6401 travels to some place in Iowa, which would get back to you two weeks later in an official letter from some kind of USPS database query execution department.

Total cost: 2 weeks + 20 minutes + 10 minutes + $5.00

Seattle conference scalability conference talks on Google Video

The recent Slashdot discussion points out how hard it is to learn about system scalability, unless you already are working for a company that’s in the business of building large scalable systems. However, Google’s recent Seattle conference on scalability brought videos of some pretty interesting talks to the Web.

System Abstractions for Handling Large Datasets by Jeff Dean, Google

Building a Scalable Resource Mgmt System for Grid Computing by Khalid Ahmed, Platform Computing Corp.

YouTube Scalability by Cuong Do Cuong

Lessons In Building Scalable Systems by Reza Behforooz, Google Talk

Using MapReduce on Large Geographic Datasets by Barry Brumitt, Google, Inc.

Scaling Google for Every User by Marissa Mayer, Google

SCTPs Reliability and Fault Tolerance

VeriSign’s Global DNS Infrastucture

Lustre File System by Peter Braam, Cluster File Systems

Building a Linux-based peer-to-peer MMOG

Massively multiplayer online games are always interesting to study as far as network optimizations and data transfer optimizations that they resort two. Researchers from Thailand and China are building a Linux-based peer-to-peer massively multiplayer online game and have published a paper on the problems they encountered:

Building real-time interactive P2P game playing applications in Linux posses many challenges and opens a wide area of research. There have been many implementations of remote interactive servers for game playing in some universities or companies providing collaborative access to the remote services.

Most current systems that provide a collaborative remote game environment either use the (multi) Unicast technique or the Multicast technique to transmit data packets to the participants of the experiment group in the network. Both of these schemes have disadvantages and advantages. In this paper we implement architecture under Linux for real-time multiplayer game application based on XCAST especially for the cases where there exist numerous small collaborative groups. We had provided extension to the Linux kernel and XCAST and show that formation of numerous simultaneous groups, where each group would collaborate for a separate game, is possible. The system proposed is robust in consistently providing group formation and collaboration activities in real time, back-up route or priority queuing, and on time packet delivery with minimum delay in network in the presence of continuous node arrival and departure in the entire game playing procedure. We also show that for data packets of low size, the use of XCAST in the network layer decreases the stress at the sender in each group whereas due to the increased header size of the XCAST packets. Our implementation has shown significant improvements to meet the demands of some real-time game applications.

Price optimization popular in retail industry

Even if you spent a single day in an Economics class, you’re probably familiar with a concept of supply and demand, where the price for a product has an impact on the demand and subsequent sales of it.

Associated Press runs an article on retailers employing mathematical models for price optimization, where some products are priced higher to generate higher margins, and some are discounted to generate larger volumes even at the expense of per-product margins. DemandTec, Oracle and SAP are some of the companies producing those mathematical models for retailers around the country, with AP listing some of the pricing optimizations employed currently.

Most of the time, according to the article, the calculations are not made in vacuum, but in comparison with existing sales and current product selection. So three power drills selling for $90, $120 and $130, all generate certain among of profit for the retailer. The higher the price, the higher the profit, as markup is universal among all the models. High-end consumers go for the $130, while bargain hunters think there’s nothing wrong with $90 drill. Result? $120 drill doesn’t sell that well. Pricing optimization places the second drill at $110, where it’s suddenly affordable for those who used to buy $90 drills. After all, it’s only additional $20.

There’s an older article in CFO Magazine justifying the cost of price optimization systems for CFOs.

12 rules for running a successful open source project

There’s a tech talk How Open Source Projects Survive Poisonous People (And You Can Too) on Google Video, where the guys who developed Subversion and currently work at Google (the company tends to hire a bunch of open source developers including VIM creator) share their tips on running an open source project.

Here’s a synopsis of their tips, although the video is worth watching for real-life stories and anecdotes they tell.

  1. Don’t strive for perfection - you will end up polishing and improving and adding on instead of releasing actual software?
  2. Don’t “paint the bikeshed” - spend 1 day on discussing the nuclear power plant plans and spend 1 week discussing which color to paint the bike shed for the workers who ride bikes to work.
  3. Don’t get obsessed with the process - you’re building a product, not optimal process.
  4. Have a direction - don’t just try to build the best software ever, and wind up with a feature creep. Pick a direction and stick with it.
  5. Send people to mail archives for a discussion that already happened in the past.
  6. Keep documentation on design decisions, bug fixes, mistakes.
  7. Have developers write consistent log messages.
  8. Send commit e-mails - this allows everybody to be in the loop on what’s being worked on and who does what.
  9. Don’t let people submit mega-projects they quietly worked on for the past few months - no one can review it, no one can test it, there should be incremental commits and branches.
  10. Don’t let people put their names at the top of the source code files - this discourages additions and bug fixes, since people feel the file is being “owned” by the author.
  11. “Patches welcome” should be a reply to the users who request a variety of new features that primarily suits their own specific use case.
  12. Try to avoid people that genuinely would like to code, but cannot follow a self-paced learning schedule, and require hand-holding and explanations almost on every issue they face.

Bram Moolenaar on increasing everyday efficiency

Bram Moolenaar, the creator of vim, spoke at Google yesterday as part of their open source developer speaker series. He mainly spoke about developer efficiency, and not as much about vim itself, except for the Q&A session, when people asked lots of questions about vim ongoing development and future.

Moolenaar, now a Googler, spoke in great detail about improving efficiency when working with large amounts of text on aregular basis, which is 90% of a software developer job. His mantra consists of:

  1. finding inefficiency
  2. finding a way to correct that inefficiency and optimize it
  3. making that optimization a habit

Suppose you find yourself doing the same menial task such as copy and pasting or retrieving links from an HTML document. The proper way of thinking about this is “How can I optimize this to a single or double keystroke?” The solutions from Moolenaar (which, surprisingly, mostly involved vim) included sets of macros, regular expressions and just new commands that would make the routine task automated.

One drawback of such dependence, of course, is inablity to work in any other environment, except for the editor you’re married to and the one you created macros for. If you want to edit your Microsoft Outlook e-mails using your vim editor, because you find it most efficient, most likely it will not work.

Amazon crashed itself

That Amazon promotion on XBox 360 for $100 did not go too well this morning. I logged in at roughly 10:45, checked My Account, just to make sure the cookie that’s saved into Amazon system is fresh and new, so I am not presented with the login screen somewhere in the process. At about 11:00 am the site was completely inaccessible with Firefox tab icons spinning away for minutes at a time.

Greg Linden provides some insight as a former Amazonian:

When I was at Amazon, every year we in engineering would try to avoid spikes in traffic, especially around peak holiday loads, and every year marketing folks would want to run some promotion specifically designed to create a mad frenzy on the site. Usually, we convinced them to change the promotion, but apparently engineering lost (or was asleep at the switch) this year.

People who didn’t get the console got kinda upset with the site performance.

In related news, did you know you could build MySQL clusters on Amazon’s E2 service, while utilizing S3 for storage? Due to sudden availability of free time I did some reading on MySQL Cluster this morning.

memcache for MySQL

Brian Aker of MySQL AB released version 0.3 of memcache_engine for MySQL. memcache_engine marries in-memory object caching engine memcache (originally developed by Danga Interactive) with SQL query capabilities, basically enabling you do to SELECT/INSERT/UPDATE against your cache servers. The source is pretty light - take away the make files, and it’s just ha_memcache.cc and ha_memcache.h with round 25k worth of code.

Caching dynamic content with Apache

Caching is easy if all you gotta cache are some static files that need to be served over and over, and they never change. Unfortunately, most of the sites on the Internet are not in the business of serving that kind of content, and allowing Apache to cache dynamic content is frequently a problem. O’Relly Network has an article on mod_cache:

If you run a reverse proxy server–if your Apache server sits in front of some other back-end server, proxying requests to it–mod_cache will cache the content retrieved from those back-end servers, as well as local content. This is the configuration that folks seem to be most familiar with. Indeed, caching is often most effective in this scenario. When you have a slower legacy back-end server producing some of your content, this setup is useful to give it a little more pep.