- Woot – the original deal a day site, now featuring t-shirt and wine subsections.
- Yugster
- BlingDaily
- JustDeals
- ThingFling – make 3 orders with them, and you get free shipping
- 1 Sale a Day
- Item Hut
- Schnoop
- Steep and Cheap – mostly clothing
- Going Today
- Midnight Box
- Lunar Loot – contrary to its name, mainly features stuff from Earth
- Dillyeo – you’d think most of the stuff would involve dill (or parsley), but a quick visit turns up electronic items
- Froobio – a few other categories, but by default redirects to “deal of the day”
- I have to have that – on some items the shipping is free, so doesn’t matter how many you buy
Daily deal sites like Woot
Top scalability mistakes
John Coggeshall, CTO of Automotive Computer Services, and author of Zend PHP Certification Practice Book and PHP5 Unleashed, gave a talk at OSCON 2008 on top 10 scalability mistakes. I wasn’t there, but he posted the slides for everybody to follow. Here’re some lessons learned.
- Define the scalability goals for your application. If you don’t know how many requests you’re shooting for, you don’t know whether you’ve built something that works, and how long it’s going to last you.
- Measure everything. CPU usage, memory usage, disk I/O, network I/O, requests per second, with the last one being the most important. If you don’t know the baseline, you don’t know whether you’ve improved.
- Design your database with scalability in mind. Assume you’ll have to implement replication.
- Do not rely on NFS for code sharing on a server farm. It’s slow and it’s got locking issues. While the idea of keeping one copy of code, and letting the rest of the servers load them via NFS might seem very convenient, it doesn’t work in practice. Stick to some tried practices like rsync. Keep the code local to the machine serving it, even if it means a longer push process.
- Play around with I/O buffers. If you’ve got tons of memory, play with TCP buffer size – your defaults are likely to be set conservatively. See your tax dollars at work and use this Linux TCP Tuning guide. If your site is written in PHP, use output buffering functions.
- Use Ram Disks for any data that’s disposable. But you do need a lot of available RAM lying around.
- Optimize bandwidth consumption by enabling compression via mod_deflate, setting zlib.output_compression to true for PHP sites, or using Tidy content reduction for PHP+Tidy sites.
- Configure PHP for speed. Turn off the following: register_globals, auto_globals_jit, magic_quotes_gpc, expose_php, register_argc_argv, always_populate_raw_post_data, session.use_trans_sid, session.auto_start. In John’s example, session.gc_divisor is set to 10,000 and output_buffering to 4096.
- Do not use blocking I/O, such as reading another remote page via curl. Make all the calls non-blocking, otherwise the wait is something you can’t really optimize against. Rely on background scripts to pull down the data necessary for processing the request.
- Don’t underestimate caching. If a page is cached for 5 minutes, and you get even 10 requests per second for a given page, that’s 3,000 requests your database doesn’t have to process.
- Consider PHP op-code cache. This will be available to you off-the-shelf with PHP6.
- For content sites consider taking static stuff out of dynamic context. Let’s say you run a content site, where the article content remains the same, while the rest of the page is personalized for each user, as it has My Articles section, and so on. Instead of getting everything dynamically from the DB, consider generating yet another PHP file on the first request, where the article text would be stored in raw HTML, and dynamic data pulled for logged-in users. This way the generated PHP file will only pull out the data that’s actually dynamic.
- Pay great attention to database design. Learn indexes and know how to use them properly. InnoDB outperforms MyISAM in almost all contexts, but doesn’t do full-text searching. (Use sphinx if your search needs get out of control.)
- Design PHP applications in an abstract way, so that the app never needs to know the IP address of the MySQL server. Something like ‘mysql-writer-db’, and ‘mysql-reader-db’ will be perfectly ok for a PHP app.
- Run external scripts monitoring the system health. Have the scripts change the HOSTS if things get out of control.
- Do not do database connectivity decision-making in PHP. Don’t spend time doing fallbacks if your primary DB is down. Consider running MySQL Proxy for simplifying DB connectivity issues.
- For super-fast reads consider SQLite. But don’t forget that it’s horrible with writes.
- Use Keepalive properly. Use it when both static and dynamic files are served off the same server, and you can control the timeouts, so that a bunch of Keep-alive requests don’t overwhelm your system. John’s rule? No Keep-alive request should last more than 10 seconds.
- Monitor via familiar Linux commands. Such as iostat and vmstat. The iostat command is used for monitoring system input/output device loading by observing the time the devices are active in relation to their average transfer rates. The iostat command generates reports that can be used to change system configuration to better balance the input/output load between physical disks. vmstat reports information about processes, memory, paging, block IO, traps, and cpu activity.
- Make sure you’re logging relevant information right away. Otherwise debugging issues is going to get tricky.
- Prioritize your optimizations. A 50% speedup of code that runs on 2% of the pages yields a 1% total improvement. A 10% speedup of code that runs on 80% of the pages yields an 8% overall improvement.
- Use profilers. They draw pretty graphs, they’re generally easy to use.
- Keep track of your system performance. Keep a spreadsheet of some common stats you’re tracking, so that you can authoritatively say how much of performance gain you got by getting a faster CPU, installing extra RAM, or upgrading your Linux kernel.
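The back-of-the-envelope numbers in the caching and prioritization bullets above are easy to recompute for your own traffic. Here is a minimal sketch in Python; the function names are mine, not from John’s slides, and the model is deliberately naive (it assumes a steady request rate and that a speedup applies evenly to the pages it touches).

```python
def requests_absorbed(cache_minutes, requests_per_second):
    """Requests arriving while one cached copy of a page is still valid."""
    return cache_minutes * 60 * requests_per_second


def overall_improvement(code_speedup_pct, pages_affected_pct):
    """Site-wide gain, in percent, from speeding up one slice of the code."""
    return code_speedup_pct * pages_affected_pct / 100


print(requests_absorbed(5, 10))     # 5-minute cache at 10 requests per second
print(overall_improvement(50, 2))   # big win on rarely hit pages
print(overall_improvement(10, 80))  # small win on most pages
```

Plugging in your own cache lifetime and request rate gives a quick idea of whether a cache is worth the trouble before you build it.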
Complete presentation is down below:
A few things about fair trade coffee
I am reading Starbucked by Taylor Clark, and the book is quite enjoyable, both as a look inside the coffee industry, and as a business case study of Starbucks. Clark dedicates an entire chapter to fair trade coffee practices, which I wasn’t too familiar with but, like anybody else, assumed were a Good Thing. Fair trade coffee practices, controlled by the non-profit TransFair USA, pay farmers participating in the program $1.26 a pound for regular coffee, and $1.31 for certified organic. Under the fair trade label it’s resold to you at $12-15 a pound, making the retailer quite a winner in this transaction (originally fair trade was supposed to eliminate the middleman, and thereby lower the final cost of coffee).
When the price of coffee beans can occasionally go under 40c, this seems like a good deal, if you’re a coffee farmer, so what’s the catch?
- Fair trade contracts are binding, and require the coffee bean farmers to commit to $1.26-$1.31 even if the market surges (as it does when there’s a cold summer in Brazil). Ok, this is a bit hypothetical, but coffee markets have been known to swing wildly nevertheless. In 2006 Starbucks (the largest seller of fair trade coffee in the US) actually paid its non-fair-trade growers an average of $1.42 per pound. Oops.
- TransFair requires that each coffee farm participating in the program be coop-owned and employ no outside seasonal labor. This rules out private farms, family-owned farms, and corporation-owned farms. A family of coffee bean growers starts out a farm, hires seasonal labor to pick the beans, and wants to sell it as fair trade coffee? TransFair doesn’t let those capitalist pigs get anywhere near the application form.
- Roasters admit that fair trade coffee is of inferior quality. While the rest of the coffee farms have to compete in lower-priced open market, they frequently do it by quality of their product. When a fair trade farm is guaranteed $1.26-$1.31 a pound, the economic rationales start to take over, and growers always try to cut their costs to enjoy higher profit margins.
- TransFair requires every participant in the fair trade program – retailer or coffee grower – to sign a release form promising never to criticize the program in public.
MapReduce usage at Google
Via High Scalability blog (a great addition to any RSS reader out there) there’s a link to Jeffrey Dean’s presentation on MapReduce usage at Google. Actually, his presentation touches upon a few aspects of Google infrastructure, such as GFS and BigTable, so there’s more in this video. What caught my eye is the relative growth of MapReduce inside Google – 2.2 mln jobs run in September 2007.
In the table above, note the drastic growth of input data analyzed and output data generated. The number of actual MapReduce jobs has also grown significantly and reached 10,000 in September 2007.
Dean also presented an interesting graph about the frequency of commits of new MapReduce jobs into the repository – as you can see there are months when the number of new projects goes through the roof, followed by a spike.
The reason? Summer interns.
Complete set of slides is available from Yahoo! Research, which organized the Data-Intensive Computing Symposium.
24 Web site performance tips
Yahoo! Developer Network blog had an entry by Stoyan Stefanov with his presentation from the PHP Quebec conference. A few points to take away, in case you don’t feel like going through a 76-slide presentation:
- An extra 100 ms of page rendering time costs Amazon 1% in sales. An extra 500 ms leads to 20% less traffic for Google.
- Make fewer HTTP requests – combine CSS and JS files into single downloads. Minify both JS and CSS.
- Combine images into CSS sprites.
- Bring static content closer to the users. That usually means CDNs like Akamai or Limelight, but sometimes a co-location facility or data center in a foreign country is the only option.
- Static content should have Expires: headers way into the future, so that they’re never re-requested.
- Dynamic content should have a Cache-Control: header.
- Offer content gzip’ed.
- Stoyan claims nothing will be rendered in the browser till the last piece of CSS has been served, and therefore it’s critical to send CSS as early in the process as possible. I happen to have a document with CSS declared at the very end, and disagree with this statement – at least the content seems to render OK without CSS, and then self-corrects when CSS finally loads.
- Move the scripts all the way to the bottom to avoid the download block – Stoyan’s example shows placing the javascript includes right before </body> and </html>, although it’s possible to place them even further down (well, you’d break XHTML purity, I suppose, if you declare your documents to be XHTML).
- Avoid CSS expressions.
- Consider placing the minified CSS and JS files on separate servers to fight browser’s default pipelining settings – not everybody has FasterFox or tweaked pipeline settings.
- For super-popular pages consider inlining JS for fewer HTTP requests.
- Even though placing content on external servers with different domains will help you with HTTP pipelining, don’t go crazy with various domains – they all require DNS lookups.
- Every 301 redirect is a wasted HTTP request.
- For busy backend servers consider PHP’s flush().
- Use GET over POST any time you have a choice.
- Analyze your cookies – large number of them could substantially increase the number of TCP packets.
- For faster JavaScript and DOM parsing, reduce the number of DOM elements.
- document.getElementsByTagName('*').length will give you the total number of elements. Look at those abusive <div>s.
- Any missing JS file is a significant performance penalty – the browser will parse the 404 page you generate, trying to see if it has valid <script>s.
- Optimize your PNGs – check out pngcrush, pngoptimizer
- Optimize JPEGs – jpegtran
- Make sure you have favicon.ico – generating those 404s will be expensive, plus once you have it, it’s cache-able.
- Toolkits for measuring page loads: AOL PageTest, FiddlerTool HTTP debugging proxy, IBM Page Detailer instrumentation tool, YSlow, and Firebug are suggested in the presentation. My personal addition to the list is Charles, which was recommended by a colleague.
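To make the header and gzip tips above a bit more concrete, here is a rough sketch in Python of what a server might attach to its responses. None of this comes from Stoyan’s slides: the function names, the ten-year lifetime, and the `private, max-age=0` policy are my own assumptions for illustration, and a real site would normally set these in the web server config rather than in application code.

```python
import gzip
import time
from email.utils import formatdate

ONE_YEAR = 365 * 24 * 60 * 60  # seconds


def static_headers(now=None, lifetime=10 * ONE_YEAR):
    """Headers for static files: an Expires date far in the future."""
    now = time.time() if now is None else now
    return {"Expires": formatdate(now + lifetime, usegmt=True)}


def dynamic_headers(max_age=0):
    """Headers for dynamic pages: an explicit Cache-Control policy."""
    return {"Cache-Control": f"private, max-age={max_age}"}


def gzip_body(body):
    """Compress a response body and return it with the matching headers."""
    compressed = gzip.compress(body)
    headers = {
        "Content-Encoding": "gzip",
        "Content-Length": str(len(compressed)),
    }
    return compressed, headers


page = b"<div>hello</div>" * 200
compressed, headers = gzip_body(page)
print(len(page), "bytes before,", len(compressed), "bytes after gzip")
print(static_headers())
print(dynamic_headers())
```

Keep in mind that a far-future Expires only works if you rename the file whenever its content changes, since browsers will not come back to check.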
And here’s the whole presentation, although it’s not possible to follow links from Slideshare slides.
