Part 1: Drupal Performance

The first section of this book offers details on how to get good performance out of your Drupal powered website, and how to scale it as demand grows. The majority of the features discussed in this section are available without making any modifications to Drupal.

Chapter 1: Getting Started

Synopsis

This chapter explains the importance of fully defining your performance and scalability goals. It helps you to identify what you need to accomplish, showing how to set concrete and attainable goals. This chapter also explains why it's important to maintain historical performance logs, and discusses numerous technologies and services that are available to aid in this effort. It stresses the importance of making regular backups, and of testing backups before changes are made. Finally, this chapter describes best practices for testing changes, and deploying tested changes onto production servers.

Section 1: Setting Goals

Understanding and Defining the Problem

There are numerous ways that you can use this book. It is designed to be readable from cover to cover, while also being usable as a reference. Whether you have a specific performance problem you are trying to solve, or you're researching options for improving the general performance of your website, this book will prove helpful. It can also prove useful to someone making a decision as to whether or not Drupal can scale sufficiently for an upcoming project.

This first chapter is aimed at readers who have been tasked with making general performance improvements to their website. It helps you to better understand what you need to accomplish and why you need to accomplish it. Rather than randomly finding problems and fixing them in the order they're found, you will review your entire website and identify all of the areas where there are significant performance and scalability problems. You will then prioritize these problems based on the potential gains, as well as on the size and complexity of the planned solution. Finally, you will begin focusing on the "lowest hanging fruit", often solving the simplest problems first and quickly realizing measurable performance improvements.

Goals versus Requirements

More often than not, there is a specific reason that you have begun to focus on improving the performance of your website. This reason may be a technical problem that needs solving, such as a database server that fails when you are linked to by popular websites like Slashdot and Digg. Or, your current interest in website performance may be business driven, a task that was handed down the management chain to make all pages on your website load within 2 seconds. Either way, it is important to understand the tasks that need to be accomplished, and to distinguish which tasks are requirements, and which tasks are goals.

A requirement is something that absolutely must be accomplished, while a goal is something that it would be nice to accomplish. Using the above examples, if your database server is failing whenever your website gets too busy, it is reasonable to consider solving this problem a requirement. On the other hand, if you are tasked with achieving sub-two-second page load times, this is more likely to be classified as a goal. You can often achieve sub-two-second page load times for the majority of visitors to most of your pages, but for a variety of reasons it is not always possible to achieve specific page load times for all visitors of all web pages. It is important to set realistic expectations.

Performance and Scalability Checklist

The following lists will help you better define the areas of your website that need to be improved, and to better understand what is driving this need for improvement. They are logically grouped into multiple sections. Review each section to determine which are applicable to your current project, and then work through your selected lists, thoroughly documenting the goals and requirements for your upcoming project. There is a temptation to skip this step and charge head first into actually making changes. However, until you define your goals you have no way to measure your progress, and you may end up trying to fix things that aren't even broken.

Quantitative Goals

Can you quantify the performance improvements you would like to make to your website? Work through this section, clearly listing your quantitative goals to the best of your ability.

  • Average page load times: Are your web pages loading too slowly? Are users complaining of slow page loads? Are pages slow for anonymous visitors, or logged-in users, or both? What are your targeted page load times for each?
  • Maximum page load times: Do most of your web pages load in a reasonable amount of time, while some pages take an abnormally long time? Are the same pages always slow, or does it seem to be more random than that? What is your current maximum page load time? What would be an acceptable maximum page load time? What would be an optimal maximum page load time?
  • Page load times for first time visitors: Do you need to make a good impression on first time visitors to your website? How long does it take someone to load your web page if they've never visited it before, and they don't have any of your page elements loaded in their web browser cache?
  • Number of monthly page views: How many page views has your website seen in each of the previous six months? Have you launched any new advertising campaigns or made any significant announcements that you expect to result in more website traffic? What is your targeted number of page views for each of the next six months? What are you basing this projected growth on?
  • Number of monthly anonymous visitors: What percentage of your traffic is anonymous visitors that do not have user accounts or choose to not log in? Where have these anonymous visitors come from in the past? What is your targeted number of anonymous visitors for each of the next six months?
  • Number of monthly logged-in visitors: What percentage of your traffic is logged-in users? How much does your logged-in traffic increase from month to month? What is your targeted number of logged-in users for each of the next six months?
  • Number of subscriptions: Is your website subscription oriented? How do subscriptions differ from normal users? How many subscriptions have you seen in each of the previous six months? What is your targeted number of subscriptions for each of the next six months?
  • The time it takes to submit content: How long does it currently take to submit a new story? How long does it take to submit a new comment? Are you using free tagging? How many seconds is an acceptable amount of time for submitting new content? How many seconds is an optimal amount of time for submitting new content?

Business Goals

Are there business needs driving your current performance and scalability efforts? Work through this section of the checklist to fully define these business needs.

  • Growth rate: Is there a general business drive to increase the monthly growth rate of your website? How is this growth rate being measured? What is the targeted growth rate? How does this growth rate compare with past growth rates? Is the planned growth rate realistic?
  • Advertisement impressions: Does your business model depend on selling a certain number of advertisements on your website? Is online advertising new to your business, or is it an ongoing source of income? Do you plan to sell the same number of advertisements each month, or do you plan to regularly increase the number of advertisements you are selling? Are you managing the ads in-house, or are you using a third-party advertising network?
  • Partnerships: Are you partnering with another popular website, and expecting a significant increase in web traffic? Will this traffic be mostly anonymous visitors, or mostly logged-in users? How much traffic does your partner website see?

Risk Management Goals

Is your website a critical component of your business? Are you currently unable to take regular backups, or unclear even what data needs to be backed up? Work through this section to define what is acceptable data loss, setting goals and requirements for your upcoming performance and scalability efforts.

  • High availability: How fault tolerant is your existing infrastructure? What happens if your primary database server fails? What happens if a web server fails?
  • Minimizing downtime: What is the most downtime your website has already experienced? What was the effect of this downtime? What are the consequences if your website is down for too long? What qualifies as downtime? What is your budget for building a fault tolerant infrastructure? How much downtime would be acceptable, and is it measured in seconds, minutes, hours, or days? How much downtime would be catastrophic to your business, and is it measured in seconds, minutes, hours, or days?
  • Fast data recovery: Where do you store your backups? How often are you taking backups? How many copies of backups do you retain? Have you ever tried restoring data from your backups? How long did it take? If something happened to your database, how long could you afford to spend recovering data from a backup?
  • Survival after catastrophic failures: Do you have a plan if a hurricane, earthquake, or explosion wipes out your data center? Do you keep a copy of your data at a completely separate physical location? If using an online backup solution, have you confirmed that their servers are actually in another data center? How long would it take you to build an entire new infrastructure?

Other Goals

What other needs are driving your performance efforts? Review the following goals, and try to come up with some more of your own.

  • Auditing current site performance: Do you currently not have a good idea of the performance of your website? Are you looking for ways to better understand how your site is currently performing, in order to understand what needs to be improved, if anything?
  • Solve specific known performance bottlenecks: Do you know exactly where the problems are with your website? Are you receiving complaints from website users, or from management? Can you duplicate the reported problems? Do you have a general idea of what is causing the problems? How can you measure the known performance bottlenecks?
  • Improve scalability: Are you expecting to outgrow your existing infrastructure? Do you know how much traffic your current infrastructure can handle? Do you have a budget to add additional servers to your network? Do you need to make do with the hardware you have?
  • Contributing back to Drupal: Have you solved some performance issues in ways that you think would be useful to other Drupal users? Would you like to be recognized for contributing code and documentation back to the Drupal project? Would you like to see your improvements merged into Drupal's core code so when you upgrade in the future you don't have to keep solving the same problems?

Section 2: Measuring Progress

Setting A Baseline

It is a common mistake to make performance-oriented modifications to a website before measuring existing site performance. Without a baseline, it can prove impossible to determine whether your changes have resulted in real performance improvements, or whether they have instead reduced performance. For this reason, the first thing you should do is set up proper monitoring of your website and quantify your current performance.
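
One way to quantify a baseline is to record page load times before making any change, and then reduce them to the average and maximum figures discussed in the checklist earlier in this chapter. The following is a minimal sketch, assuming the timings have already been written to a text file with one value in seconds per line. The function name summarize_times and the file format are our own inventions for illustration:

```shell
#!/bin/sh

# Summarize a file of page load times, one value in seconds per line.
# Hypothetical helper: the name and the file format are assumptions.
summarize_times() {
    awk '
        NF == 0 { next }    # skip blank lines
        {
            sum += $1
            n++
            if (n == 1 || $1 > max) max = $1
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d average=%.2f max=%.2f\n", n, sum / n, max
        }
    ' "$1"
}

# Example usage (times.txt is a hypothetical file name):
#   summarize_times times.txt
```

The samples themselves could be collected in many ways, for example by periodically appending the output of curl -o /dev/null -s -w '%{time_total}\n' for a given URL to the file. Note that this measures only the time to fetch a single URL, not the time a browser takes to load and render the full page. Run the same summary after each change and compare it against the baseline.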

What To Monitor

There are many useful measurements that can be regularly monitored on your website, and a large number of tools that can help with taking these measurements. Each website will have some unique monitoring needs. However, there are some basic measurements that almost all websites will want to regularly monitor. The following list will give you some ideas of what you should monitor. Though the list is numbered, the numbers do not indicate the level of importance of each item. Instead, we number items so we can provide specific examples when we discuss monitoring tools.

  1. The time it takes to load the front page when not logged in and with nothing cached by your browser.
  2. The time it takes to load the front page again, when not logged in but when you already have the CSS, JavaScript and images cached by your browser.
  3. The time it takes to load the front page when you're logged in.
  4. The time it takes to load each of your 25 most popular types of web pages.
  5. The time it takes to load the above pages from different areas of the world.
  6. The popularity of the various types of web pages on your website. (Example types of pages include the front page, forum pages, RSS feeds, and custom pages generated by modules.)
  7. Server resource utilization such as CPU, load average, free memory, cached memory, swap, disk IO, and network traffic.
  8. The number of pages being served by your web server(s).
  9. Your database, including the number of queries per second, the efficiency of your query cache, how much memory is being used, and how often temporary tables are being created.
  10. Database queries taking more than 1 second to complete.
  11. Database queries not using indexes.
  12. The number of searches being performed per hour.
  13. Memcache, including how much memory is being used, how many queries are being made per second, and your hit versus miss rates.

Monitoring Tools

You will need to use multiple tools to fully monitor your website. Some of these tools can run on your existing infrastructure, while other tools may need to live outside of your network.

ps

Ps displays information about all processes that are currently running on your server. The command line utility supports a large number of optional flags that control which processes are displayed, and what information is displayed about each process. Information that can be displayed includes CPU usage, memory usage, how much CPU time the process has used, and much more. Common invocations of ps include ps -ef and ps aux. Learn more about ps by typing man ps on most Unix servers.

top

Top provides an automatically updating view of the processes running on a server. It offers a quick summary of a server's health, showing CPU utilization, as well as memory and swap usage. Processes can be sorted in many ways, such as listing the processes that are consuming the most CPU, or the processes that are using the most memory.

vmstat

Vmstat offers a useful report on several areas of system health, including the number of processes waiting to run, memory usage, swap activity, CPU utilization, and disk IO. A common invocation of vmstat is vmstat 1 10. Learn more about vmstat by typing man vmstat on most Unix servers.

Sar

Sar is part of the Sysstat collection of Unix performance monitoring tools. Sar can be configured to collect regular comprehensive snapshots of a system's health without putting any noticeable load on the system. It is a very good idea to enable Sar on any server that you are managing, as the historical information this utility collects can prove invaluable when tuning a server, or when performing damage control on a failed server.

Cacti

Cacti is a PHP front-end for RRDTool, displaying useful graphs based on historical data collected from your servers. By default it tracks useful system information such as CPU and memory utilization. It can also be integrated with programs such as MySQL, Apache, and memcache, displaying useful historical graphs of their performance.

YSlow

YSlow is a Firefox add-on that enhances Firebug to analyze how quickly your web pages load, highlighting areas that can be improved. This tool is discussed in depth in chapter 13.

AWStats

AWStats is a log analyzer that can be used to create graphical reports from web server and proxy log files. When scaling a Drupal website, you can achieve better performance by disabling Drupal's core statistics module, and instead using AWStats to generate regular reports from Apache's own access logs.

devel module

The devel module is one of a suite of development oriented Drupal modules. Among its many useful features, it can display a list of all queries used to build each page served by a Drupal powered website, highlighting slow queries and queries that are run multiple times. The devel module is discussed in depth in chapter 6.

mysqlreport

Mysqlreport is a Perl script that generates reports based on numerous internal "status variables" maintained by MySQL. With this script, you can quickly interpret what these variables mean, helping you to tune your server for better performance. Mysqlreport is discussed in depth in chapter 22.

mysqlsla

Mysqlsla, the MySQL Statement Log Analyzer, is a Perl script that helps you analyze MySQL logs. This script will be discussed in depth in chapter 23, detailing how it can be used to review MySQL's slow query logs.

mytop

Mytop is a useful tool for monitoring a MySQL database from the command line. It offers a summary of database threads in a format similar to how top lists running server processes.

innotop

Innotop was originally written to monitor MySQL's InnoDB storage engine, but it has long since evolved into a very powerful tool for monitoring all aspects of MySQL. Inspired by mytop, it takes MySQL monitoring to a new level.

MySQL Enterprise Monitor

The MySQL Enterprise Monitor is a commercial offering by Sun Microsystems for monitoring one or more MySQL servers. The comprehensive tool provides useful charts and graphs, makes tuning suggestions, and can send alerts when your MySQL servers need attention.

Online Services

There are many online services that can help you with monitoring your website. It is beyond the scope of this book to list and review all of these services, but popular examples include Google Analytics, IndexTools, ClickTracks and Omniture. Other online services can help you to understand how quickly your web pages are loading from various locations around the world, including Keynote, Webmetrics, Alert Bot, and host-tracker.com.

Section 3: Backups

Why To Back Up

Hopefully you already understand the general importance of maintaining regular backups. For example, if a server fails and all data on that server is lost, you can create a new server just like the old by restoring a backup. If someone runs a bad query and accidentally deletes data from your database, you can restore the lost data from a backup. If you make a change to your website and later find that it was a buggy change, you can roll back to the previous version of your website from a backup. If you're setting up multiple web servers, you can build the second server from a backup of the first. When you need to test changes before deploying them on a live website, you can create a copy of your actual website on a development server by restoring a backup.

What To Back Up

Generally speaking, it is important to back up anything that you can't afford to lose and you can't easily recreate. For example, you will certainly want to make regular backups of your database. If you have written custom themes and modules, they too should be backed up. If you have written custom patches for Drupal unique to your website, back them up. Any customized configuration files on your servers should also be backed up. If your users upload files such as pictures or sounds, this data should also be backed up.

Backups are an inexpensive insurance policy for when things go wrong, as well as a useful tool for duplicating servers. When backups are combined with a revision control system, they can also be useful for reviewing changes over time, and for understanding how changes have affected your website. Data loss is often not immediately detected, in which case it is important to have multiple copies of backups.

The following list offers a suggestion of data that you should consider backing up. When deciding what from the following list you will be backing up, ask yourself, "what happens if I lose this data?"

Data to include in your backups

  • Database
  • Database configuration file(s)
  • Web server configuration file(s)
  • PHP configuration file(s)
  • User uploaded content
  • Custom modules and themes
  • Custom patches

What You May Not Want To Back Up

While it is possible to back up your entire server, including the underlying operating system, this is often not necessary. The underlying operating system can be re-installed on a new server with minimal fuss. Then, the various customized configuration changes can be restored from backups. Furthermore, backing up your entire server will require significantly more storage space. This becomes more and more of a concern as you add additional servers to your infrastructure. Finally, a backup of one server may not easily restore to another server if it has different hardware, such as different network cards or a hard drive of another size.

When backing up your database tables, it is possible to skip certain tables. For example, you don't have to back up Drupal 6's four search tables as they can be regenerated if they are lost. The many cache tables also do not have to be backed up. As the watchdog and access log tables are already automatically flushed after a certain amount of time, they are also good candidates for tables to skip if trying to minimize the size of your backups. If you decide to skip certain tables when making your backups, be aware that this can complicate the restoration process. If you are building a new server from backups, in addition to restoring your backup you will also have to manually create any tables that weren't included in your backup.
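
One possible way to skip tables is mysqldump's --ignore-table option, which takes a database.table pair and must be repeated for each table. The sketch below builds those flags from a list of table names. The function name ignore_table_flags is hypothetical, and the table names in the usage comment are the ones we would expect in a stock Drupal 6 database with no table prefix; confirm them against your own database with SHOW TABLES before relying on them:

```shell
#!/bin/sh

# Print one --ignore-table flag per line for each table name given.
# Usage: ignore_table_flags DATABASE TABLE [TABLE ...]
ignore_table_flags() {
    database=$1
    shift
    for table in "$@"; do
        printf '%s%s.%s\n' '--ignore-table=' "$database" "$table"
    done
}

# Example usage (database and table names are assumptions):
#
#   SKIP="cache cache_page search_index watchdog accesslog"
#
#   # 1. Dump the structure of every table, including skipped ones.
#   mysqldump --no-data drupal > structure.sql
#
#   # 2. Dump data only, leaving out the tables listed in SKIP.
#   mysqldump --no-create-info \
#       $(ignore_table_flags drupal $SKIP) drupal > data.sql
```

Dumping the structure of every table separately is a design choice rather than a requirement. It keeps the data backup small while still letting you recreate the skipped tables, empty, by restoring structure.sql first, which avoids having to create them by hand.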

Redundancy vs. Backups

You may have set up redundant systems, and expect this to take the place of backups. For example, you may have two databases with one replicating to the other. Or, your data may be stored on a high end RAID system, mirrored onto multiple physical drives. However, remember that you're not only trying to protect yourself from system failures. One of the most common reasons for data loss is human error. If you accidentally run a query that deletes half your users, this errant query will run on your database slave as well and delete your users in both places. Or, if you accidentally delete a directory containing user-contributed content, again this change will also be made on the mirrored drives. For this reason, it's important to not assume that redundancy replaces the need for regular backups.

When To Back Up

A single backup of the above data from all your servers is a good start. But most websites are constantly changing, with new content being posted, old content being updated, and new users signing up all the time. Any changes made between the time of your last backup and when something goes wrong will be lost. Thus, it is important to make regular backups.

In the first section of this chapter one of the discussed goals asked you to define how much data you can afford to lose. Can you afford to lose an hour of data? Can you afford to lose 24 hours of data? Can you afford to lose a week of data? Obviously you would prefer to not have any lost data, but at the end of the day it comes down to a question of practicality and budget. Set realistic goals for yourself, and then figure out how you can meet those goals. If you can afford to lose a week of data, obviously your backup strategy can be much simpler than someone who can't afford to lose more than an hour of data.

Also note that different types of data may change with different frequency. For example, your database is likely to be constantly changing, while your custom themes and modules change rarely. Thus, different data can be backed up at a different frequency.

Backup Schedules

Now that you've defined how much data you can afford to lose in the event of a catastrophic failure, it's time to set up a regular backup schedule that meets your requirements. Your backup schedule needs to take into account two significant questions:

  1. How often does the backed up data change?
  2. How much data can you afford to lose?

If the data being backed up never or very rarely changes, you can update your backup each time you make a change. If your data changes all the time, then you'll instead need to automate regular backups that happen at least as frequently as your needs dictate. For example, if you can only afford to lose 6 hours of data should your database fail, set up your backup scripts to back up your database once every 6 hours.
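
To make the 6 hour example concrete, the sketch below turns "how many hours of data can I afford to lose" into a crontab entry. The function name cron_line and the backup script path in the usage comment are hypothetical:

```shell
#!/bin/sh

# Print a crontab entry that runs JOB every HOURS hours, on the hour.
# Usage: cron_line HOURS JOB    (HOURS must be between 1 and 23)
cron_line() {
    hours=$1
    job=$2
    case $hours in
        ''|*[!0-9]*)
            echo "hours must be a whole number" >&2
            return 1
            ;;
    esac
    if [ "$hours" -lt 1 ] || [ "$hours" -gt 23 ]; then
        echo "hours must be between 1 and 23" >&2
        return 1
    fi
    printf '0 */%s * * * %s\n' "$hours" "$job"
}

# Example usage (the script path is an assumption):
#   cron_line 6 /usr/local/bin/db-backup.sh
# The printed line can then be added with "crontab -e".
```

Cron counts the step from hour 0 each day, so an interval that does not divide evenly into 24 produces one shorter gap before midnight. The gap between runs never exceeds the interval you asked for, which is what matters for bounding data loss.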

Examples

Tracking Multiple Text Database Backups With Git

The following script is a simple yet powerful example of how you could efficiently store multiple backups of your database within a revision control system. In this example, we are using 'git', however you could easily replace git with your favorite source control system. Note that git is designed for storing lots of small files, not for storing one large file, so it may not be the best choice of tools for maintaining backups of a growing database. Our use of the "--single-transaction" flag for mysqldump assumes that you are using MySQL's InnoDB storage engine.

To use this script, you should edit the configuration section as appropriate for your system. You then need to create an empty directory at the path defined by the script's BACKUP_DIRECTORY variable. Next, create a new git repository by moving into this directory and typing 'git init'. With the repository initialized, manually run the mysqldump command to generate the first copy of your database. Add this text backup to the repository using 'git add', and check it in using 'git commit -a'.

The steps described in the previous paragraph could have been automated, however my goal was to keep the script as simple as possible. Furthermore, you may end up deciding to use a different revision control system than 'git', in which case you will need to set things up differently.

The actual backup script follows:

#!/bin/sh

# Configuration:
BACKUP_DIRECTORY="/var/backup/mysql.git"
DATABASE="database_name"
DATABASE_USERNAME="username"
DATABASE_PASSWORD="password"
# End of configuration.

export PATH="/usr/bin:/usr/local/bin:$PATH"

# Work inside the git repository; stop if the directory is missing.
cd "$BACKUP_DIRECTORY" || exit 1

START=`date +'%m-%d-%Y %H:%M:%S'`

# Dump the database as text; stop rather than commit a failed dump.
mysqldump -u"$DATABASE_USERNAME" -p"$DATABASE_PASSWORD" \
           --single-transaction --add-drop-table \
           "$DATABASE" > "$DATABASE.sql" || exit 1

END=`date +'%m-%d-%Y %H:%M:%S'`
# Summarize the changes since the last commit, and note the dump size.
CHANGES=`git diff --stat`
SIZE=`ls -lh "$DATABASE.sql" | awk '{print $5}'`

# Commit the file that mysqldump wrote, logging the details above.
git commit -v -m "Started:  $START
Finished: $END
File size: $SIZE
$CHANGES" "$DATABASE.sql"

Each time you run the above script, it will generate a current backup of your database and check in the difference between this backup and the previous backup. The script should be called from a regular cron job, causing your database to be backed up every few hours or every day, depending on your needs.

Using 'git log', you can review the versions of your database that have been checked in, and you can see the information that is logged each time you make a backup:

Author: Jeremy Andrews
Date:   Sun Jul 20 15:14:09 2008 -0400

    Started:  07-20-2008 15:13:01
    Finished: 07-20-2008 15:14:02
    File size: 14M
     database.sql |   44 ++++++++++++++++++++++----------------------
     1 files changed, 22 insertions(+), 22 deletions(-)

There are many simple improvements you could make to increase the usefulness of this script, including:

  • Occasionally run 'git gc' to compress all the older copies of your database stored in your git repository.
  • Replace 'git' with your favorite source control system.
  • Push a copy of your repository to a remote server, so the backups don't live only on the same server as your database. It is important that you can access the backups if your database server fails.
  • Generate an email each time the backup is completed, sending a brief status report.
  • Redirect stdout and stderr to a log file so you can see any errors that happen when running the script from crontab.
  • Minimize the size of the changes between each backup by making two backups of your database. One backup should only include your table definition using the --no-data option to mysqldump, and one backup should only include your data using the --no-create-info option.
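
As an example of the logging improvement suggested above, the following sketch wraps any command so that its standard output and standard error are appended to a log file along with timestamps and the exit status. The function name run_logged and the paths in the usage comment are hypothetical:

```shell
#!/bin/sh

# Run a command, appending its stdout and stderr to a log file.
# Returns the exit status of the command that was run.
# Usage: run_logged LOG_FILE COMMAND [ARGS ...]
run_logged() {
    log_file=$1
    shift
    printf '%s starting: %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$log_file"
    status=0
    "$@" >> "$log_file" 2>&1 || status=$?
    printf '%s finished with exit status %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$status" >> "$log_file"
    return "$status"
}

# Example usage (both paths are assumptions):
#   run_logged /var/log/db-backup.log /usr/local/bin/db-backup.sh
```

Because the wrapper records the full command line, avoid passing passwords as arguments to the command being logged. The returned exit status can also be used to decide whether to send the status email mentioned in the list.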

Backing Up Your Website With Git

Git provides a very simple method for backing up your website. It offers much more than a backup, but that's all we're concerned about in this section. In preparation, first create an empty Git repository on your backup server. If you have multiple servers or web directories you wish to back up, you should create an empty Git repository for each. By using the "--bare" flag, we reduce the size of our backup as it won't maintain an uncompressed copy of the latest version of the files:

$ mkdir backup.git
$ cd backup.git
$ git --bare init
Initialized empty Git repository in /home/user/backup.git/

Next, on the web server that you are backing up, "initialize" a repository in your web directory. Add your website files to this repository, and then "push" it to the empty repository on the backup server. It is safe to initialize a Git repository on your live server and check files into it as this does not modify your files in any way. Instead, it creates a ".git" subdirectory where the local repository is stored. In this example, we'll assume that your backup server has an IP address of 10.10.10.10:

$ cd /var/www/html
$ git init
Initialized empty Git repository in .git/
$ git add .
$ git commit -a -m "Backup all files in website."
$ git remote add backup-server user@10.10.10.10:backup.git
$ git push backup-server master

Now, as you add new files to your web server, add them to your git repository by running "git add". Commit these new files and any changed files by running "git commit -a". And finally, push these updates to the backup server by running "git push backup-server master".

You will learn more about using Git in the next section of this chapter.

Testing Backups

Simply making backups of your data is only half of the job. It's also critical that you regularly validate your backups, ensuring that they are not corrupt and that they contain everything you need to rebuild your websites.

One way to test your backups is to restore them to your development server, building an up-to-date development environment. Doing this once is not enough: although it validates your general backup strategy, it does not regularly validate the integrity of each backup. You should instead update your development environment from backups on a regular schedule, such as once a week. The process can be automated through simple scripts.
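
Between full restores, a cheap first check can catch the most obvious failures, such as an empty or truncated dump. The sketch below assumes the dump was written by mysqldump with its default comments enabled, in which case the file ends with a line beginning "-- Dump completed". The function name check_dump is hypothetical, and passing this check is no substitute for actually restoring the backup:

```shell
#!/bin/sh

# Perform basic sanity checks on a mysqldump text backup.
# Returns 0 if the file looks complete, 1 otherwise.
# Usage: check_dump DUMP_FILE
check_dump() {
    dump_file=$1
    # The file must exist and contain data.
    if [ ! -s "$dump_file" ]; then
        echo "$dump_file is missing or empty" >&2
        return 1
    fi
    # mysqldump writes a completion marker at the very end of the
    # dump (unless comments were disabled, e.g. --skip-comments).
    if ! tail -n 5 "$dump_file" | grep -q '^-- Dump completed'; then
        echo "$dump_file has no completion marker" >&2
        return 1
    fi
    return 0
}

# Example usage (the path is an assumption):
#   check_dump /var/backup/mysql.git/database_name.sql
```

A check like this could run at the end of each backup job, with the weekly restore to your development server remaining the real test.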

Section 4: Staging Changes

Testing Changes

As you scale your website and its popularity grows, it becomes increasingly important to properly test all changes prior to updating your production website. At minimum, you should have a separate testing server which duplicates your production environment. Your development environment should use exactly the same operating system as your production servers, with the same extensions installed and updates applied. If you instead use, for example, CentOS on your production servers and Fedora on your development servers, you may find that code which works perfectly on your development server fails in production due to issues such as failed dependencies.

The more similar you make your development environment to your production environment, the more valid your testing will be. That said, very often while your production environment may be comprised of numerous servers, your development environment may be limited to a single server. In this case, you should do what you can to simulate your production infrastructure.

In this final section of chapter one, I offer best practices for testing changes and pushing these changes to your production servers.

Revision Control

It's often tempting to maintain your website one file at a time, manually copying individual files into place. Usually this involves first making a backup copy of the file you wish to change, over time resulting in dozens of old backups cluttering the directories of your production servers. Often this can also involve editing files directly on a production server and hoping for the best. Unfortunately, even the most trivial-seeming change can have unforeseen consequences. It can also quickly become unclear which files have been updated and which are still an older version. This can result in bug fixes never actually making it into production, or new and bigger bugs being created while trying to fix old bugs.

A simple yet extremely effective solution to this problem is to use a revision control system. Revision control is one of many phrases used to describe the management of multiple versions of the same information. Other popular phrases often used to describe this functionality include version control, source control, and source code management. There are a great many open source and proprietary revision control tools available to you. Popular examples include CVS, Subversion (SVN), Perforce, and Git.

For the example contained in this book, we have chosen to use Git, a fast and flexible distributed source control system originally designed by Linus Torvalds for managing the Linux kernel. Git was selected because of its distributed design, its growing popularity, its flexibility, its applicability to what we are trying to solve, and its free availability. However, this does not mean that you also need to use Git to manage your website. It is possible to apply the tips and best practices we explain here to your favorite source control system.

Tracking File Changes

The basic steps required for managing files with Git were briefly discussed in the previous section on backups. In this section, we build upon our previous examples, showing you how Git can offer you much more than a versioned backup of your website.

Managing Drupal Core With Git

In this first example you will learn how you can manage a website built from Drupal's core files. Start with an older version of Drupal, which you will manually patch. You will then use Git to easily upgrade your website to a newer version of Drupal. Start by checking out Drupal 6.2 from CVS:

$ cvs -z6 -d:pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal \
  co -d html -r DRUPAL-6-2 drupal

Next, create a Git repository in your website's directory, and in it store all the files you checked out from CVS, including the CVS files themselves. Then create a "drupal-core" branch where you'll keep an unmodified copy of Drupal for use in upgrading the site later. Finally, tag your release with the same tag found in CVS for simplified reference, and switch back to the "master" branch:

$ cd html/
$ git init
Initialized empty Git repository in .git/
$ git add .
$ git commit -a -m "Drupal 6.2"
$ git checkout -b drupal-core
Switched to a new branch "drupal-core"
$ git tag DRUPAL-6-2
$ git checkout master
Switched to branch "master"

Now use your web browser to configure your Drupal installation, which will create and configure settings.php. Once completed, add your new settings.php file to the master branch of your Git repository. Throughout these examples, you will be using many branches for merging and website development, but the "master" branch will always contain your actual website:

$ git add sites/default/settings.php
$ git commit -a -m "Add configured settings.php"

At this point you are ready to patch your new Drupal website. For this example, you will apply a very simple patch to bootstrap.inc that is intentionally a slightly different version of a change made to the file in Drupal 6.3. You do this to cause a conflict when you upgrade the website to Drupal 6.3:

$ cat bootstrap.inc.patch
index 44cd0d7..d45cf5d 100644
--- a/includes/bootstrap.inc
+++ b/includes/bootstrap.inc
@@ -283,0 +284,7 @@ function conf_init() {
+  // Do not use the placeholder url from default.settings.php.
+  if (isset($db_url)) {
+    if ($db_url == 'mysql://username:password@localhost/databasename') {
+      $db_url = '';
+    }
+  }
+

Manually apply this patch to your master branch, checking in the modified bootstrap.inc include file:

$ patch -p1 < bootstrap.inc.patch
$ git commit -a -m "custom bootstrap patch"    

Now, upgrade your website to Drupal 6.3. First, update the version in your "drupal-core" branch from CVS. You update the "drupal-core" branch so that CVS won't run into any conflicts. If you instead update your "master" branch, CVS will corrupt the bootstrap.inc include file due to our patch. We will later rely on Git to more intelligently help us resolve the merge conflict:

$ git checkout drupal-core
Switched to branch "drupal-core"
$ cvs update -r DRUPAL-6-3

With your "drupal-core" branch updated to Drupal 6.3, commit the updated files to your Git repository and tag them for possible future reference:

$ git commit -a -m "Drupal 6.3"
$ git tag DRUPAL-6-3

Now use this updated "drupal-core" branch to upgrade your website. You will perform the merge in a temporary branch, though it would be just as easy to perform the merge in the "master" branch. Either way, Git provides easy mechanisms for undoing a merge if you make a mistake or change your mind. In this case, you should test the merge in your temporary branch before merging it into your official "master" branch:

$ git checkout master -b temporary
Switched to a new branch "temporary"
$ git merge drupal-core
Auto-merged includes/bootstrap.inc
CONFLICT (content): Merge conflict in includes/bootstrap.inc
Automatic merge failed; fix conflicts and then commit the result.

Git was able to automatically merge all files except for includes/bootstrap.inc, where your custom patch modified the same lines that were changed in Drupal 6.3. Resolve this conflict using a graphical merge tool, verify that the changes look sane, then commit the merged result:

$ git mergetool
$ git diff --color master includes/bootstrap.inc
$ git commit -m "Upgrade to Drupal 6.3"

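If you make a mistake during the merge, you can safely throw away the temporary branch, recreate it, and try the above steps again. If the merge is still in progress, first abandon it with "git merge --abort". Then switch back to the master branch and force the deletion with "git branch -D temporary", as Git will not delete the branch you currently have checked out, and the lowercase "-d" option refuses to delete a branch containing unmerged work. Once you've confirmed that the website is working as expected, merge the temporary branch into your master branch, and delete the temporary branch: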

$ git checkout master
Switched to branch "master"
$ git merge temporary
$ git branch -d temporary
Deleted branch temporary.
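The rehearse-then-merge workflow above can also be wrapped in a small helper that reports whether a merge would conflict, without leaving anything behind. This is a sketch under stated assumptions rather than part of the original example: the function name rehearse_merge and the branch name "rehearsal" are hypothetical, and it assumes you start on a branch (not a detached HEAD) with a clean working directory:

```shell
# Hypothetical helper: try merging BRANCH on a throwaway branch.
# Returns 0 if the merge is clean, 1 if it fails (typically a conflict),
# and 2 if the rehearsal could not be set up.
rehearse_merge() {
  branch=$1
  # Remember the branch we started on so we can return to it.
  start=$(git symbolic-ref --short HEAD) || return 2
  git checkout -q -b rehearsal || return 2
  if git merge -q --no-edit "$branch" >/dev/null 2>&1; then
    rc=0
  else
    # Discard the half-finished merge, restoring the pre-merge state.
    git merge --abort 2>/dev/null
    rc=1
  fi
  # Return to the starting branch and throw the rehearsal away.
  git checkout -q "$start"
  git branch -q -D rehearsal
  return $rc
}
```

Running "rehearse_merge drupal-core" from your master branch would then tell you in advance whether the upgrade needs manual conflict resolution. It does not replace testing the merged website itself.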

Managing Contributed Themes And Modules With Git

Managing contributed themes and modules is best done by using another branch. It is helpful to create one branch for each remote source for the files you use to build your website. You can use a single branch for all your contributed modules and themes, as they all come from Drupal's "contrib" CVS repository.

In this example, we'll add the devel module to our website, checking it out of CVS:

$ git checkout master -b drupal-contrib
Switched to a new branch "drupal-contrib"
$ cd sites/default
$ mkdir modules
$ cvs -z6 \
 -d:pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal-contrib \
 checkout -r DRUPAL-6--1-10 -d modules/devel contributions/modules/devel
$ git add modules
$ git commit -a -m "Devel module version 6.x-1.10"

You can repeat this process to check out additional contributed modules or themes from CVS, checking them in to your local "drupal-contrib" Git branch. Once you've checked out all the modules and themes you need for your website, merge them into your master branch:

$ git checkout master
Switched to branch "master"
$ git merge drupal-contrib

When you need to upgrade any of your contributed modules or themes, follow the same steps described above for updating Drupal core. Switch to the "drupal-contrib" branch to check out the updated version from CVS. Commit the changes to your "drupal-contrib" branch, then use Git to merge the changed files into your "master" branch.

The important thing is to keep the files in your 'drupal-contrib' branch unmodified so that CVS can update the files without any conflicts. If you need to modify any of the contributed modules or themes, do it in the 'master' branch, or in another development branch. If your changes conflict with future upgrades, you can easily resolve these conflicts in the same way that you did in our previous example with a conflict in bootstrap.inc.

Managing And Upgrading An Existing Website With Git

The previous examples assumed that you were creating a new website with Drupal. In this example, we will show you how Git can also help you to manage and upgrade an existing website, even if you've not been using revision control up to this point.

The first step is to create a new Git repository within your website directory, and to add your existing website files to this new repository. This first step is identical to the example provided in the previous section for backing up your website files:

$ cd /var/www/html
$ git init
Initialized empty Git repository in .git/
$ git add .
$ git commit -a -m "Initial commit."

When you're ready to upgrade your website, check out the version of Drupal that you wish to upgrade your website to, creating a new Git repository with this new version of Drupal. In this example, you'll upgrade your website to Drupal 6.3:

$ cvs -z6 -d:pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal \
  co -r DRUPAL-6-3 drupal
$ cd drupal
$ git init
Initialized empty Git repository in .git/
$ git add .
$ git commit -a -m "Drupal 6.3"

In previous examples, you've always kept all your files in different branches of the same Git repository. In this example, you take advantage of Git's distributed design to merge code from two different repositories. To upgrade your website to Drupal 6.3, switch back to your website repository and create a "drupal-core" branch. Now, "pull" the updated version of Drupal from the second repository you just created. Note that because the two repositories share no common history, Git 2.9 and later will refuse this pull unless you add the --allow-unrelated-histories option. Finally, manually resolve any conflicts that Git is unable to automatically merge, and merge the "drupal-core" branch into your "master" branch:

$ cd ../html
$ git checkout -b drupal-core
$ git pull ../drupal
$ git mergetool
$ git commit -a -m "Drupal 6.3, resolved conflicts."
$ git checkout master
$ git merge drupal-core

At this point, you can either continue tracking Drupal core in the "drupal-core" branch of your website repository, or you can instead continue tracking Drupal core in the external "drupal" repository, deleting your local "drupal-core" branch until you need it again. There is no technical reason to favor one solution over the other, so it is left to you to decide which method works best for you.

Apply this same technique when you wish to upgrade contributed themes or modules. Once again, check out the new version of the module and create a new repository with it. Then, merge this repository into a new branch of your website repository. Once you are happy that the upgrade has gone smoothly, merge the update into your "master" branch.

Finally, if multiple people are involved in the ongoing development of your website, each developer can use Git to "clone" your repository and implement their own custom changes. When they finish, you can then "pull" their changes back into your repository. Git excels at exactly this kind of distributed development, providing powerful tools that allow you to pick and choose which changes to merge from another repository, and to undo commits if they later prove to be problematic. Extensive documentation is available online to help you master Git.

Tracking Database Schema Changes

As your website evolves, you will find that you sometimes need to update your database schema. Fortunately, Drupal provides a method of tracking and automating such schema changes. When developing custom modules for Drupal, you can define various "hooks" in the .install file. For example, the _install hook is called when your custom module is first enabled, and should be used to create custom database tables. If you need to modify your schema in the future, you define an _update_N hook in your module's .install file, then run update.php on your website. Drupal will track which updates have already been installed on your website, and will alert you as new updates become available. As you update your .install file, be sure to commit your changes to your Git repository. Read more about .install files, _install hooks and _update_N hooks in the official Drupal API documentation at the following URLs:

Staging Configuration Changes From Development Servers

It is best to test all configuration changes on your development server, before attempting to make changes on your production server. Once you have made all your desired changes, you have to decide the best method for duplicating these changes in production. Many tediously take notes as they make changes on their development servers, then manually repeat the same steps on their production servers.

It is far preferable to automate some of this process, allowing you to test for consistency and to track all configuration changes in the database. What follows is a recipe for partially automating and tracking this process using Git. It does require a solid knowledge of SQL.

To begin, first configure an exact copy of your production website on a development server by restoring an up-to-date backup. Do not attempt to work from an outdated backup or the following steps may have unexpected results.

Next, create an empty sub-directory and capture a baseline database backup from your new development server. This will contain the same exact data as is in the backup you used to create this development server, however the backup will be formatted differently as you will be using different mysqldump options. In most cases, it will make sense to use the --no-create-info option as you will not be adding new tables or altering table definitions. It can also be very helpful to use the --skip-extended-insert option so that each change is on its own line, simplifying patch generation. Finally, the --complete-insert option can prove helpful for generating database queries to use when data in your database is being updated rather than simply inserted.

Once you have created your baseline snapshot with the appropriate mysqldump options, initialize a new git repository, and commit your database snapshot into your new empty repository:

$ mkdir snapshot
$ cd snapshot/
$ mysqldump -uUSERNAME -p --no-create-info --skip-extended-insert \
  --complete-insert DATABASE > snapshot.sql
$ git init
$ git add snapshot.sql
$ git commit -m "initial database snapshot"

Now, log in to your development website and make the necessary configuration changes. Do not attempt to make too many changes at one time, or it may prove too difficult to later merge these changes into your production website.

In our example, we will visit the Site configuration section in the Drupal Administration pages and make the following changes:

  • On the date and time page, disable user-configurable time zones.
  • On the site information page, configure a new slogan.

You're now ready to extract the changes you've made on your development server, preparing to push them into production. First, get a new database snapshot using the identical mysqldump flags that you used previously. Now, use a handy Git feature to commit only the relevant changes into your temporary development repository. Finally, use Git to generate a patch from this commit.

To commit only the relevant changes, you will use the git add --patch command. It splits your changes into contiguous groups of changed lines, referring to each as a "hunk", and asks you for each whether or not you wish to "stage this hunk". Because the dump is ordered by table, each hunk generally corresponds to the changes in a single table. In this example, you will answer "n" to all changes affecting the cache* tables, the sessions and users tables, and the watchdog table. You will only answer "y" to the changes affecting the variable table. You do not stage the changes for the many cache tables because these will be automatically generated on your production server as needed. You do not stage the changes to the sessions or users tables, because these are specific to your current session on your development server and unrelated to your configuration changes. You also do not stage the changes to the watchdog table as this is only internal logging information and not relevant to updating the configuration of your website:

$ mysqldump -uUSERNAME -p --no-create-info --skip-extended-insert \
  --complete-insert DATABASE > snapshot.sql
$ git add --patch snapshot.sql
$ git commit -m "example configuration changes"
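If answering the interactive prompts becomes tedious, one possible complement is to filter each dump down to the tables you care about before committing it, so that only relevant rows ever reach the repository. The following is a minimal sketch and not part of the recipe above. The function name filter_tables is hypothetical, and it relies on the --skip-extended-insert option placing every row on its own line beginning with "INSERT INTO":

```shell
# Hypothetical helper: keep only the INSERT statements for the named tables.
# Reads a mysqldump snapshot on stdin and writes the filtered dump to stdout.
# Table names are used as regular expression fragments, so plain names such
# as "variable" or "system" are assumed.
filter_tables() {
  # Join the table names into an alternation such as "variable|system".
  pattern=$(printf '%s\n' "$@" | paste -sd'|' -)
  # Each row is a single line starting with INSERT INTO `table`.
  grep -E "^INSERT INTO \`($pattern)\` "
}
```

For example, "filter_tables variable < snapshot.sql > variable.sql" would keep only the rows of the variable table. If you use this approach, apply the same filter to both the baseline and the updated snapshot so that the two remain comparable. Note that grep returns a non-zero status when no rows match.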

You can now generate a patch from your partial commit. First, use git log to find the previous commit against which a patch will be generated. In our example this is the initial database snapshot with an ID of 908f027ba0077baad4b7c52ebbe986fb89b40f41. Second, call git format-patch to generate the actual patch, passing in enough unique characters of the commit ID:

$ git log
 commit 968fe8271ed7ff08fa46d789371b626b80c46ac6
 Author: Jeremy Andrews 
 Date:   Fri Aug 22 16:20:54 2008 -0700
 
     example configuration changes 

 commit 908f027ba0077baad4b7c52ebbe986fb89b40f41
 Author: Jeremy Andrews 
 Date:   Fri Aug 22 16:06:22 2008 -0700

     initial database snapshot
$ git format-patch 908f02
0001-example-configuration-changes.patch

Next, use this automatically generated patch file to create an appropriate _update hook for a custom .install file. This is done by first opening the patch file with a text editor. Reviewing the patch, note that any pre-existing configuration options which you have updated involve two lines in the patch, one starting with a "-", and one starting with a "+". All lines starting with a "-" are being removed from your database, while all lines starting with a "+" are being added to your database. On our example website the site slogan was previously defined, so in our patch file we see a "-" line removing the old slogan, and a "+" line adding the new slogan:

-INSERT INTO `variable` (`name`, `value`) VALUES \
  ('site_slogan','s:18:\"This is my slogan.\";');
+INSERT INTO `variable` (`name`, `value`) VALUES \
  ('site_slogan','s:26:\"This is my updated slogan.\";');

Using our knowledge of SQL, we manually convert this into a single update as follows:

UPDATE `variable` SET `value` = \
  's:26:\"This is my updated slogan.\";' WHERE `name` = 'site_slogan';
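Note that the "s:26:" prefix in the value above is part of PHP's serialization format: it records the length in bytes of the string that follows. If you edit a serialized value by hand while building your query, the prefix must be recalculated or Drupal will be unable to unserialize the variable. As a small sketch, the following hypothetical helper (the name php_serialize_string is an assumption for illustration) prints the serialized form of a plain string:

```shell
# Hypothetical helper: print the PHP-serialized form of a plain string,
# in the format s:<length in bytes>:"<value>";
php_serialize_string() {
  # wc -c counts bytes, which is what PHP's serialize() records.
  len=$(printf '%s' "$1" | wc -c | tr -d ' ')
  printf 's:%s:"%s";\n' "$len" "$1"
}
```

Calling php_serialize_string 'This is my updated slogan.' prints s:26:"This is my updated slogan."; which matches the value found in the patch. Remember that double quotes still need to be escaped when the result is embedded in a quoted SQL or PHP string.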

Our other change was to disable user-configurable time zones. As this setting had never been changed on our website before, we find only a single relevant line in our patch starting with a "+", and none starting with a "-":

+INSERT INTO `variable` (`name`, `value`) VALUES \
  ('configurable_timezones','s:1:\"0\";');

Finally, we use the queries we collected above and create a new _update_N hook in a custom module used for pushing database updates to our website. If you are not already using a custom module, you can create an empty custom.module file, a proper custom.info file, and a custom.install file. In the custom.install file, you will add a new _update_N hook. Refer to the links provided at the beginning of this subsection for a more in depth description of how these Drupal hooks work. In our example, we add the following function to our custom.install file. In your own usage, be sure to increment N in your new _update_N hook:

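function custom_update_6001() {
  $ret = array();
  // The slogan variable already exists, so UPDATE it. Long queries are
  // split with string concatenation, as a trailing backslash inside a PHP
  // string would end up in the SQL. Wrapping the table name in curly
  // braces lets Drupal apply any configured table prefix.
  $ret[] = update_sql("UPDATE {variable} SET value = "
    . "'s:26:\"This is my updated slogan.\";' "
    . "WHERE name = 'site_slogan'");
  // The time zone variable does not exist yet, so INSERT it.
  $ret[] = update_sql("INSERT INTO {variable} (name, value) "
    . "VALUES ('configurable_timezones', 's:1:\"0\";')");
  return $ret;
}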

You should commit the changes you have made to your custom module files into your website source code repository. You can then push these changes to your production website as explained below. Note that it is highly recommended that you first push these changes to a staging server, testing the update process and verifying that you have properly written your update hook. To have your actual updates performed on your staging and production servers, you will need to point your browser to yoursite/update.php and follow the directions.

The same principles that have been documented in this simple example can be applied to more complex configuration changes. You are not limited to just calling UPDATE and INSERT in your _update_N hooks. You can also call DELETE, CREATE, ALTER, and any other appropriate SQL command. When making more complex configuration changes, you should dump your database regularly without actually committing each individual change. After each database dump, you can use git diff --color to view how your changes are affecting the database. The more you do this, and the more familiar you get with how Drupal works under the hood, the quicker the process will become.

There has been much discussion about how these processes can be further automated in Drupal 7 and beyond. There are also existing projects attempting to further automate the process for earlier versions of Drupal, such as the Database Scripts project found at http://drupal.org/project/dbscripts.

Pushing Changes To Production

In previous examples, you've learned how you can use Git to manage your website, simplifying many processes including upgrading to a newer release of Drupal, and making configuration changes to your website. This final section discusses using Git to push changes to your production server. In an earlier example dealing with backups, we configured a Git repository on a backup server with the IP address 10.10.10.10. We will use this previously configured backup server again in this example.

At this point, you have updated your website to Drupal 6.3, and merged all of your changes into the master branch of your Git repository. You have tested all your changes, and are now ready to push them to your live web server. You should first tag your release for easy reference in the future. As you're working in a different repository than you used in the backup example, you need to configure the remote backup server. Then, push your current code to the remote server:

$ git tag RELEASE-2008-07-002
$ git remote add backup-server user@10.10.10.10:backup.git
$ git push backup-server master

This process is greatly simplified if only one person (or one Git repository) is pushing changes to the backup server. This one person can be responsible for merging together everyone else's work, and testing all the changes. Once the code is pushed to the backup server, it is available to be pulled to your website. When using this workflow, it's important that you don't edit files directly on your web server, but instead that you always pull changes to files via your Git repository. On the production web server:

$ git pull user@10.10.10.10:backup.git master

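If for any reason you want to revert to an earlier version of your website, this can be easily done using tags. We'll assume that your previous release was tagged as "RELEASE-2008-07-001". We use the "--hard" option so that Git resets both its index and the files in your working directory to the tagged release, discarding any uncommitted changes: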

$ git reset --hard RELEASE-2008-07-001
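Because a hard reset discards uncommitted changes, it is worth guarding against a mistyped tag name before running it on a production server. The following is a minimal sketch of such a guard, with the hypothetical function name rollback_to. It only confirms that the named tag exists before resetting to it:

```shell
# Hypothetical helper: reset the working directory to an existing tag.
# Returns 1 without touching any files if the tag does not exist.
rollback_to() {
  tag=$1
  # Look the name up under refs/tags so that a branch name or commit ID
  # is not accepted by mistake.
  if ! git rev-parse -q --verify "refs/tags/$tag" >/dev/null; then
    echo "unknown tag: $tag" >&2
    return 1
  fi
  # Reset both the index and the working directory to the tagged release.
  git reset -q --hard "$tag"
}
```

Running "rollback_to RELEASE-2008-07-001" then behaves like the command above, while a typo in the tag name produces an error instead of an unexpected reset.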

You can now fix whatever problems you ran into by making changes to your local repository. Once things are fixed and tested, add a new tag and again push your changes to the backup server. Finally, pull these changes to your production server.

With this strategy, you always know exactly what version of your website is currently being used in production. It also becomes possible to quickly back out any changes if they prove to be problematic. Finally, if you have multiple web servers, it is now trivial to keep them all in sync by pulling files from the same remote Git repository.

Chapter 2: Drupal Infrastructure

This chapter will provide an overview of what is coming up later in the book. It will talk about cheap $5/month web hosts, versus slightly more powerful Virtual Private Servers, versus dedicated servers and server farms. It will collect together network diagrams for the various configurations, and point to later chapters where the various features are more fully explained.

  1. Bargain Basement Hosting
    1. Advantages
    2. Squeezing Water From A Rock
    3. Development and Testing
    4. Outgrowing Your Host
    5. Diagram
  2. Virtual Private Servers
    1. Advantages
    2. What Is Virtualization?
    3. Competing For Resources
    4. Outgrowing Your Host
    5. Diagram
  3. Multiple Installations versus Multi-site Installations
    1. Advantages
    2. Security Considerations
    3. Diagrams
  4. Dedicated Hosting
    1. Single Server
    2. Multiple Servers
    3. Sharing Files And File Systems
    4. Load Balancers
    5. High Availability
    6. Scaling Up vs. Scaling Out
    7. Caching
    8. Network Diagrams

Chapter 3: Performance Configuration

This chapter introduces Drupal's built-in performance features. It explains how Drupal's built-in page cache works, and details how it can be configured. The chapter also discusses Drupal's built-in CSS and JS aggregation and compression. The importance of regularly purging Drupal's logs will be discussed. And finally, the chapter will explore Drupal's throttle module.

Section 1: Performance Configuration

Page Cache

There are many things you can do to improve the performance and scalability of a Drupal powered website. Before adding or upgrading servers, applying performance oriented patches, or any of the many other topics of varying complexity that will be discussed in this book, you should first enable all of Drupal's relevant built-in performance options. Find Drupal's performance configuration options by navigating to the Performance page in the Site Configuration section of your website's administration pages.

When the page cache is enabled, Drupal will save a fully rendered copy of each page accessed by anonymous visitors in the cache_page database table. When the same page is subsequently visited by the same or another anonymous user, the pre-rendered, cached copy is quickly and efficiently served directly out of the cache_page table. As most public web pages see significantly more anonymous traffic than logged in traffic, enabling the page cache generally results in a very significant performance improvement.

Drupal's page cache only caches pages accessed by anonymous visitors utilizing the HTTP GET method.

Caching Mode

When enabling Drupal's page cache, you can select normal mode or aggressive mode. You can also completely disable the page cache. It should be noted that the page cache is not Drupal's only cache. Disabling the page cache does not affect Drupal's other caches, such as its menu cache, form cache, or filter cache. Of all of Drupal's caches, the page cache is one of only two caches that can be manually disabled. The other is the block cache, discussed below.

The different cache levels are defined as constants in the bootstrap.inc include file. Though CACHE_DISABLED, CACHE_NORMAL, and CACHE_AGGRESSIVE are all defined, only CACHE_DISABLED and CACHE_AGGRESSIVE are directly referenced in the core Drupal code. This is because whether you have normal caching or aggressive caching enabled, the same anonymous page content is cached. We will discuss the differences between these two caching modes more thoroughly below. For now, simply note that when in aggressive caching mode, Drupal does not call the _boot() and _exit() hooks in any modules.

When page caching is enabled (normal or aggressive), the first time a page is generated for an anonymous visitor the resulting output is stored in the cache_page database table. This is the result of the last line of index.php, where there is a call to the function drupal_page_footer() in common.inc. This function calls page_set_cache() in the same file, which verifies that the current page is being served to an anonymous visitor using the HTTP GET method, and that there haven't been any Drupal messages set in the current session. If these three conditions are all true, the function calls PHP's built-in ob_get_contents() function to retrieve from PHP's buffers the page that Drupal has generated. This output is optionally compressed, as described below, then the function calls PHP's built-in ob_end_flush() function which tells PHP to flush its page buffer and send the generated page to the remote web browser. Finally, page_set_cache() calls the Drupal function cache_set(), storing the anonymous page in the cache_page database table.

The next time this page is visited by the same or another anonymous visitor, the cached copy that was previously generated and stored is retrieved directly from the cache_page database table, bypassing the need to regenerate the page. Logic for actually retrieving a cached page from the database lives in Drupal's bootstrap.inc file. The process starts in the first couple of lines of index.php, with a call to the drupal_bootstrap() function defined in bootstrap.inc. The bootstrap function defines a series of phases which are called one by one. The first phase initializes Drupal's configuration array, reading settings.php. In the second phase it's possible to define custom caching functions, making it possible to do things like using memcached for caching Drupal data. The third phase initializes the database. The fourth phase loads the session data, typically from the database. And finally, the fifth phase makes a call to page_get_cache() which loads the cached page from the cache_page database table. If in normal caching mode, the fifth phase executes the _boot hook in all modules defining it, then displays the page to the anonymous visitor. Finally, the _exit hook is called in all modules that define it, and Drupal exits.

Though the above logic may already sound complicated, it all happens very quickly, and allows Drupal to avoid loading and running a significant amount of code.

As noted above, when switching the cache from normal mode to aggressive mode, Drupal no longer calls the _boot and _exit hooks during the fifth bootstrap phase. This has several performance and scalability advantages. First, it means that the modules defining these hooks do not need to be loaded into memory.

Minimum Cache Lifetime

Page Compression

Block Cache

Bandwidth Optimizations

Optimizing CSS Files

Optimizing JavaScript Files

Section 2: Drupal Logs

Watchdog Logs

The Access Log

Section 3: The Throttle Module

Background

Configuration

Modules

Blocks

Custom Integration

Why The Throttle Was Removed From Drupal 7

Chapter 4: Too Many Modules

This chapter takes an in depth look at Drupal's modular design. It explores the concept behind Drupal's “hooks”, using the nodeapi as an example. It also looks at Drupal's menu system. The chapter then puts all of this together by tracing what happens when you enable a single Drupal module. Finally, it discusses the temptation to enable hundreds of contributed modules.

  1. Modules and Hooks
    1. Drupal modules
    2. Adding Features With Hooks
    3. Example: the nodeapi
  2. Menus
    1. Defining Pages
  3. Enabling Modules
    1. Memory Limits
    2. .install Files
    3. Drupal 7 Registry Preview
    4. All You Can Eat?

Chapter 5: Caching Layer

This chapter dives into Drupal's code, taking a look at the underlying caching layer. It will begin with an accessible, high-level description before it dives into the actual implementation. Finally, it will teach module developers how to better use Drupal's built-in caching layers.

  1. Understanding Drupal's Caching Layer
    1. Overview
    2. Variables
    3. The Many Cache Tables
  2. Developing With Drupal's Caching Layer
    1. Drupal's Cache API
    2. Caching With Custom Modules
    3. Sessions

Chapter 6: The devel Module

This chapter will take a look at the contributed devel module, explaining its key importance in performance tuning a Drupal website. It will discuss the many configuration options, and explain how the module can be used to profile page loads.

  1. More Than A Development Tool
    1. Visualizing Slow Queries
    2. Timing Page Creation
    3. Page Elements Versus The Database
  2. Configuration
  3. Profiling Database Queries
    1. Identifying Slow Queries
    2. Identifying Duplicate Queries
    3. Common Queries and What They Mean

Chapter 7: To Patch Or Not To Patch

Drupal offers considerable performance and scalability without modifying the code in any way. However, much more performance can be obtained by patching the core code. This chapter weighs the pros and cons of patching Drupal, and the impact this has on keeping up to date with security patches and upgrading to new releases.

  1. The Case For Patching
    1. Optimal Performance
    2. Community Patchsets
    3. Backports
    4. Hitting Modularity Limitations
  2. The Case For Not Patching
    1. Avoiding The Unknown and Under Tested
    2. Keeping Up With Security Updates
    3. Upgrading To New Releases