Archiving a WordPress Site

I was just about to throw out my old Swedish blog to make room for this one when I realized that it would be nice to keep it online, but served as static files. With HTTrack, you can download a whole site as a static local copy.

The plan

The plan has always been to improve my online presence, and most of all my personal brand online: keeping it up to date, putting a lot of thought into it and being proud of what I communicate.

Since I wanted to keep my old WordPress-based blog online and serve it as static files, my first thought was to look for one of those thousands of WordPress plugins. I searched for one that would export the blog to a format like Markdown, or perhaps even static HTML files. There are many plugins out there, varying in quality, and since the content needed to stay intact during the archiving, I wasn't able to find any that I would trust.

Then I remembered that I had once downloaded an entire site for a client (ages ago and purely for backup purposes). HTTrack would fit perfectly!

Backing it up (private version)

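So, I started off by simply dumping the database (a bare -p makes mysqldump prompt for the password; with a space after it, the next word would be read as the database name):

mysqldump -u user -p DB_NAME > DB_BACKUP.sql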

I then zipped the dump together with the wp-content folder and stored the archive in my general backup store (Amazon S3).
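For the curious, the bundling step could look roughly like the sketch below. It is not the exact script I ran: the function name bundle_backup, the archive name, the paths and the bucket are all placeholders made up for illustration, and the upload line assumes the AWS CLI is installed.

```shell
#!/bin/sh
# Sketch: bundle a database dump and wp-content into one dated archive.
# bundle_backup, the archive name and all paths are hypothetical placeholders.

# Usage: bundle_backup SITE_DIR DUMP_FILE OUT_DIR
# Prints the path of the archive it created.
bundle_backup() {
    # Resolve everything to absolute paths so the -C switches below
    # cannot interfere with each other.
    site_dir=$(cd "$1" && pwd) || return 1
    dump_dir=$(cd "$(dirname "$2")" && pwd) || return 1
    dump_name=$(basename "$2")
    out_dir=$(cd "$3" && pwd) || return 1
    archive="$out_dir/blog-backup-$(date +%Y%m%d).tar.gz"

    # -C changes directory before adding the next entry, so the archive
    # holds relative paths only (DB_BACKUP.sql and wp-content/...).
    tar -czf "$archive" \
        -C "$dump_dir" "$dump_name" \
        -C "$site_dir" wp-content || return 1
    echo "$archive"
}

# Example run, commented out because it needs mysqldump and the AWS CLI:
#   mysqldump -u user -p DB_NAME > /tmp/DB_BACKUP.sql
#   archive=$(bundle_backup /var/www/blog /tmp/DB_BACKUP.sql /tmp)
#   aws s3 cp "$archive" s3://my-backup-bucket/
```

Keeping the dump and the uploads in a single dated file means one object per backup in the bucket, which makes it easy to see what you have.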


Download WordPress Blog through HTTrack

Nowadays I use Node.js for this blog. I might release the source as a simple blog engine for you Node.js peeps out there.

Anyway, making the old WordPress-based blog static by downloading it and serving it through the Connect.js static middleware would fit perfectly!

HTTrack is an excellent tool for downloading any site to a local static copy.

So, here we go:

sudo port install httrack

Yes, I run MacPorts on one of my instances... Maybe there is a Homebrew-based one?

Now, launch httrack:

httrack

Follow the simple wizard and voilà: you've downloaded your site, and it's ready to be archived and served without WordPress or MySQL.
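If you would rather skip the wizard, httrack also takes everything on the command line. Here is a minimal sketch, assuming a placeholder URL and output folder (neither is from my actual setup). The function name mirror_site and the HTTRACK override variable are made up for this example. The +host/* filter is there to keep the crawl on the blog's own host.

```shell
#!/bin/sh
# Sketch: mirror a site non-interactively with httrack.
# mirror_site and the HTTRACK variable are hypothetical helpers.

# Usage: mirror_site URL OUT_DIR
mirror_site() {
    url=$1
    out_dir=$2

    # Strip the scheme and the path to get the host,
    # e.g. http://blog.example.com/foo -> blog.example.com
    host=${url#*://}
    host=${host%%/*}

    # -O sets the output path, the +pattern filter limits the crawl to
    # this host, and -v logs progress on screen.
    # Set HTTRACK=echo to preview the command without running it.
    "${HTTRACK:-httrack}" "$url" -O "$out_dir" "+$host/*" -v
}

# Example (placeholder URL and folder):
#   mirror_site "http://blog.example.com/" "$HOME/Sites/old-blog"
```

Running it once with HTTRACK=echo is a cheap way to check the filter before kicking off a long crawl.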

View the result from downloading the whole instance of my old Swedish blog.

Update: recrawled the site again

This time I first disabled comments and removed useless stuff such as the pingback URL in header.php, as well as Google Analytics, getclicky and other tracker snippets.

I ended up using the options below and went on a "manual" crawl (-W is httrack's semi-automatic mode, which asks questions as it goes), where httrack prompted me for what to do with external or "near" resources such as tracking scripts. You can choose to ignore all resources on a domain, mirror a whole external domain, and so on.

httrack -W -O "/Users/jonas/Sites/starksignal"  -%v -%P
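After a recrawl like this, it can be worth checking that no tracker snippets slipped through. The sketch below is one way to do it, not something httrack provides: the function name find_leftovers is made up, and the three patterns are my assumption about what the Google Analytics, getclicky and pingback snippets look like in the mirrored HTML.

```shell
#!/bin/sh
# Sketch: list mirrored HTML files that still mention a tracker or pingback.
# find_leftovers and the patterns are assumptions for illustration.

# Usage: find_leftovers MIRROR_DIR
# Prints one file path per line; prints nothing if the mirror is clean.
find_leftovers() {
    find "$1" -type f -name '*.html' -exec \
        grep -lE 'google-analytics\.com|getclicky\.com|rel="pingback"' {} +
}

# Example (placeholder folder):
#   find_leftovers "$HOME/Sites/old-blog"
```

An empty result means none of the patterns matched, so the list doubles as a to-do list for the next cleanup round.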