26

Is there a way to recover an entire website from the Wayback Machine?

I have an old site that is archived, but I no longer have the website files to revive it. Is there a way to recover the old data so I can get my long-lost files back?

Dustin
  • What do you mean by 'website files' - just the html? If yes, then surely you could just go to that webpage and download the source from there through your browser. – franka Mar 16 '12 at 01:06
  • Yes, html, css, images, & possibly php files. This has multiple pages with images and custom css. – Dustin Mar 16 '12 at 01:58
  • 2
    I came across the same issue and ended up coding a gem. To install: `gem install wayback_machine_downloader` then run it with the base URL of the website you want to retrieve as a parameter: `wayback_machine_downloader http://example.com` (a short sketch follows after these comments). More information: https://github.com/hartator/wayback_machine_downloader – Hartator Aug 10 '15 at 06:38
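
A minimal sketch of that gem workflow (`http://example.com` is a placeholder for the archived site; the default output folder mentioned below is an assumption about the gem's behavior):

    # requires Ruby; installs the downloader and mirrors the archived site
    gem install wayback_machine_downloader
    wayback_machine_downloader http://example.com
    # assumption: files are written under ./websites/example.com by default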

1 Answer

47

wget is a great tool for mirroring an entire site, and if you are on Windows you can install it through Cygwin. The following command will mirror a site: `wget -m domain.name`
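
For instance, a minimal mirror run might look like this (`example.com` is a placeholder for your own domain; `-p`, `-k`, and `-E` are optional extras that make the copy viewable offline):

    # example.com is a placeholder; -m mirrors the site,
    # -p grabs page requisites (CSS, images), -k rewrites links
    # for local viewing, -E saves pages with .html extensions
    wget -m -p -k -E http://example.com/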

Update from comments:

The example wget command won't ascend to the parent directory (`-np`), ignores robots.txt (`-e robots=off`), restricts the download to the archive's CDN domains (`--domains=staticweb.archive.org,web.archive.org`), and mirrors the given URL (the archived page you want to pull down). All together you get:

 wget -np -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org http://web.archive.org/web/19970708161549/http://www.google.com/
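
To pull down your own archived site, the same pattern applies; here is a sketch with a hypothetical timestamp and domain (take both from the Wayback Machine URL of the capture you want):

    # hypothetical capture: replace 20120316000000 and www.example.com with
    # the timestamp and domain from your own archive.org snapshot URL
    wget -np -e robots=off --mirror \
         --domains=staticweb.archive.org,web.archive.org \
         http://web.archive.org/web/20120316000000/http://www.example.com/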

If you are dealing with HTTPS and a self-signed certificate, you can use `--no-check-certificate` to disable the certificate check. The wget help (`wget --help`) is the best place to see the possible options.
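
For example (hypothetical self-signed host):

    # skips TLS certificate verification for a self-signed host
    wget --mirror --no-check-certificate https://self-signed.example.com/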

mguymon
  • Thank you for the resource, much appreciated. I have a Mac and an app called SiteSucker which seems to do the same thing. The problem is downloading through a full archive.org URL. – Dustin Mar 16 '12 at 01:59
  • 2
    +1 for help with **blocked recursive crawling**! This should be the accepted answer. – jibiel Jul 18 '12 at 11:20
  • 1
    `-np` helps keep the crawl from straying off the specified date path. – Ray Oct 19 '13 at 01:21
  • Great, thanks. And for a great guide to installing `wget` on Mac OSX without Homebrew or similar, check out http://coolestguidesontheplanet.com/install-and-configure-wget-on-os-x/ – Toby Jan 06 '14 at 20:58
  • When using https add --no-check-certificate – jrosell Feb 13 '14 at 16:51
  • Good stuff, I will update the example. – mguymon Feb 13 '14 at 18:01
  • 1
    @mguymon But is there any way to download the css and photos with that command? – jcarlosweb Apr 19 '15 at 12:40
  • 1
    @jcarlosweb you'll need to remove `-np`, and then it's a good idea to limit recursion, for example `-l 3` – valiano Jan 17 '19 at 11:27
  • Replying to @jcarlosweb: no, you need a few more options, e.g.: `wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains domain.tld my.domain.tld/`, take a look at https://www.linuxjournal.com/content/downloading-entire-web-site-wget (note: this will work for web.archive.org as well, just add the extra options; a combined sketch follows after these comments) – Gwyneth Llewelyn Nov 06 '19 at 18:44
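
Pulling those last comments together, a sketch for grabbing CSS and images from a Wayback Machine capture (hypothetical timestamp and domain; `-np` is dropped here, so recursion is capped at depth 3 as suggested above):

    # -p fetches CSS/images, --html-extension and -k make pages viewable locally,
    # -l 3 caps recursion depth since -np is not used here
    wget -e robots=off --recursive -l 3 --page-requisites --html-extension --convert-links \
         --domains=staticweb.archive.org,web.archive.org \
         http://web.archive.org/web/20120316000000/http://www.example.com/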