26

Is there a way to recover an entire website from the Wayback Machine?

I have an old site that is archived, but I no longer have the website files to revive it. Is there a way to recover the old data so I can get my long-lost files back?

Dustin
  • What do you mean by 'website files' - just the html? If yes, then surely you could just go to that webpage and download the source from there through your browser. – franka Mar 16 '12 at 01:06
  • Yes, html, css, images, & possibly php files. This has multiple pages with images and custom css. – Dustin Mar 16 '12 at 01:58
  • 2
    I came across the same issue and ended up coding a gem. To install: `gem install wayback_machine_downloader` then run it with the base URL of the website you want to retrieve as a parameter: `wayback_machine_downloader http://example.com` (a short sketch follows after these comments). More information: https://github.com/hartator/wayback_machine_downloader – Hartator Aug 10 '15 at 06:38
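
A minimal sketch of that gem workflow (`http://example.com` is a placeholder for the archived site; the default output folder mentioned below is an assumption about the gem's behavior):

    # requires Ruby; installs the downloader and mirrors the archived site
    gem install wayback_machine_downloader
    wayback_machine_downloader http://example.com
    # assumption: files are written under ./websites/example.com by default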

1 Answer

47

wget is a great tool for mirroring an entire site, and if you are on Windows you can install it through Cygwin. The following command will mirror a site: `wget -m domain.name`
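
For instance, a minimal mirror run might look like this (`example.com` is a placeholder for your own domain; `-p`, `-k`, and `-E` are optional extras that make the copy viewable offline):

    # example.com is a placeholder; -m mirrors the site,
    # -p grabs page requisites (CSS, images), -k rewrites links
    # for local viewing, -E saves pages with .html extensions
    wget -m -p -k -E http://example.com/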

Update from comments:

The example wget command won't ascend to the parent directory (`-np`), ignores robots.txt (`-e robots=off`), restricts the download to the archive's CDN domains (`--domains=staticweb.archive.org,web.archive.org`), and mirrors the given URL (the archived page you want to pull down). All together you get:

 wget -np -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org http://web.archive.org/web/19970708161549/http://www.google.com/
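
To pull down your own archived site, the same pattern applies; here is a sketch with a hypothetical timestamp and domain (take both from the Wayback Machine URL of the capture you want):

    # hypothetical capture: replace 20120316000000 and www.example.com with
    # the timestamp and domain from your own archive.org snapshot URL
    wget -np -e robots=off --mirror \
         --domains=staticweb.archive.org,web.archive.org \
         http://web.archive.org/web/20120316000000/http://www.example.com/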

If you are dealing with HTTPS and a self-signed certificate, you can use `--no-check-certificate` to disable the certificate check. The wget help (`wget --help`) is the best place to see the possible options.
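
For example (hypothetical self-signed host):

    # skips TLS certificate verification for a self-signed host
    wget --mirror --no-check-certificate https://self-signed.example.com/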

mguymon
  • Thank you for the resource, much appreciated. I have a Mac and an app called SiteSucker which seems to do the same thing. The problem is downloading through a full archive.org URL. – Dustin Mar 16 '12 at 01:59
  • 2
    +1 for help with **blocked recursive crawling**! This should be the accepted answer. – jibiel Jul 18 '12 at 11:20
  • 1
    `-np` helps keep the crawl from straying off the specified date path. – Ray Oct 19 '13 at 01:21
  • Great, thanks. And for a great guide to installing `wget` on Mac OSX without Homebrew or similar, check out http://coolestguidesontheplanet.com/install-and-configure-wget-on-os-x/ – Toby Jan 06 '14 at 20:58
  • When using https add --no-check-certificate – jrosell Feb 13 '14 at 16:51
  • Good stuff, I will update the example. – mguymon Feb 13 '14 at 18:01
  • 1
    @mguymon But is there any way to download the css and photos with that command? – jcarlosweb Apr 19 '15 at 12:40
  • 1
    @jcarlosweb you'll need to remove `-np`, and then it's a good idea to limit recursion, for example `-l 3` – valiano Jan 17 '19 at 11:27
  • Replying to @jcarlosweb: no, you need a few more options, e.g.: `wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains domain.tld my.domain.tld/`, take a look at https://www.linuxjournal.com/content/downloading-entire-web-site-wget (note: this will work for web.archive.org as well, just add the extra options; a combined sketch follows after these comments) – Gwyneth Llewelyn Nov 06 '19 at 18:44
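
Pulling those last comments together, a sketch for grabbing CSS and images from a Wayback Machine capture (hypothetical timestamp and domain; `-np` is dropped here, so recursion is capped at depth 3 as suggested above):

    # -p fetches CSS/images, --html-extension and -k make pages viewable locally,
    # -l 3 caps recursion depth since -np is not used here
    wget -e robots=off --recursive -l 3 --page-requisites --html-extension --convert-links \
         --domains=staticweb.archive.org,web.archive.org \
         http://web.archive.org/web/20120316000000/http://www.example.com/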