Questions tagged [httrack]

HTTrack (Website copier)

HTTrack is a free and open-source web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3.

HTTrack allows users to download World Wide Web sites from the Internet to a local computer. By default, HTTrack arranges the downloaded site by the original site's relative link structure. The downloaded (or "mirrored") website can be browsed by opening a page of the site in a browser.

HTTrack can also update an existing mirrored site and resume interrupted downloads. HTTrack is configurable by options and by filters (include/exclude), and has an integrated help system. There is a basic command-line version and two GUI versions (WinHTTrack and WebHTTrack); the former can be used in scripts and cron jobs.

HTTrack uses a web crawler to download a website. Some parts of the website may not be downloaded by default, due to the robots exclusion protocol, unless that protocol is disabled in the program's options. HTTrack can follow links that are generated with basic JavaScript and inside applets or Flash, but not complex links (generated using functions or expressions) or server-side image maps.
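For instance, ignoring robots.txt is exposed as a command-line option. A minimal sketch (the URL and output directory are placeholders; -s0 is HTTrack's "never follow robots.txt rules" setting):

    # mirror a site while ignoring robots.txt rules (use responsibly)
    httrack "https://example.com/" -O ./mirror -s0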

References:

http://www.httrack.com/

http://en.wikipedia.org/wiki/HTTrack

61 questions
18 votes, 5 answers

mirror single page with httrack

I am trying to use httrack (http://www.httrack.com/) to download a single page, not the entire site. So, for example, when using httrack to download www.google.com, it should only download the HTML found under www.google.com along…
Max • 14,808
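A minimal sketch of the kind of command this question is after, assuming -r1 limits the mirror depth to the starting page and -n additionally fetches "near" non-HTML files such as images and stylesheets (the output directory is a placeholder):

    # download only the top page, plus the non-HTML files it references
    httrack "https://www.google.com/" -O ./google-page -r1 -n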
11 votes, 2 answers

httrack wget curl scrape & fetch

There are a number of tools on the internet for downloading a static copy of a website, such as HTTrack. There are also many tools, some commercial, for “scraping” content from a website, such as Mozenda. Then there are tools which are apparently…
Malik A. Rumi • 1,361
7 votes, 2 answers

Error with Capture URL / Catch URL in HTTrack

I have a problem when clicking Capture URL in HTTrack: it generates an incorrect proxy address. This is the result: Please TEMPORARILY set your browser's proxy preferences to: Proxy's address: fe80::141b:2ce3:3f57:fefb Proxy's port: …
Hoc N • 143
6 votes, 3 answers

How can I make HTTrack only download files on the current domain?

No matter how hard I try, I can't seem to get httrack to leave links going to other domains intact. I've tried using the --stay-on-same-domain argument, and that doesn't seem to do it. I've also tried adding a filter, and that doesn't do it either. There simply…
Alex • 22,845
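A hedged sketch of one way to attack this, assuming -d restricts the crawl to the principal domain and -K makes HTTrack keep original absolute links instead of rewriting them locally:

    # stay on example.com; leave links to other domains pointing at their original URLs
    httrack "https://www.example.com/" -O ./mirror -d -K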
5 votes, 3 answers

Use httrack to download just one site, not external sites

I tried using httrack to download my phpbb forum, but no matter what setup I use, I cannot get it to stop downloading the entire Wikipedia site as well, along with many other websites whose links appear anywhere in the forum... What I managed to do is make it…
Predrag Stojadinović • 3,121
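Explicit scan rules are the usual way to pin HTTrack to a single host. A sketch with the forum URL as a placeholder: "-*" excludes everything first, then the "+" pattern re-allows only the forum itself:

    # exclude everything by default, then allow only the forum's own host
    httrack "https://forum.example.com/" -O ./forum-mirror "-*" "+forum.example.com/*"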
5 votes, 1 answer

How to bundle httrack into a python 3 executable

There is a great website copier that I would like to bundle into my executable, created with Python 3 and py2exe. On the official HTTrack website, the FAQ says there is a DLL/library version available, but I don't know where to…
yuval • 1,948
5 votes, 3 answers

Compiling HTTrack on Mac OS X

I'm trying to compile httrack on my Mac. ./configure is successful, but while compiling the package I'm getting the following error and am not able to resolve it. In file included from htscore.c:40: In file included from ./htscore.h:81: In file included…
user3730989 • 51
4 votes, 1 answer

Retrieving a complete webpage including dynamically loaded links/images

Problem: Downloading a complete working offline copy of a website that loads links/images dynamically. Research: There are questions (e.g. [1], [2], [3]) on Stack Overflow addressing this issue, most of which have top answers using wget or httrack,…
Nader Alexan • 1,802
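Since wget and HTTrack do not execute JavaScript, one common workaround is to let a headless browser render the page and save the resulting DOM. A sketch, assuming a Chromium/Chrome binary is installed (URL and filename are placeholders):

    # render the page with JavaScript executed, then dump the final DOM to a file
    chromium --headless --dump-dom "https://example.com/page" > page.html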
4 votes, 2 answers

Issue downloading a complete website for offline use with HTTrack

I downloaded sonst.cc with HTTrack, but when viewing it offline there’s no content. Every single tab is empty. Why is that? Is there any other app with which I could download the whole thing? I’m losing my mind over here. Thanks. Edit: When I open…
jay • 41
4 votes, 2 answers

HTTrack possible using cookies

I want to download the page from a URL, easy enough. But on the first page I have to log in, as I normally do from a normal browser. HTTrack, however, downloads from the first page since it can't use my cookies or log in. Is there any way for me to get…
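One commonly suggested approach, hedged here because it relies on HTTrack picking up a Netscape-format cookies.txt from the project directory: log in with a normal browser, export the cookies, and drop the file into the mirror folder before crawling (paths are placeholders):

    # create the project directory so the cookie file has somewhere to live
    mkdir -p ./members-mirror
    # cookies.txt: Netscape-format cookie file exported from the logged-in browser
    cp ~/Downloads/cookies.txt ./members-mirror/cookies.txt
    httrack "https://example.com/members/" -O ./members-mirror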
4 votes, 2 answers

httrack follow redirects

I'm trying to mirror webpages recursively, starting from a URL supplied by the user (there is a depth limit set, of course). Wget didn't catch links from CSS/JS, so I decided to use httrack. I try to mirror some site like this: # httrack -r6…
neutrinus • 1,429
3 votes, 2 answers

How to block HTTrack and similar programs?

All HTTrack user agent requests: Mozilla/2.0 (compatible; MS FrontPage Express 2.0) Mozilla/4.05 [fr] (Win98; I) Lynx/2.8rel.3 libwww-FM/2.14 Java1.1.4 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98) HyperBrowser (Cray; I; OrganicOS…
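Server-side blocking usually comes down to matching the User-Agent header, with the obvious caveat that HTTrack users can spoof any agent on the list above. A sketch for an Apache host with mod_rewrite enabled (the stack is an assumption), written as a shell snippet that appends the rule to .htaccess:

    # deny requests whose User-Agent contains "HTTrack" (case-insensitive);
    # trivially bypassed by a spoofed agent, so treat it as a speed bump only
    cat >> .htaccess <<'EOF'
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} HTTrack [NC]
    RewriteRule .* - [F,L]
    EOF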
3 votes, 1 answer

HTTrack: How to download folders only from a certain subfolder level?

HTTrack gives filter options but I cannot figure out how to download a certain subfolder level and ignore all other subfolders. Example:…
Avatar • 11,039
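A hedged sketch of the filter mechanics, assuming the goal is to keep one subfolder and skip its siblings (host and paths are placeholders):

    # drop everything, then re-allow only the wanted subfolder
    httrack "https://www.example.com/docs/manual/" -O ./manual \
        "-*" "+www.example.com/docs/manual/*"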
2 votes, 1 answer

How can I mirror the results of MOSS plagiarism detection?

MOSS is a well-known server for checking software plagiarism. It allows teachers to send homework submissions, calculates the similarity between different submissions, and colors code blocks that are very similar. Here is an example of the results…
Erel Segal-Halevi • 26,318
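Result pages like these are typically static HTML, so a generic mirroring command may be all that's needed. A sketch using wget (the results URL is a placeholder):

    # recursive mirror with rewritten links and page prerequisites, staying below the results URL
    wget --mirror --convert-links --page-requisites --no-parent \
        "http://moss.stanford.edu/results/123456789/"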
2 votes, 2 answers

How should I download a specific file type from a folder (and ONLY its subfolders) using wget or httrack?

I'm trying to use HTTrack or Wget to download some .docx files from a website. I want to do this only for a folder and its subfolders. Ex: www.examplewebsite.com/doc (this goes down 5 more levels) What would be a good way to do this?
NoBlink • 21
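A sketch of the wget variant, assuming -A restricts what is kept to .docx files, --no-parent confines the crawl to /doc and below, and -l bounds the recursion depth (wget still fetches HTML pages temporarily in order to follow their links):

    # crawl up to 6 levels under /doc, keeping only .docx files
    wget -r -l 6 -A "docx" --no-parent "https://www.examplewebsite.com/doc/"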