Questions tagged [httrack]

HTTrack (Website copier)

HTTrack is a free and open source Web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3.

HTTrack allows users to download World Wide Web sites from the Internet to a local computer. By default, HTTrack arranges the downloaded site by the original site's relative link structure. The downloaded (or "mirrored") website can be browsed by opening a page of the site in a browser.

HTTrack can also update an existing mirrored site and resume interrupted downloads. HTTrack is configurable by options and by filters (include/exclude), and has an integrated help system. There is a basic command line version and two GUI versions (WinHTTrack and WebHTTrack); the command line version can be used in scripts and cron jobs.
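
As a rough illustration of how the command line version might be driven from a script, here is a minimal Python sketch that only assembles an argument list and prints it; it does not start a download. The helper name `build_httrack_command`, the example URL, the output directory, and the filter patterns are all assumptions made for illustration. The `-O` output option and the `--update` option are taken from HTTrack's command line help, and the sketch assumes the documented convention that include filters start with `+` and exclude filters start with `-`.

```python
import shlex


def build_httrack_command(url, output_dir, include=(), exclude=(), update=False):
    """Assemble an httrack argument list (hypothetical helper; runs nothing)."""
    cmd = ["httrack", url, "-O", output_dir]
    if update:
        # Refresh an existing mirror instead of starting a new one.
        cmd.append("--update")
    # Include filters are prefixed with "+", exclude filters with "-".
    cmd += [f"+{pattern}" for pattern in include]
    cmd += [f"-{pattern}" for pattern in exclude]
    return cmd


cmd = build_httrack_command(
    "https://www.example.com/",   # placeholder site
    "/tmp/example-mirror",        # placeholder output directory
    include=["*.png"],
    exclude=["*.zip"],
)
# shlex.join quotes the wildcard filters so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would be one way to launch the mirror from a script or a cron-driven job, provided `httrack` is on the PATH.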

HTTrack uses a Web crawler to download a website. By default, some parts of the website may not be downloaded because HTTrack honors the robots exclusion protocol, unless that behavior is disabled in the program's options. HTTrack can follow links that are generated with basic JavaScript and inside Applets or Flash, but not complex links (generated using functions or expressions) or server-side image maps.
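
A small sketch of what disabling the robots rules could look like when mirroring a site you control. It assumes the `-s0` option ("never follow robots.txt") as listed in HTTrack's command line help; the function name, URL, and output directory are hypothetical. The code only builds and prints the command, plus a check for whether `httrack` is installed.

```python
import shutil


def robots_off_command(url, output_dir):
    """Build an httrack argument list that ignores robots.txt (hypothetical helper)."""
    # -s0 is assumed here to mean "never follow robots.txt rules".
    return ["httrack", url, "-O", output_dir, "-s0"]


cmd = robots_off_command("https://www.example.com/", "/tmp/example-mirror")
print("httrack found on PATH:", shutil.which("httrack") is not None)
print(cmd)
# To actually start the mirror: subprocess.run(cmd, check=True)
```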

References:

http://www.httrack.com/

http://en.wikipedia.org/wiki/HTTrack

61 questions
0
votes
2 answers

Download website links surrounded by specific elements

I need to recursively mirror a site's wallpaper images that are surrounded by specific markup, like:
Original Resolution: 4800x2700
Views:
mike
0
votes
1 answer

HTTrack returns file not found

I downloaded a website with HTTrack using the following command: /usr/local/bin/httrack https://www.website.com -O /Users/mainuser/Desktop/website -n -j I then located the index.html file in the website folder and ran it. Chrome returns the message:…
sanjihan
0
votes
1 answer

How to download a website including all files with links starting with a certain path

I'd like to build a static website based on the styling of a WordPress template, Inovado. I downloaded the website using HTTrack (in Linux) with the following command: httrack http://inovado.hellominti.com The resulting index.html contains several…
Kurt Peek
0
votes
1 answer

Remove domain URL from website downloaded by HTTrack

I have downloaded a full website with HTTrack. But after downloading the site, all URLs contain the site's domain name, like www.example.com/index.html instead of index.html. Is there any way to remove this part of the URL?
akib
0
votes
0 answers

What blocks the crawl of my website by HTTrack or Wget?

I am attempting to clone my website to show it offline for a presentation. However, I tried with both HTTrack and Wget, and both stop at the second level of the source tree. What could be the reason? This is the Wget command: wget -r…
Baldráni
0
votes
1 answer

Node.js get HTTP_USER_AGENT and Block HTTrack

I want to block all bots (like HTTrack) on my website. Normally, I would use an .htaccess file to block bots via RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]. However, my server is running Node.js Express. How can I get HTTP_USER_AGENT and do a…
0
votes
1 answer

HTTrack wait until page search completed

I'm trying to download with HTTrack the results of a search request at the URL here. Unfortunately, the download starts immediately and doesn't get the search result (as the page is still showing a wheel). Question: is it possible to force a pause…
Tom
0
votes
1 answer

Download .torrent from YTS

Is it possible to download all torrent files from the YTS website? In HTTrack I get a mirror error, probably caused by the captcha that you need to enter before accessing the site. Is there a way to bypass this or use another method?
dcf007
0
votes
1 answer

Using subprocess to run HTTrack from Python in Windows

I'm in the process of writing a web scraping Python script, and one of the things I'd like it to be able to do is take a snapshot of certain pages (all of the HTML, style sheets, and images necessary to view that particular page properly…
Empiromancer
0
votes
1 answer

Trying to mirror site that uses strapdown.js

There is a site that uses strapdown.js that I am trying to mirror using HTTrack or Wget, but I fall short because the site contains Markdown and not HTML. Only strapdown converts the links to HTML links. Hence the client needs to interpret…
Buddy
0
votes
1 answer

HTTrack only downloads the index.html file

Usually when I download sites with HTTrack I get all the files: images, CSS, JS, etc. Today, the program finished downloading in just 2 seconds and only grabbed the index.html file, with the CSS, IMG code, etc. inside still linking to external sources. I've already…
user3379220
0
votes
1 answer

How do I push the result of this complex command line grep statement to a MySQL database?

This code searches through a website's HTML files and extracts a list of domain names... httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo '[[:alnum:]-]+\.(com|net|org)' The result looks like…
Wyatt Jackson
-1
votes
1 answer

How to find the directory structure and the file names under a PHP website?

How do I get the directory structure and filenames under a PHP website I do not own? Not the code, just the structure and the filenames. I tried HTTrack, but since it's a PHP website, it doesn't work.
user12871659
-1
votes
1 answer

Different source code in Inspect and in view-source

While I was looking at the source code of a website, it showed me some random-looking JS code in the body block in view-source, like the following: