Questions tagged [httrack]

HTTrack (Website copier)

HTTrack is a free and open source Web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3.

HTTrack allows users to download World Wide Web sites from the Internet to a local computer. By default, HTTrack arranges the downloaded site by the original site's relative link-structure. The downloaded (or "mirrored") website can be browsed by opening a page of the site in a browser.

HTTrack can also update an existing mirrored site and resume interrupted downloads. HTTrack is configurable by options and by filters (include/exclude), and has an integrated help system. There is a basic command line version and two GUI versions (WinHTTrack and WebHTTrack); the former can be part of scripts and cron jobs.
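For example, a minimal command line sketch (the URL, output directory, and filter are placeholders, not taken from the HTTrack documentation):

    httrack "https://example.com/" -O /home/user/mirrors/example "+*.example.com/*"

A cron job can later refresh the same mirror, e.g. by running httrack --update from inside the project directory.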

HTTrack uses a Web crawler to download a website. Some parts of the website may not be downloaded by default due to the robots exclusion protocol, unless that behaviour is disabled in the program's options. HTTrack can follow links that are generated with basic JavaScript and inside Applets or Flash, but not complex links (generated using functions or expressions) or server-side image maps.
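As a hedged illustration of the robots behaviour: the -sN option controls whether robots.txt rules are followed (s0 = never, s2 = always), so a crawl that must ignore them could look like this, with a placeholder URL and directory:

    httrack "https://example.com/" -O ./example-mirror -s0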

References:

http://www.httrack.com/

http://en.wikipedia.org/wiki/HTTrack

61 questions
2
votes
2 answers

Using HTTrack to mirror a single page

I've been attempting to use HTTrack to mirror a single page (downloading the HTML plus prerequisites: style sheets, images, etc.), similar to the question "mirror single page with httrack". However, the accepted answer there doesn't work for me, as I'm…
Empiromancer
  • 3,258
  • 1
  • 15
  • 41
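One commonly suggested approach for the question above (a sketch, not necessarily the accepted answer the asker refers to) is to limit the mirror depth to the starting page and let HTTrack fetch the non-HTML files it references:

    httrack "https://example.com/page.html" -O ./page-mirror -r1 -n

Here -r1 keeps the crawl at the starting page and -n (--near) additionally grabs images, style sheets, and scripts referenced by it; the URL and directory are placeholders.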
1
vote
0 answers

How to resolve a problem downloading a site with HTTrack

I recently tried to capture a site with HTTrack. Unfortunately the capture did not include the whole site, in particular the button icons (favicons) and the slider. For the slider, I got this error: !! Error at loading…
Adlo
  • 33
  • 4
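A sketch of one thing to try for the missing favicons (URL and directory are placeholders): whitelist the icon and image types explicitly and enable the near option, so files referenced outside the crawled page structure are still fetched:

    httrack "https://example.com/" -O ./mirror -n "+*.ico" "+*.png" "+*.gif" "+*.jpg"

If the slider is injected by JavaScript at runtime, HTTrack may miss it regardless, since it only follows basic JavaScript links (see the tag description above).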
1
vote
1 answer

Using HTTrack to download links only under a certain subdomain (nothing external)

So, this is what I am trying to download - https://www.slader.com/textbook/9781337624183-calculus-9th-edition/ Looks fairly simple. I tried adding a few lines to "scan rules" to force it to download everything under it, but for some reason, the…
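A sketch of scan rules that restrict the crawl to that path (filter order matters, as a later rule overrides an earlier one; the output directory is a placeholder):

    httrack "https://www.slader.com/textbook/9781337624183-calculus-9th-edition/" -O ./slader "-*" "+www.slader.com/textbook/*"

The "-*" rule excludes everything by default and the "+" rule re-includes only URLs under the textbook path.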
1
vote
0 answers

cookies.txt not working on HTTrack version 3.49-2

Hi guys, I'm using HTTrack with cookies for the authentication, but it seems my cookie doesn't work. The syntax I'm using: httrack -b1 -%M -r1 -F "Mozilla/4.5 (compatible; MSIE 4.01; Windows NT)" …
dmh
  • 298
  • 7
  • 26
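For reference, HTTrack picks up a Netscape-format cookies.txt placed in the project folder; a sketch with placeholder paths (note also that the -F flag must use a plain ASCII hyphen, whereas the command quoted above contained a Unicode minus sign before the F):

    mkdir -p ./mirror
    cp cookies.txt ./mirror/cookies.txt
    httrack "https://example.com/" -O ./mirror -b1 -F "Mozilla/4.5 (compatible; MSIE 4.01; Windows NT)"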
1
vote
0 answers

Unable to log in to a website using HTTrack

I am trying to download the content of a website using HTTrack. The website requires login details. After selecting the directory to save in, I added the URL "http://*****/login". I selected capture URL and temporarily added the temporary…
Olfa Fdhila
  • 99
  • 2
  • 8
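Two options that exist in the command line version, sketched here with placeholder credentials and URL: basic HTTP authentication can be embedded in the URL, and form-based logins can be recorded through the temporary proxy started by --catchurl:

    httrack "https://user:password@example.com/members/" -O ./mirror
    httrack --catchurl

The second command prints a proxy address to set in the browser, so the login POST can be captured and replayed by the crawler.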
1
vote
2 answers

'x86_64-linux-gnu-gcc' error when installing a package using pip3

When I tried to install httrack-py on Ubuntu 16.04, I was not able to get the package to build: pip3 install httrack-py Collecting httrack-py Using cached…
Neeraj Nair
  • 165
  • 9
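That 'x86_64-linux-gnu-gcc' failure usually means the C build toolchain and the Python development headers are missing, so the extension module cannot compile. A sketch of the usual fix on Ubuntu 16.04 (assuming the underlying error mentions a missing Python.h):

    sudo apt-get install build-essential python3-dev
    pip3 install httrack-py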
1
vote
1 answer

HTTrack gives 404 on Unicode URLs with German special characters

I've realized that HTTrack can't download files if URLs have special characters in them, like the German ß; it returns a 404 response. The errors look like the attached screenshot. Is there any setting in HTTrack to make it able to deal with such characters? PS: I…
Evgeniy
  • 1,838
  • 18
  • 41
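There appears to be no dedicated HTTrack switch for this; one hedged workaround is to percent-encode the affected URLs before passing them in, or to fall back to wget, whose IRI handling can be steered explicitly when it is built with IRI support (the URL is a placeholder):

    wget --local-encoding=UTF-8 --restrict-file-names=nocontrol -E -k -p "https://example.com/straße.html"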
1
vote
1 answer

How do I get httrack to save files with their original names rather than index****.html?

I'm following the HTTrack docs example here: http://httrack.kauler.com/help/User-defined_structure The site I need to scrape has URLs in this…
BlueDogRanch
  • 444
  • 5
  • 27
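Following the User-defined structure page linked above, the -N option takes a %-pattern; a sketch with placeholder URL and directory:

    httrack "https://example.com/" -O ./mirror -N "%h%p/%n.%t"

Here %h is the host, %p the path, %n the original file name, and %t the extension, so pages keep their original names instead of the index-style numbering.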
1
vote
0 answers

HTTrack faulty when encountering Japanese-encoded URLs

I usually don't have any problem with HTTrack, but this time I found that it doesn't manage to grab pages with non-ASCII characters, like this Japanese URL: domain.com/リーク情報の真偽のほ/ (read by the browser this way:…
majimekun
  • 200
  • 1
  • 10
1
vote
0 answers

Mirroring websites - 403 Forbidden with user agent strings

I'm working on an application to mirror US university academic catalogs. To do this, I have a cluster of Celery workers that use wget or httrack to mirror the content, styles and scripts, then upload to our S3 bucket. For a small number of…
Jason
  • 10,777
  • 19
  • 81
  • 169
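A sketch of the HTTrack side for the question above (the catalog URL and agent string are placeholders): send a realistic browser User-Agent with -F, and slow the crawler down, since some servers return 403 for aggressive clients rather than for the agent string itself:

    httrack "https://catalog.example.edu/" -O ./catalog -F "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" -c4 -%c2

Here -c4 caps simultaneous connections and -%c2 caps connections per second.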
1
vote
1 answer

How do I enter the variable value of my bash command into MySQL?

The following code extracts all the domain names from a website and stores them in $domain, using an httrack data stream. domain=$(httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo '[[:alnum:]-]+\.(com|net|org)') The…
Wyatt Jackson
  • 173
  • 1
  • 1
  • 10
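A sketch of feeding that variable into MySQL with the stock mysql client (the database, table, and credentials are placeholders):

    domain="example.com"   # in the real script this comes from the httrack pipeline above
    mysql -u dbuser -p'dbpass' mydb -e "INSERT INTO domains (name) VALUES ('$domain');"

Since the value is interpolated by the shell, untrusted input should go through a client library with prepared statements instead.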
1
vote
0 answers

Mirror a website with httrack while executing javascript

I want to save a mirror of www.youtube.com/tv. I obviously do not want to save the videos. I want the code running the website in a local copy; everything else can stay remote. The code I want is mainly contained in 2 files: live.js and…
Martin
  • 3,592
  • 7
  • 39
  • 40
0
votes
0 answers

Is there a way to use HTTrack or wget with "Session Storage"?

I need to download a local copy of a website. The problem is that to access the sections of this site, I need to pass a check which, after clicking a button, records a variable as true in the browser's Session Storage (no cookies,…
Sting1
  • 39
  • 5
0
votes
0 answers

Download/mirror a website on Cloudflare for archiving

Trying to back up (download/mirror) a website for archival purposes. This site is apparently on Cloudflare. My usual tool for this would be wget, but it fails on me (even when using a cfduid cookie header). Example of a not-working wget…
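For completeness, a sketch of a wget attempt with a browser-like agent and a manually supplied Cloudflare cookie (the cookie value and URL are placeholders):

    wget --mirror --convert-links --page-requisites -e robots=off -U "Mozilla/5.0 (X11; Linux x86_64)" --header "Cookie: __cfduid=PLACEHOLDER" "https://example.com/"

If the site sits behind Cloudflare's JavaScript challenge, neither wget nor HTTrack can solve it; the cookies then have to be produced by a real (or headless) browser first.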
0
votes
0 answers

How to completely scrape a web page with its resources?

I want to completely scrape a web page with all resources (.css, .html, favicon, .js, etc.). Currently, I'm using this command to do that: wget -E -H -k -K -e robots=off -p https://example.com -P ./myDir However, on some pages, the downloaded folder has…
Murat Colyaran
  • 467
  • 1
  • 12
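A hedged note on why some pages still come back incomplete: wget does not execute JavaScript, so resources loaded by scripts are never requested. For static resources, combining the question's flags with an explicit host allow-list keeps --span-hosts under control (long-option spellings of flags like those in the question; the host list is a placeholder):

    wget --mirror --page-requisites --convert-links --adjust-extension --span-hosts --domains=example.com,cdn.example.com -e robots=off -P ./myDir "https://example.com"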