Questions tagged [httrack]

HTTrack (Website copier)

HTTrack is a free and open source Web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3.

HTTrack allows users to download World Wide Web sites from the Internet to a local computer.[4][5] By default, HTTrack arranges the downloaded site by the original site's relative link-structure. The downloaded (or "mirrored") website can be browsed by opening a page of the site in a browser.

HTTrack can also update an existing mirrored site and resume interrupted downloads. HTTrack is configurable by options and by filters (include/exclude), and has an integrated help system. There is a basic command line version and two GUI versions (WinHTTrack and WebHTTrack); the former can be part of scripts and cron jobs.

HTTrack uses a Web crawler to download a website. Some parts of the website may not be downloaded by default due to the robots exclusion protocol unless disabled during the program. HTTrack can follow links that are generated with basic JavaScript and inside Applets or Flash, but not complex links (generated using functions or expressions) or server-side image maps.

Reference :

http://www.httrack.com/

http://en.wikipedia.org/wiki/HTTrack

61 questions

votes

2 answers

Using HTTrack to mirror a single page

I've been attempting to use HTTrack to mirror a single page (downloading html + prerequisites: style sheets, images, etc), similar to the question [mirror single page with httrack][1]. However, the accepted answer there doesn't work for me, as I'm…

asked Jan 14 '16 at 17:33

Empiromancer

3,258
1
15
41

vote

0 answers

How to resolve a problem of downloading site with Httrack

I recently tried to capture a site with HTTRACK. Unfortunately the capture did not take into account the whole site, in particular the icons at the level of the buttons (favicons) and the slider. For the slider, I got this error: !! Error at loading…

javascript wordpress httrack

asked Nov 19 '20 at 09:54

Adlo

vote

1 answer

Using HTTrack to download links only under a certain subdomain (nothing external)

So, this is what I am trying to download - https://www.slader.com/textbook/9781337624183-calculus-9th-edition/ Looks fairly simple, I tried adding a few lines to "scan rules" to force it to download everything under it but for some reason, the…

web download httrack

asked Aug 30 '20 at 08:04

EvilRaceHorse

vote

0 answers

cookies.txt not working on Httrack version 3.49-2

hi guys I'm using httrack and using cookies for the authentication but it seems my cookie doesnt work the syntax i'm using httrack -b1 -%M -r1 −F "Mozilla/4.5 (compatible; MSIE 4.01; Windows NT)" …

curl cookies httrack

asked Oct 23 '19 at 12:16

dmh

vote

0 answers

Unable to login to a website using HTTrack

I am trying to download the content of a website using HTTrack software. The website requires login details. After selecting the directory to save in, I added the URL. "http://*****/login I selected capture URL and temporarily added the temporary…

httrack

asked Jan 22 '19 at 13:27

Olfa Fdhila

vote

2 answers

'x86_64-linux-gnu-gcc' error in installing apackage using pip3

When I tried to install httrack in Ubuntu 16.04 I was not able to get those packages: pip3 install httrack-py Collecting httrack-py Using cached…

python pip httrack

asked Jul 20 '18 at 03:38

Neeraj Nair

vote

1 answer

HTTrack gives 404 on unicode urls with german special characters

I've realized that HTTrack can't download files if urls have special characters in them, like german ß - it returns a 404 response. Errors look like on screenshot: Is there any setting in HTTrack to make it able to deal with such characters? ps: I…

url unicode httrack

asked Aug 04 '17 at 13:50

Evgeniy

1,838
18
41

vote

1 answer

How do I get httrack to save files with their original names rather than index****.html?

I'm following the HTTrack docs example here: http://httrack.kauler.com/help/User-defined_structure The site I need to scrape has URLs in this…

html web-scraping wget httrack

asked Jul 11 '17 at 19:28

BlueDogRanch

vote

0 answers

Httrack faulty when encountering japanese encoded URLS

I usually don't have any problem with Httrack, but this time, I found out that it doesn't manage to grab pages with non ascii characters like this japanese URL : domain.com/リーク情報の真偽のほ/ ( read by the browser this way :…

url character-encoding httrack

asked Sep 29 '16 at 01:47

majimekun

vote

0 answers

Mirroring websites - 403 Forbidden with user agent strings

I'm working on an application to mirror US university academic catalogs. To do this, I have a cluster of Celery workers that use wget or httrack to mirror the content, styles and scripts, then upload to our S3 bucket. For a small number of…

wget mirroring httrack

asked May 27 '16 at 16:51

Jason

10,777
19
81
169

vote

1 answer

How do I enter the variable value of my bash command into MySQL?

The following code extracts all the domain names from a website and sets them to the value of $domain from a httrack data stream. domain=$(httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo '[[:alnum:]-]+\.(com|net|org)') The…

mysql bash httrack

asked May 25 '14 at 17:00

Wyatt Jackson

vote

0 answers

Mirror a website with httrack while executing javascript

I want do save a mirror of www.youtube.com/tv. I obviously do not want to save the videos. I want the code running the website in a local copy, everything else can stay remote. The code I want is mainly contained in 2 files: live.js and…

javascript http download youtube httrack

asked Nov 13 '13 at 15:51

Martin

3,592
7
39
40

votes

0 answers

Is there a way to use HTTrack or wget with "Session Storage"?

I need to download locally a copy of a website. The problem is that to access the sections of this site, I need to perform a check by which, after clicking a button, will record a variable as true in the Session Storage of the browser (no cookies,…

web scripting wget httrack

asked Apr 16 '21 at 04:53

Sting1

votes

0 answers

download/mirror a website on cloudflare for archiving

Trying to backup ( download / mirror ) a website for archival purposes. This site is apparently on Cloudflare. My usual tool for this would be wget, but it fails on me (even when using a cookie cfduid header). Example of a not-working wget…

wget archive httrack

asked Mar 23 '21 at 10:18

christopher

votes

0 answers

How to complete scrape a web page with resources?

I want total scrape a web page with all resources (.css, .html, .favicon, ,js etc...) Currently, I'm using this command to do that. wget -E -H -k -K -e robots=off -p https://example.com -P ./myDir However, on some pages, the downloaded folder has…

node.js linux web-scraping wget httrack

asked Mar 17 '21 at 15:41

Murat Colyaran

Prev 1

3 4 5 Next