Friday, March 24, 2023

[Study] Wget for Windows 1.21.3 website download tool: installation and trial

2023-03-15

Wget - Wikipedia, the free encyclopedia
https://zh.wikipedia.org/wiki/Wget

GNU Wget (often simply called Wget) is a simple yet powerful piece of free software for downloading files over the network, and is itself part of the GNU Project. Its name is a combination of "World Wide Web" and "Get", which also hints at the software's main function. It currently supports downloading over HTTP, HTTPS, and FTP, the three most common TCP/IP protocols.

*****

Wget: first released in 1995, written by Hrvoje Nikšić. Wget is a command-line tool maintained as part of the GNU Project (under the Free Software Foundation, FSF). It supports multiple network protocols, including HTTP, HTTPS, and FTP, and can automatically and recursively download an entire website. At the time this note was written, the current version was listed as 1.21.2; the official site is https://www.gnu.org/software/wget/

GNU Wget: the name dates to 1996. Wget and GNU Wget are the same software under different names, free software maintained under the Free Software Foundation (FSF). Early on, the program was simply called Wget and was maintained by its author, Hrvoje Nikšić, himself. Wget was later brought into the GNU Project and renamed GNU Wget (1997). So Wget and GNU Wget are the same program, differing only in name, and today only the name GNU Wget is officially maintained. It supports the HTTP, HTTPS, and FTP protocols, with features such as resumable downloads and recursive downloads. The latest releases are wget 1.21 (2020-12-31) and wget2 2.0.1 (2022-05-27), downloadable at: https://www.gnu.org/software/wget/

wget for Windows. Official download pages:

https://gnuwin32.sourceforge.net/packages/wget.htm, wget 1.11.4 (2008-12-31)

https://eternallybored.org/misc/wget/, wget 1.21.3 (2022-03-12)

According to https://eternallybored.org/misc/wget/manual/wget-1.18.html
we learn: "by Hrvoje Nikšić and others" and "GNU Wget was written by Hrvoje Nikšić".

According to https://en.wikipedia.org/wiki/Wget
we learn: "the development of which commenced in late 1995".

********************************************************************************

wget.exe https://www.xxx.org.tw/ : downloads only a single page

wget.exe https://www.xxx.org.tw/ -r : recursive download

wget --mirror -p --convert-links -P ./LOCAL URL : mirror an entire site


$ wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains example.org \
     --no-parent \
     www.example.org/tutorials/html/

From this page: http://www.linuxjournal.com/content/downloading-entire-web-site-wget


wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains sh168.osha.gov.tw --no-parent https://sh168.osha.gov.tw/default.aspx




wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains www.xxx.idv.tw --no-parent https://www.xxx.idv.tw/ -P C:\Wget\Web3 -r


Incomplete or invalid multibyte sequence encountered


Many large files were not downloaded.




wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains xxx.idv.tw --no-parent https://www.xxx.idv.tw/ -P C:\Wget\Web4 


Incomplete or invalid multibyte sequence encountered


Still, many large files were not downloaded; it cannot handle Chinese file names.




Fix for Chinese file names: --restrict-file-names=nocontrol
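The reason nocontrol helps is that wget's default escaping treats bytes 0x80-0x9F as control characters, and the UTF-8 byte sequences for Chinese characters can contain bytes in exactly that range. A minimal sketch (any character whose UTF-8 encoding contains such a byte would do):

```shell
# '下' (U+4E0B) encodes in UTF-8 as e4 b8 8b; the final byte 0x8b falls in
# the 0x80-0x9f range that wget escapes as "control characters" by default,
# which is why such file names come out mangled unless nocontrol is used.
bytes=$(printf '下' | od -An -tx1 | tr -d ' \n')
echo "$bytes"
```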


wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=nocontrol --domains www.xxx.idv.tw --no-parent https://www.xxx.idv.tw/ -P C:\Wget\Web5 -r




-i,  --input-file=FILE           download URLs found in local or external FILE


wget -i --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=nocontrol -L --domains xxx.idv.tw --no-parent https://www.xxx.idv.tw/ -P C:\Wget\Web8 -r

(Note: as written this invocation is incorrect: -i takes a FILE argument, so it consumes --recursive as a file name.)
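For reference, -i expects a plain text file with one URL per line; the recursion options then apply to each listed URL. A minimal sketch (the URLs below are placeholders):

```shell
# Build a URL list file, one URL per line, for use with wget -i.
cat > urls.txt <<'EOF'
https://www.example.org/a.html
https://www.example.org/b.html
EOF
wc -l < urls.txt
# A real run would then be: wget -i urls.txt -r --no-parent (not run here)
```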



Google search: wget "Incomplete or invalid multibyte sequence encountered" site:eternallybored.org


No official fix was found.




Skip server-certificate verification:


wget --no-check-certificate --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains www.xxx.idv.tw --no-parent https://www.xxx.idv.tw/

Extension-related options:
-E,  --adjust-extension          save HTML/CSS documents with proper extensions
-A,  --accept=LIST               comma-separated list of accepted extensions
-R,  --reject=LIST               comma-separated list of rejected extensions

Include-related options:
     --referer=URL               include 'Referer: URL' header in HTTP request
-I,  --include-directories=LIST  list of allowed directories

Exclude-related options:
     --exclude-domains=LIST      comma-separated list of rejected domains
-X,  --exclude-directories=LIST  list of excluded directories



D:\SOFTWARE\Wget64>wget --help
GNU Wget 1.21.3, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit
  -h,  --help                      print this help
  -b,  --background                go to background after startup
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE
  -a,  --append-output=FILE        append messages to FILE
  -d,  --debug                     print lots of debugging information
  -q,  --quiet                     quiet (no output)
  -v,  --verbose                   be verbose (this is the default)
  -nv, --no-verbose                turn off verboseness, without being quiet
       --report-speed=TYPE         output bandwidth as TYPE.  TYPE can be bits
  -i,  --input-file=FILE           download URLs found in local or external FILE
       --input-metalink=FILE       download files covered in local Metalink FILE
  -F,  --force-html                treat input file as HTML
  -B,  --base=URL                  resolves HTML input-file links (-i -F)
                                     relative to URL
       --config=FILE               specify config file to use
       --no-config                 do not read any config file
       --rejected-log=FILE         log reasons for URL rejection to FILE

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits)
       --retry-connrefused         retry even if connection is refused
       --retry-on-http-error=ERRORS    comma-separated list of HTTP errors to retry
  -O,  --output-document=FILE      write documents to FILE
  -nc, --no-clobber                skip downloads that would download to
                                     existing files (overwriting them)
       --no-netrc                  don't try to obtain credentials from .netrc
  -c,  --continue                  resume getting a partially-downloaded file
       --start-pos=OFFSET          start downloading from zero-based position OFFSET
       --progress=TYPE             select progress gauge type
       --show-progress             display the progress bar in any verbosity mode
  -N,  --timestamping              don't re-retrieve files unless newer than
                                     local
       --no-if-modified-since      don't use conditional if-modified-since get
                                     requests in timestamping mode
       --no-use-server-timestamps  don't set the local file's timestamp by
                                     the one on the server
  -S,  --server-response           print server response
       --spider                    don't download anything
  -T,  --timeout=SECONDS           set all timeout values to SECONDS
       --dns-servers=ADDRESSES     list of DNS servers to query (comma separated)
       --bind-dns-address=ADDRESS  bind DNS resolver to ADDRESS (hostname or IP) on local host
       --dns-timeout=SECS          set the DNS lookup timeout to SECS
       --connect-timeout=SECS      set the connect timeout to SECS
       --read-timeout=SECS         set the read timeout to SECS
  -w,  --wait=SECONDS              wait SECONDS between retrievals
                                     (applies if more then 1 URL is to be retrieved)
       --waitretry=SECONDS         wait 1..SECONDS between retries of a retrieval
                                     (applies if more then 1 URL is to be retrieved)
       --random-wait               wait from 0.5*WAIT...1.5*WAIT secs between retrievals
                                     (applies if more then 1 URL is to be retrieved)
       --no-proxy                  explicitly turn off proxy
  -Q,  --quota=NUMBER              set retrieval quota to NUMBER
       --bind-address=ADDRESS      bind to ADDRESS (hostname or IP) on local host
       --limit-rate=RATE           limit download rate to RATE
       --no-dns-cache              disable caching DNS lookups
       --restrict-file-names=OS    restrict chars in file names to ones OS allows
       --ignore-case               ignore case when matching files/directories
  -4,  --inet4-only                connect only to IPv4 addresses
  -6,  --inet6-only                connect only to IPv6 addresses
       --prefer-family=FAMILY      connect first to addresses of specified family,
                                     one of IPv6, IPv4, or none
       --user=USER                 set both ftp and http user to USER
       --password=PASS             set both ftp and http password to PASS
       --ask-password              prompt for passwords
       --use-askpass=COMMAND       specify credential handler for requesting
                                     username and password.  If no COMMAND is
                                     specified the WGET_ASKPASS or the SSH_ASKPASS
                                     environment variable is used.
       --no-iri                    turn off IRI support
       --local-encoding=ENC        use ENC as the local encoding for IRIs
       --remote-encoding=ENC       use ENC as the default remote encoding
       --unlink                    remove file before clobber
       --keep-badhash              keep files with checksum mismatch (append .badhash)
       --metalink-index=NUMBER     Metalink application/metalink4+xml metaurl ordinal NUMBER
       --metalink-over-http        use Metalink metadata from HTTP response headers
       --preferred-location        preferred location for Metalink resources

Directories:
  -nd, --no-directories            don't create directories
  -x,  --force-directories         force creation of directories
  -nH, --no-host-directories       don't create host directories
       --protocol-directories      use protocol name in directories
  -P,  --directory-prefix=PREFIX   save files to PREFIX/..
       --cut-dirs=NUMBER           ignore NUMBER remote directory components

HTTP options:
       --http-user=USER            set http user to USER
       --http-password=PASS        set http password to PASS
       --no-cache                  disallow server-cached data
       --default-page=NAME         change the default page name (normally
                                     this is 'index.html'.)
  -E,  --adjust-extension          save HTML/CSS documents with proper extensions
       --ignore-length             ignore 'Content-Length' header field
       --header=STRING             insert STRING among the headers
       --compression=TYPE          choose compression, one of auto, gzip and none. (default: none)
       --max-redirect              maximum redirections allowed per page
       --proxy-user=USER           set USER as proxy username
       --proxy-password=PASS       set PASS as proxy password
       --referer=URL               include 'Referer: URL' header in HTTP request
       --save-headers              save the HTTP headers to file
  -U,  --user-agent=AGENT          identify as AGENT instead of Wget/VERSION
       --no-http-keep-alive        disable HTTP keep-alive (persistent connections)
       --no-cookies                don't use cookies
       --load-cookies=FILE         load cookies from FILE before session
       --save-cookies=FILE         save cookies to FILE after session
       --keep-session-cookies      load and save session (non-permanent) cookies
       --post-data=STRING          use the POST method; send STRING as the data
       --post-file=FILE            use the POST method; send contents of FILE
       --method=HTTPMethod         use method "HTTPMethod" in the request
       --body-data=STRING          send STRING as data. --method MUST be set
       --body-file=FILE            send contents of FILE. --method MUST be set
       --content-disposition       honor the Content-Disposition header when
                                     choosing local file names (EXPERIMENTAL)
       --content-on-error          output the received content on server errors
       --auth-no-challenge         send Basic HTTP authentication information
                                     without first waiting for the server's
                                     challenge

HTTPS (SSL/TLS) options:
       --secure-protocol=PR        choose secure protocol, one of auto, SSLv2,
                                     SSLv3, TLSv1, TLSv1_1, TLSv1_2, TLSv1_3 and PFS
       --https-only                only follow secure HTTPS links
       --no-check-certificate      don't validate the server's certificate
       --certificate=FILE          client certificate file
       --certificate-type=TYPE     client certificate type, PEM or DER
       --private-key=FILE          private key file
       --private-key-type=TYPE     private key type, PEM or DER
       --ca-certificate=FILE       file with the bundle of CAs
       --ca-directory=DIR          directory where hash list of CAs is stored
       --crl-file=FILE             file with bundle of CRLs
       --pinnedpubkey=FILE/HASHES  Public key (PEM/DER) file, or any number
                                   of base64 encoded sha256 hashes preceded by
                                   'sha256//' and separated by ';', to verify
                                   peer against
       --random-file=FILE          file with random data for seeding the SSL PRNG

       --ciphers=STR           Set the priority string (GnuTLS) or cipher list string (OpenSSL) directly.
                                   Use with care. This option overrides --secure-protocol.
                                   The format and syntax of this string depend on the specific SSL/TLS engine.
HSTS options:
       --no-hsts                   disable HSTS
       --hsts-file                 path of HSTS database (will override default)

FTP options:
       --ftp-user=USER             set ftp user to USER
       --ftp-password=PASS         set ftp password to PASS
       --no-remove-listing         don't remove '.listing' files
       --no-glob                   turn off FTP file name globbing
       --no-passive-ftp            disable the "passive" transfer mode
       --preserve-permissions      preserve remote file permissions
       --retr-symlinks             when recursing, get linked-to files (not dir)

FTPS options:
       --ftps-implicit                 use implicit FTPS (default port is 990)
       --ftps-resume-ssl               resume the SSL/TLS session started in the control connection when
                                         opening a data connection
       --ftps-clear-data-connection    cipher the control channel only; all the data will be in plaintext
       --ftps-fallback-to-ftp          fall back to FTP if FTPS is not supported in the target server
WARC options:
       --warc-file=FILENAME        save request/response data to a .warc.gz file
       --warc-header=STRING        insert STRING into the warcinfo record
       --warc-max-size=NUMBER      set maximum size of WARC files to NUMBER
       --warc-cdx                  write CDX index files
       --warc-dedup=FILENAME       do not store records listed in this CDX file
       --no-warc-compression       do not compress WARC files with GZIP
       --no-warc-digests           do not calculate SHA1 digests
       --no-warc-keep-log          do not store the log file in a WARC record
       --warc-tempdir=DIRECTORY    location for temporary files created by the
                                     WARC writer

Recursive download:
  -r,  --recursive                 specify recursive download
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite)
       --delete-after              delete files locally after downloading them
  -k,  --convert-links             make links in downloaded HTML or CSS point to
                                     local files
       --convert-file-only         convert the file part of the URLs only (usually known as the basename)
       --backups=N                 before writing file X, rotate up to N backup files
  -K,  --backup-converted          before converting file X, back up as X.orig
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing
  -p,  --page-requisites           get all images, etc. needed to display HTML page
       --strict-comments           turn on strict (SGML) handling of HTML comments

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions
  -R,  --reject=LIST               comma-separated list of rejected extensions
       --accept-regex=REGEX        regex matching accepted URLs
       --reject-regex=REGEX        regex matching rejected URLs
       --regex-type=TYPE           regex type (posix|pcre)
  -D,  --domains=LIST              comma-separated list of accepted domains
       --exclude-domains=LIST      comma-separated list of rejected domains
       --follow-ftp                follow FTP links from HTML documents
       --follow-tags=LIST          comma-separated list of followed HTML tags
       --ignore-tags=LIST          comma-separated list of ignored HTML tags
  -H,  --span-hosts                go to foreign hosts when recursive
  -L,  --relative                  follow relative links only
  -I,  --include-directories=LIST  list of allowed directories
       --trust-server-names        use the name specified by the redirection
                                     URL's last component
  -X,  --exclude-directories=LIST  list of excluded directories
  -np, --no-parent                 don't ascend to the parent directory

Email bug reports, questions, discussions to <bug-wget@gnu.org>
and/or open issues at https://savannah.gnu.org/bugs/?func=additem&group=wget.

D:\SOFTWARE\Wget64>

https://www.gnu.org/software/wget/manual/wget.html

********************************************************************************

‘--restrict-file-names=modes’

Change which characters found in remote URLs must be escaped during generation of local file names. Characters that are restricted by this option are escaped, i.e. replaced with '%HH', where 'HH' is the hexadecimal number that corresponds to the restricted character. This option may also be used to force all alphabetical cases to be either lowercase or uppercase.
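The '%HH' value is simply the restricted character's hexadecimal code. A quick sketch using POSIX printf, whose leading-quote operand yields a character's numeric code:

```shell
# Compute the escape wget would emit for '?': its code is 0x3F, so the
# escaped form is %3F.
printf '%%%02X\n' "'?"
```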

By default, Wget escapes the characters that are not valid or safe as part of file names on your operating system, as well as control characters that are typically unprintable. This option is useful for changing these defaults, perhaps because you are downloading to a non-native partition, or because you want to disable escaping of the control characters, or you want to further restrict characters to only those in the ASCII range of values.

The modes are a comma-separated set of text values. The acceptable values are 'unix', 'windows', 'nocontrol', 'ascii', 'lowercase', and 'uppercase'. The values 'unix' and 'windows' are mutually exclusive (one will override the other), as are 'lowercase' and 'uppercase'. Those last are special cases, as they do not change the set of characters that would be escaped, but rather force local file paths to be converted to lowercase or uppercase.

When 'unix' is specified, Wget escapes the character '/' and the control characters in the ranges 0-31 and 128-159. This is the default on Unix-like operating systems.

When 'windows' is given, Wget escapes the characters '\', '|', '/', ':', '?', '"', '*', '<', '>', and the control characters in the ranges 0-31 and 128-159. In addition, Wget in Windows mode uses '+' instead of ':' to separate host and port in local file names, and uses '@' instead of '?' to separate the query portion of the file name from the rest. Therefore, a URL that would be saved as 'www.xemacs.org:4300/search.pl?input=blah' in Unix mode is saved as 'www.xemacs.org+4300/search.pl@input=blah' in Windows mode. This mode is the default on Windows.
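The two Windows-specific separator substitutions from the example above can be sketched with sed; this toy pipeline handles only those two characters and is not wget's full escaping logic:

```shell
# Map the manual's example URL the way windows mode names the local file:
# ':' -> '+' (host/port separator), '?' -> '@' (query separator).
url='www.xemacs.org:4300/search.pl?input=blah'
printf '%s\n' "$url" | sed -e 's/:/+/' -e 's/?/@/'
```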

If you specify 'nocontrol', then the escaping of the control characters is also switched off. This option may make sense when you are downloading URLs whose names contain UTF-8 characters, on a system which can save and display file names in UTF-8 (some possible byte values used in UTF-8 byte sequences fall in the range of values designated by Wget as 'controls').

The 'ascii' mode is used to specify that any bytes whose values are outside the range of ASCII characters (that is, greater than 127) shall be escaped. This can be useful when saving file names whose encoding does not match the one used locally.

********************************************************************************

The final command:

wget --no-check-certificate --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains www.xxx.idv.tw --no-parent https://www.xxx.idv.tw/

(End)

Related posts

[Study] wget 1.21.1 website download tool: installation and testing (Rocky Linux 9.1)
https://shaurong.blogspot.com/2023/03/wget-1211-rocky-linux-91.html

[Study] Wget for Windows 1.21.3 trial
https://shaurong.blogspot.com/2023/03/wget-for-windows-1213.html

