[Study] Wget for Windows 1.21.3 Website Download Software: Installation and Trial
2023-03-15
Wget - Wikipedia, the free encyclopedia
https://zh.wikipedia.org/wiki/Wget
GNU Wget (usually just Wget) is a simple yet powerful free-software download tool, and part of the GNU Project. Its name combines "World Wide Web" and "Get", which also hints at the program's main function. It currently supports downloading over the three most common TCP/IP protocols: HTTP, HTTPS and FTP.
Wget: first appeared in 1995, written by Hrvoje Nikšić. Wget is a command-line tool maintained as part of the GNU Project. It supports several network protocols, including HTTP, HTTPS and FTP, and can recursively download an entire website. The current 1.x release at the time of writing is 1.21.3; the official site is https://www.gnu.org/software/wget/
GNU Wget: appeared in 1996. Wget and GNU Wget are the same piece of free software under two names. Early on the program was called simply Wget and was maintained by its author, Hrvoje Nikšić; after it was brought into the GNU Project it was renamed GNU Wget (1997), and only that name is officially maintained today. It supports the HTTP, HTTPS and FTP protocols, with features such as resumable downloads and recursive download. The GNU site lists wget 1.21 (2020-12-31) and wget2 2.0.1 (2022-05-27), downloadable from https://www.gnu.org/software/wget/
wget for Windows. Download sites:
https://gnuwin32.sourceforge.net/packages/wget.htm, wget 1.11.4 (2008-12-31)
https://eternallybored.org/misc/wget/, wget 1.21.3 (2022-03-12)
According to https://eternallybored.org/misc/wget/manual/wget-1.18.html:
"by Hrvoje Nikšić and others" and "GNU Wget was written by Hrvoje Nikšić"
According to https://en.wikipedia.org/wiki/Wget:
"the development of which commenced in late 1995"
********************************************************************************
wget.exe https://www.xxx.org.tw/  (downloads only that single page)
wget.exe https://www.xxx.org.tw/ -r  (recursive download)
wget --mirror -p --convert-links -P ./LOCAL URL  (mirror an entire site; see the equivalent expansion below)
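According to the --help output further below, --mirror is shorthand for -N -r -l inf --no-remove-listing, so these two commands should be equivalent (example.org is a placeholder, not a site tested in this post):
wget --mirror -p --convert-links -P ./mirror https://www.example.org/
wget -N -r -l inf --no-remove-listing -p --convert-links -P ./mirror https://www.example.org/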
$ wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains example.org \
--no-parent \
www.example.org/tutorials/html/
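Note that the trailing backslashes above are Unix shell line continuations. In a Windows Command Prompt the same command must be entered on a single line (or split with ^ instead of \). A single-line equivalent for cmd.exe:
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains example.org --no-parent www.example.org/tutorials/html/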
From this page: http://www.linuxjournal.com/content/downloading-entire-web-site-wget
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains sh168.osha.gov.tw --no-parent https://sh168.osha.gov.tw/default.aspx
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains www.xxx.idv.tw --no-parent https://www.xxx.idv.tw/ -P C:\Wget\Web3 -r
Incomplete or invalid multibyte sequence encountered
Incomplete or invalid multibyte sequence encountered
Many large files were not downloaded.
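To record exactly which URLs trigger these messages, the logging options from --help can be added; a sketch (the log file path is hypothetical):
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains www.xxx.idv.tw --no-parent https://www.xxx.idv.tw/ -P C:\Wget\Web3 -o C:\Wget\Web3.log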
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains xxx.idv.tw --no-parent https://www.xxx.idv.tw/ -P C:\Wget\Web4
Incomplete or invalid multibyte sequence encountered
Incomplete or invalid multibyte sequence encountered
Still, many large files were not downloaded; wget could not handle the Chinese file names.
Workaround for Chinese file names: --restrict-file-names=nocontrol
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=nocontrol --domains www.xxx.idv.tw --no-parent https://www.xxx.idv.tw/ -P C:\Wget\Web5 -r
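If nocontrol alone were not enough, the IRI options listed in the --help output below (--local-encoding, --remote-encoding) might also be worth trying; this variant is only a sketch, not tested in this post (the C:\Wget\Web5-iri path is hypothetical):
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=nocontrol --local-encoding=UTF-8 --remote-encoding=UTF-8 --domains www.xxx.idv.tw --no-parent https://www.xxx.idv.tw/ -P C:\Wget\Web5-iri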
-i, --input-file=FILE    download URLs found in local or external FILE
wget -i --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=nocontrol -L --domains xxx.idv.tw --no-parent https://www.xxx.idv.tw/ -P C:\Wget\Web8 -r
(As written this does not work as intended: -i expects a file name, so wget treats --recursive as the file to read URLs from.)
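For reference, a minimal sketch of how -i is normally used: put the start URLs, one per line, into a file (urls.txt here is a hypothetical name) and pass that file to --input-file:
echo https://www.xxx.idv.tw/> urls.txt
wget --input-file=urls.txt --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=nocontrol --no-parent -P C:\Wget\Web8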
wget "Incomplete or invalid multibyte sequence encountered" site:eternallybored.org
官方無解法
Do not check the server certificate:
wget --no-check-certificate --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains www.xxx.idv.tw --no-parent https://www.xxx.idv.tw/
Extension-related options:
-E, --adjust-extension save HTML/CSS documents with proper extensions
-A, --accept=LIST comma-separated list of accepted extensions
-R, --reject=LIST comma-separated list of rejected extensions
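For instance, -A limits a recursive download to certain file types; a sketch (example.org is a placeholder, untested here) that keeps only PDF and ZIP files:
wget -r --no-parent -A pdf,zip https://www.example.org/files/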
Include-related options:
--referer=URL include 'Referer: URL' header in HTTP request
-I, --include-directories=LIST list of allowed directories
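A sketch of -I restricting a recursive crawl to specific directories (example.org is a placeholder, not tested in this post):
wget -r --no-parent -I /tutorials,/images https://www.example.org/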
D:\SOFTWARE\Wget64>wget --help
GNU Wget 1.21.3, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit
  -h,  --help                      print this help
  -b,  --background                go to background after startup
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE
  -a,  --append-output=FILE        append messages to FILE
  -d,  --debug                     print lots of debugging information
  -q,  --quiet                     quiet (no output)
  -v,  --verbose                   be verbose (this is the default)
  -nv, --no-verbose                turn off verboseness, without being quiet
       --report-speed=TYPE         output bandwidth as TYPE. TYPE can be bits
  -i,  --input-file=FILE           download URLs found in local or external FILE
       --input-metalink=FILE       download files covered in local Metalink FILE
  -F,  --force-html                treat input file as HTML
  -B,  --base=URL                  resolves HTML input-file links (-i -F) relative to URL
       --config=FILE               specify config file to use
       --no-config                 do not read any config file
       --rejected-log=FILE         log reasons for URL rejection to FILE

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits)
       --retry-connrefused         retry even if connection is refused
       --retry-on-http-error=ERRORS    comma-separated list of HTTP errors to retry
  -O,  --output-document=FILE      write documents to FILE
  -nc, --no-clobber                skip downloads that would download to existing files (overwriting them)
       --no-netrc                  don't try to obtain credentials from .netrc
  -c,  --continue                  resume getting a partially-downloaded file
       --start-pos=OFFSET          start downloading from zero-based position OFFSET
       --progress=TYPE             select progress gauge type
       --show-progress             display the progress bar in any verbosity mode
  -N,  --timestamping              don't re-retrieve files unless newer than local
       --no-if-modified-since      don't use conditional if-modified-since get requests in timestamping mode
       --no-use-server-timestamps  don't set the local file's timestamp by the one on the server
  -S,  --server-response           print server response
       --spider                    don't download anything
  -T,  --timeout=SECONDS           set all timeout values to SECONDS
       --dns-servers=ADDRESSES     list of DNS servers to query (comma separated)
       --bind-dns-address=ADDRESS  bind DNS resolver to ADDRESS (hostname or IP) on local host
       --dns-timeout=SECS          set the DNS lookup timeout to SECS
       --connect-timeout=SECS      set the connect timeout to SECS
       --read-timeout=SECS         set the read timeout to SECS
  -w,  --wait=SECONDS              wait SECONDS between retrievals (applies if more then 1 URL is to be retrieved)
       --waitretry=SECONDS         wait 1..SECONDS between retries of a retrieval (applies if more then 1 URL is to be retrieved)
       --random-wait               wait from 0.5*WAIT...1.5*WAIT secs between retrievals (applies if more then 1 URL is to be retrieved)
       --no-proxy                  explicitly turn off proxy
  -Q,  --quota=NUMBER              set retrieval quota to NUMBER
       --bind-address=ADDRESS      bind to ADDRESS (hostname or IP) on local host
       --limit-rate=RATE           limit download rate to RATE
       --no-dns-cache              disable caching DNS lookups
       --restrict-file-names=OS    restrict chars in file names to ones OS allows
       --ignore-case               ignore case when matching files/directories
  -4,  --inet4-only                connect only to IPv4 addresses
  -6,  --inet6-only                connect only to IPv6 addresses
       --prefer-family=FAMILY      connect first to addresses of specified family, one of IPv6, IPv4, or none
       --user=USER                 set both ftp and http user to USER
       --password=PASS             set both ftp and http password to PASS
       --ask-password              prompt for passwords
       --use-askpass=COMMAND       specify credential handler for requesting username and password.
                                     If no COMMAND is specified the WGET_ASKPASS or the SSH_ASKPASS
                                     environment variable is used.
       --no-iri                    turn off IRI support
       --local-encoding=ENC        use ENC as the local encoding for IRIs
       --remote-encoding=ENC       use ENC as the default remote encoding
       --unlink                    remove file before clobber
       --keep-badhash              keep files with checksum mismatch (append .badhash)
       --metalink-index=NUMBER     Metalink application/metalink4+xml metaurl ordinal NUMBER
       --metalink-over-http        use Metalink metadata from HTTP response headers
       --preferred-location        preferred location for Metalink resources

Directories:
  -nd, --no-directories            don't create directories
  -x,  --force-directories         force creation of directories
  -nH, --no-host-directories       don't create host directories
       --protocol-directories      use protocol name in directories
  -P,  --directory-prefix=PREFIX   save files to PREFIX/..
       --cut-dirs=NUMBER           ignore NUMBER remote directory components

HTTP options:
       --http-user=USER            set http user to USER
       --http-password=PASS        set http password to PASS
       --no-cache                  disallow server-cached data
       --default-page=NAME         change the default page name (normally this is 'index.html'.)
  -E,  --adjust-extension          save HTML/CSS documents with proper extensions
       --ignore-length             ignore 'Content-Length' header field
       --header=STRING             insert STRING among the headers
       --compression=TYPE          choose compression, one of auto, gzip and none. (default: none)
       --max-redirect              maximum redirections allowed per page
       --proxy-user=USER           set USER as proxy username
       --proxy-password=PASS       set PASS as proxy password
       --referer=URL               include 'Referer: URL' header in HTTP request
       --save-headers              save the HTTP headers to file
  -U,  --user-agent=AGENT          identify as AGENT instead of Wget/VERSION
       --no-http-keep-alive        disable HTTP keep-alive (persistent connections)
       --no-cookies                don't use cookies
       --load-cookies=FILE         load cookies from FILE before session
       --save-cookies=FILE         save cookies to FILE after session
       --keep-session-cookies      load and save session (non-permanent) cookies
       --post-data=STRING          use the POST method; send STRING as the data
       --post-file=FILE            use the POST method; send contents of FILE
       --method=HTTPMethod         use method "HTTPMethod" in the request
       --body-data=STRING          send STRING as data. --method MUST be set
       --body-file=FILE            send contents of FILE. --method MUST be set
       --content-disposition       honor the Content-Disposition header when choosing local file names (EXPERIMENTAL)
       --content-on-error          output the received content on server errors
       --auth-no-challenge         send Basic HTTP authentication information without first waiting for the server's challenge

HTTPS (SSL/TLS) options:
       --secure-protocol=PR        choose secure protocol, one of auto, SSLv2, SSLv3, TLSv1, TLSv1_1, TLSv1_2, TLSv1_3 and PFS
       --https-only                only follow secure HTTPS links
       --no-check-certificate      don't validate the server's certificate
       --certificate=FILE          client certificate file
       --certificate-type=TYPE     client certificate type, PEM or DER
       --private-key=FILE          private key file
       --private-key-type=TYPE     private key type, PEM or DER
       --ca-certificate=FILE       file with the bundle of CAs
       --ca-directory=DIR          directory where hash list of CAs is stored
       --crl-file=FILE             file with bundle of CRLs
       --pinnedpubkey=FILE/HASHES  Public key (PEM/DER) file, or any number of base64 encoded sha256
                                     hashes preceded by 'sha256//' and separated by ';', to verify peer against
       --random-file=FILE          file with random data for seeding the SSL PRNG
       --ciphers=STR               Set the priority string (GnuTLS) or cipher list string (OpenSSL) directly.
                                     Use with care. This option overrides --secure-protocol. The format and
                                     syntax of this string depend on the specific SSL/TLS engine.

HSTS options:
       --no-hsts                   disable HSTS
       --hsts-file                 path of HSTS database (will override default)

FTP options:
       --ftp-user=USER             set ftp user to USER
       --ftp-password=PASS         set ftp password to PASS
       --no-remove-listing         don't remove '.listing' files
       --no-glob                   turn off FTP file name globbing
       --no-passive-ftp            disable the "passive" transfer mode
       --preserve-permissions      preserve remote file permissions
       --retr-symlinks             when recursing, get linked-to files (not dir)

FTPS options:
       --ftps-implicit                 use implicit FTPS (default port is 990)
       --ftps-resume-ssl               resume the SSL/TLS session started in the control connection when opening a data connection
       --ftps-clear-data-connection    cipher the control channel only; all the data will be in plaintext
       --ftps-fallback-to-ftp          fall back to FTP if FTPS is not supported in the target server

WARC options:
       --warc-file=FILENAME        save request/response data to a .warc.gz file
       --warc-header=STRING        insert STRING into the warcinfo record
       --warc-max-size=NUMBER      set maximum size of WARC files to NUMBER
       --warc-cdx                  write CDX index files
       --warc-dedup=FILENAME       do not store records listed in this CDX file
       --no-warc-compression       do not compress WARC files with GZIP
       --no-warc-digests           do not calculate SHA1 digests
       --no-warc-keep-log          do not store the log file in a WARC record
       --warc-tempdir=DIRECTORY    location for temporary files created by the WARC writer

Recursive download:
  -r,  --recursive                 specify recursive download
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite)
       --delete-after              delete files locally after downloading them
  -k,  --convert-links             make links in downloaded HTML or CSS point to local files
       --convert-file-only         convert the file part of the URLs only (usually known as the basename)
       --backups=N                 before writing file X, rotate up to N backup files
  -K,  --backup-converted          before converting file X, back up as X.orig
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing
  -p,  --page-requisites           get all images, etc. needed to display HTML page
       --strict-comments           turn on strict (SGML) handling of HTML comments

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions
  -R,  --reject=LIST               comma-separated list of rejected extensions
       --accept-regex=REGEX        regex matching accepted URLs
       --reject-regex=REGEX        regex matching rejected URLs
       --regex-type=TYPE           regex type (posix|pcre)
  -D,  --domains=LIST              comma-separated list of accepted domains
       --exclude-domains=LIST      comma-separated list of rejected domains
       --follow-ftp                follow FTP links from HTML documents
       --follow-tags=LIST          comma-separated list of followed HTML tags
       --ignore-tags=LIST          comma-separated list of ignored HTML tags
  -H,  --span-hosts                go to foreign hosts when recursive
  -L,  --relative                  follow relative links only
  -I,  --include-directories=LIST  list of allowed directories
       --trust-server-names        use the name specified by the redirection URL's last component
  -X,  --exclude-directories=LIST  list of excluded directories
  -np, --no-parent                 don't ascend to the parent directory

Email bug reports, questions, discussions to <bug-wget@gnu.org>
and/or open issues at https://savannah.gnu.org/bugs/?func=additem&group=wget.

D:\SOFTWARE\Wget64>
https://www.gnu.org/software/wget/manual/wget.html
********************************************************************************
‘--restrict-file-names=modes’
Change which characters found in remote URLs must be escaped during generation of local filenames. Characters that are restricted by this option are escaped, i.e. replaced with '%HH', where 'HH' is the hexadecimal number corresponding to the restricted character. This option may also be used to force all alphabetical cases to be either lower- or uppercase.
By default, Wget escapes the characters that are not valid or safe as part of file names on your operating system, as well as control characters that are typically unprintable. This option is useful for changing these defaults, perhaps because you are downloading to a non-native partition, or because you want to disable escaping of the control characters, or you want to further restrict characters to only those in the ASCII range of values.
The modes are a comma-separated set of text values. The acceptable values are 'unix', 'windows', 'nocontrol', 'ascii', 'lowercase' and 'uppercase'. The values 'unix' and 'windows' are mutually exclusive (one will override the other), as are 'lowercase' and 'uppercase'. Those last are special cases, as they do not change the set of characters that would be escaped, but rather force local file paths to be converted to lower- or uppercase.
When 'unix' is specified, Wget escapes the character '/' and the control characters in the ranges 0-31 and 128-159. This is the default on Unix-like operating systems.
When 'windows' is given, Wget escapes the characters '\', '|', '/', ':', '?', '"', '*', '<', '>' and the control characters in the ranges 0-31 and 128-159. In addition, Wget in Windows mode uses '+' instead of ':' to separate host and port in local file names, and uses '@' instead of '?' to separate the query portion of the file name from the rest. Therefore, a URL that would be saved as 'www.xemacs.org:4300/search.pl?input=blah' in Unix mode is saved as 'www.xemacs.org+4300/search.pl@input=blah' in Windows mode. This mode is the default on Windows.
If you specify 'nocontrol', then the escaping of the control characters is also switched off. This option may make sense when you are downloading URLs whose names contain UTF-8 characters, on a system that can save and display filenames in UTF-8 (some possible byte values used in UTF-8 byte sequences fall in the range of values designated by Wget as 'controls').
The 'ascii' mode is used to specify that any bytes whose values are outside the range of ASCII characters (that is, greater than 127) shall be escaped. This can be useful when saving filenames whose encoding does not match the one used locally.
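Since the modes are a comma-separated set, they can be combined. A sketch (untested in this post, example.org is a placeholder) that keeps Windows-illegal characters escaped while switching off control-character escaping, so UTF-8 (e.g. Chinese) file names survive:
wget -r --restrict-file-names=windows,nocontrol https://www.example.org/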
********************************************************************************
Final command:
wget --no-check-certificate --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains www.xxx.idv.tw --no-parent https://www.xxx.idv.tw/
(For a site with Chinese file names, replace --restrict-file-names=windows with --restrict-file-names=nocontrol, as found above.)
(End)
Related:
[Study] wget 1.21.1 Website Download Software Installation and Testing (Rocky Linux 9.1)
https://shaurong.blogspot.com/2023/03/wget-1211-rocky-linux-91.html
[Study] Wget for Windows 1.21.3 Trial
https://shaurong.blogspot.com/2023/03/wget-for-windows-1213.html