
Wget: How To Scrape Any Website From the Terminal?

Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Wget is non-interactive, meaning that it can work in the background while the user is not logged on. This allows you to start a retrieval and disconnect from the system, letting Wget finish the work. By contrast, most Web browsers require the user's constant presence, which can be a great hindrance when transferring a lot of data.

Today I will show you how I scraped my own website. Web scraping simply means downloading data from an online platform through automated software. I've added a screenshot of the download activity produced by wget. The command below looks kind of lengthy. The command I used is:

sudo wget --show-progress --progress=dot --verbose --recursive --retry-connrefused -x --level=5 https://www.techjhola.com

I'll now break down the command and describe the use of each parameter within it.

Basically, to download a website it is enough to run the default command wget www.sitename.com. But you also want to see whether the download is actually making progress, so you add the --show-progress parameter. I used --show-progress along with --progress=dot, which displays the progress as dots.
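For instance, a minimal run with just these two progress options (using https://www.example.com as a placeholder URL, not a site from the original command) would look something like:

wget --show-progress --progress=dot https://www.example.com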

Now, a website keeps the files it serves in different directories, so I added the --recursive parameter to follow links and fetch them all. -x (short for --force-directories) makes wget recreate that directory structure locally, so every file is saved under a folder hierarchy mirroring the site instead of being dumped flat.
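As a rough sketch, the recursive part on its own (again with a placeholder URL) would be:

wget --recursive -x https://www.example.com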

--retry-connrefused is a useful parameter that tells wget to keep retrying when the website refuses the connection, instead of giving up on the first refusal, until a connection is established.
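For an unreliable server you could pair it with --tries, which limits the number of attempts; this pairing is my own suggestion rather than part of the command above:

wget --retry-connrefused --tries=5 https://www.example.com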

Wget has a special parameter, --level, which you shouldn't skip while downloading a website that has many in-depth URLs within an article or a page.

Suppose you use --level=3. The download begins with the domain, i.e. www.techjhola.com, and fetches all of its data. If it finds any content with a weblink, it downloads everything inside that weblink as well, and the same is repeated until it reaches the final level, i.e. 3.
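Putting that together, a depth-limited recursive download of the site from this article would look like the sketch below (note that --level only takes effect when combined with --recursive):

wget --recursive --level=3 https://www.techjhola.com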

There is a huge bunch of other parameters that you can add to the command. To learn about the tool, type man wget in your terminal.
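If the manual page is not installed on your system, wget --help prints a summary of the options straight to the terminal:

man wget
wget --help | less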

Offline Wikipedia in HeNN E-Library Project

There are more than 35 E-Libraries set up in Nepal by Help Nepal Network, which run on Linux machines (Edubuntu: the educational distribution of Ubuntu). What we, the volunteers from Kathmandu University Open Source Community, have done is download GBs of Wikipedia content over the high-bandwidth university Internet and add it to the E-Libraries. The volunteers update the Wikipedia content every three months, when they visit the sites for monitoring.