Optimizing Web Scraping from 15 Hours to 3 Minutes

Web scraping is a great way to learn about full-stack development. How does an API work? What happens when I hit a URL? How does the content on my webpage get loaded?

FUN FACT:
There’s no foolproof way to programmatically determine if a page is being scraped.
(Image generated by Aayush Ostwal)

An Example

The example I will demonstrate applies an optimization to web scraping logic. Let us look at the problem: we need to scrape the current price of TATA STEEL (an Indian stock) from this link. There are two ways to do that:

HIT URL AND PARSE HTML

I can simply hit the URL and parse the HTML in Python using Beautiful Soup, because when I inspect the page I can tell which tag holds the current price.

Code for Scraping URL Directly (Created by Aayush Ostwal)
Current Price of TATA STEEL is 902.10
Time for scraping using HTML Parsing 1.0405778884887695 sec
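The original code was shown as an image, so here is a minimal sketch of the same idea rather than the author's exact script. It uses the standard library's html.parser in place of Beautiful Soup so that it runs without third-party packages. The page URL (pieced together from the domain and path quoted in this article, with an assumed scheme and "www." prefix) and the element id "price" are assumptions; inspect the page to find the real tag.

```python
import time
import urllib.request
from html.parser import HTMLParser

# Assumption: URL assembled from the domain and path quoted in the article.
PAGE_URL = "https://www.moneycontrol.com/india/stockpricequote/ironsteel/tatasteel/TIS"


class PriceParser(HTMLParser):
    """Collects the text inside the first element whose id equals target_id.

    Assumes the tags nested inside that element are properly closed.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []    # text fragments found inside the target
        self.done = False   # True once the target element has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_price(html, target_id="price"):
    """Return the price in `html` as a float (hypothetical id "price").

    Raises ValueError if the element is missing or does not hold a number.
    """
    parser = PriceParser(target_id)
    parser.feed(html)
    text = "".join(parser.chunks).strip().replace(",", "")
    return float(text)


def scrape_price_from_html(url=PAGE_URL, target_id="price"):
    """Download the full page, then parse the price out of its HTML."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_price(html, target_id)


def main():
    """Time one scrape, mirroring the output format shown above."""
    start = time.time()
    price = scrape_price_from_html()
    print(f"Current Price of TATA STEEL is {price:.2f}")
    print(f"Time for scraping using HTML Parsing {time.time() - start} sec")
```

Calling main() performs the download and prints the timing; extract_price can be tried offline on any saved HTML string.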
  • When you hit the URL, a lot of activity happens in the background.
Console View After URL Gets Hit (Created by Aayush Ostwal)
  • If you closely review all the requests in the console, you will notice the following types of requests:
    JavaScript: scripts used for fetching data from the database, opening WebSocket connections, rendering page HTML at the client’s end, and much more.
    Document: contains some HTML templates and pre-rendered (rendered at the server’s end) HTML.
    Others: includes logos, CSS for the HTML, ads, and so on.
  1. Hitting a URL tells the server which script to run and with what parameters. Here
    moneycontrol.com is the domain.
    /india/stockpricequote/ironsteel represents the path of the API.
    /tatasteel/TIS represents the parameters passed to that code.
  2. Once that code runs, it sends a number of responses to the client. Those responses, along with the HTML, are then rendered at the client's end.
  3. That rendered HTML is your webpage.
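The URL breakdown in step 1 can be sketched with Python's urllib.parse. The full URL below is an assumption assembled from the pieces quoted above (scheme and "www." prefix guessed), and treating the first three path segments as the "API path" simply follows this article's reading of the URL; it is not something the URL itself encodes.

```python
from urllib.parse import urlsplit

# Assumption: URL assembled from the domain and path quoted in the article.
URL = "https://www.moneycontrol.com/india/stockpricequote/ironsteel/tatasteel/TIS"


def describe_url(url, route_segments=3):
    """Split a URL into domain, API path, and trailing path parameters.

    route_segments is a hypothetical knob: how many leading path segments
    are treated as the route; the remaining segments count as parameters.
    """
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    return {
        "domain": parts.netloc,
        "api_path": "/" + "/".join(segments[:route_segments]),
        "parameters": segments[route_segments:],
    }
```

For the URL above, describe_url(URL) separates /india/stockpricequote/ironsteel from the tatasteel and TIS parameters.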
API response responsible for populating the current price on screen (Created by Aayush Ostwal)
Requesting only this API response, instead of the whole page, has three advantages:
  1. The response is fast
    The server is asked to serve just one of the many requests we can see in the console. With less load on the server, it can process the request very quickly.
  2. Saves rendering time
    You do not have to wait for those hundred-odd requests to load and for the HTML to render.
  3. Response manipulation is easy
    The page HTML is around 3 MB, while the API response is only 1.7 KB. Hence, loading this response in Python code saves a lot of time.

HIT API AND LOAD RESPONSE

Let's scrape the data using the API response.

Code for Scraping Page using API Call (Created by Aayush Ostwal)
Current Price of TATA STEEL is 902.10
Time for scraping using API Call 0.19942188262939453 sec
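Again the original code was an image, so this is only a sketch of what an API-based scrape might look like. The endpoint URL and the JSON field names ("data", "pricecurrent") are hypothetical placeholders; the real ones have to be read off the browser's network console for the request that populates the price.

```python
import json
import time
import urllib.request

# Hypothetical endpoint: replace with the request URL seen in the console.
API_URL = "https://example.com/api/quote?symbol=TIS"


def parse_price(payload, key_path=("data", "pricecurrent")):
    """Walk key_path through the JSON payload and return the price as a float.

    key_path is an assumption about the response layout, not the site's
    documented schema.
    """
    node = json.loads(payload)
    for key in key_path:
        node = node[key]
    return float(str(node).replace(",", ""))


def scrape_price_from_api(url=API_URL, key_path=("data", "pricecurrent")):
    """Request only the small JSON response instead of the whole page."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = response.read().decode("utf-8")
    return parse_price(payload, key_path)


def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Wrapping either scraping function in timed() gives a like-for-like comparison of the two approaches on the same machine and network.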

Does an API call always work?

The straight answer is NO!

HTML is Rendered at Server

If the HTML is rendered at the server, then there will be no API calls for the different segments of the page. There will be just one call that returns the entire HTML to you.

Authentication in Navigation

Some websites require cookies and authentication for an API response. This is a matter of security. These are also the only ways for websites to prevent a bot from scraping the data.
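When an endpoint does expect cookies or a token, the request has to carry them explicitly. The sketch below only builds such a request with the standard library; the cookie names, the Bearer token scheme, and the URL in the usage note are illustrative assumptions, since every site defines its own authentication.

```python
import urllib.request


def build_authenticated_request(url, cookies=None, token=None):
    """Build a urllib Request carrying optional cookies and an auth token.

    cookies: dict of cookie name -> value, copied from a logged-in session.
    token:   sent as "Authorization: Bearer <token>" (an assumed scheme).
    """
    headers = {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}
    if cookies:
        headers["Cookie"] = "; ".join(
            f"{name}={value}" for name, value in cookies.items()
        )
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)
```

For example, build_authenticated_request("https://example.com/api", {"sessionid": "abc"}) returns a request that can be passed to urllib.request.urlopen. Only scrape endpoints you are permitted to access.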

Key Take-Aways

  • Before scraping any website, you should understand the structure of the website.
  • The first preference should be to simulate API calls, obviously!


AI Engineer at Qure.ai| Enthusiastic ML practitioner | IIT Kanpur | Drama Lover | Subscribe https://www.youtube.com/channel/UCqq_T7ktsZO62k7CaibgQvA
