Optimizing Web Scraping from 15 Hours to 3 Minutes
Web scraping is a great way to learn about full-stack development. How does an API work? What happens when I hit a URL? How does the content on my webpage get loaded?
In a coding interview, there is one follow-up question that interviewers always ask while you solve a problem: can you optimize the logic further? Suppose you answered it and got the job, and then BOOOM!! Nobody talks about optimizing code during product development. All you have is a task; you nail it, and the job is done. Then what is the point of learning all that stuff if, in the end, you are just implementing an unoptimized, brute-force algorithm?
Well, let me set it straight! If you think what I described above is true, you are naive and new to the industry. It usually does not happen that way. Optimization is key to making your product stand out in this competitive world. During a proof of concept, we may implement a simple solution. But when it comes to production, optimization and scaling do matter.
Web scraping is one such domain that, I find, is a great place to learn about full-stack development. How does an API work? What happens when I hit a URL? How does the content on my webpage get loaded? The answers to these questions can take your web scraping code from 15 hours to maybe 2 hours. A massive upgrade!!!
The example I will demonstrate applies optimization to web scraping logic. Let us look at the problem. We need to scrape the current price of TATA STEEL (an Indian stock) from this link. There are two ways to do that:
HIT URL AND PARSE HTML
I can simply hit the URL and parse the HTML in Python using Beautiful Soup, because when I inspect the page, I can tell which tag holds the current price.
Below is the Python snippet that follows the above logic.
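A minimal sketch of this approach could look like the following. It makes some assumptions: the page URL is assembled from the domain, path, and parameters discussed in this article (the `https://www.` prefix is a guess), and the tag name and id used to locate the price are hypothetical placeholders that you would replace after inspecting the live page.

```python
import time
import urllib.request

from bs4 import BeautifulSoup

# Assembled from the URL pieces discussed in this article;
# the "https://www." prefix is an assumption.
PAGE_URL = "https://www.moneycontrol.com/india/stockpricequote/ironsteel/tatasteel/TIS"

# Hypothetical selector: inspect the live page to find the real tag and id.
PRICE_TAG = "div"
PRICE_ID = "current-price"


def fetch_html(url):
    """Download the full page HTML as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_price(html):
    """Return the text of the price tag, or None if the tag is missing."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(PRICE_TAG, id=PRICE_ID)
    return tag.get_text(strip=True) if tag else None


def scrape_price_from_page(url=PAGE_URL):
    """Hit the URL, parse the HTML, and return (price, elapsed seconds)."""
    start = time.time()
    price = extract_price(fetch_html(url))
    elapsed = time.time() - start
    return price, elapsed
```

Calling `scrape_price_from_page()` and printing the two returned values gives output in the shape shown next, provided the selector matches the real page.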
The output of the above code snippet is: (time may vary as it depends on website traffic and internet speed)
Current Price of TATA STEEL is 902.10
Time for scraping using HTML Parsing 1.0405778884887695 sec
But when you closely look at the structure of the website, you may notice the following things.
- When you hit the URL, a lot of activity happens in the background
- If you closely review all the requests in the console, you will notice the following types of requests:
Document: This contains some HTML templates and pre-rendered HTML (rendered at the server's end).
Others: These include logos, CSS for the HTML, ads, and so on.
My point here is to build a story.
- Hitting a URL is the way to tell the server which script to run and with what parameters. Here
moneycontrol.com is the domain.
/india/stockpricequote/ironsteel represents the path of the API.
/tatasteel/TIS represents the parameters passed to the code.
- Once that code runs, it sends a number of responses to the client. Those responses along with HTML are then rendered at the client's end.
- That rendered HTML is your webpage.
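The breakdown above can be sketched with Python's standard `urllib.parse`. Treat this as an illustration only: the split into "API path" and "parameters" is this article's reading of the URL and not something the URL itself encodes, `path_depth` is a made-up parameter for the example, and the `https://www.` prefix is an assumption.

```python
from urllib.parse import urlsplit

# Assembled from the pieces listed above; the "https://www." prefix is an assumption.
URL = "https://www.moneycontrol.com/india/stockpricequote/ironsteel/tatasteel/TIS"


def describe_url(url, path_depth=3):
    """Split a URL into domain, API path, and trailing parameters.

    path_depth says how many leading path segments are treated as the
    API path; the rest are treated as parameters (an illustrative choice).
    """
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    return {
        "domain": parts.netloc,
        "api_path": "/" + "/".join(segments[:path_depth]),
        "parameters": "/" + "/".join(segments[path_depth:]),
    }
```

For the URL above, `describe_url(URL)` reports the domain, then `/india/stockpricequote/ironsteel` as the path and `/tatasteel/TIS` as the parameters.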
Now, the beauty is that to get the current price, you do not need to wait for the whole webpage to render. The API responsible for the current price can do the job on its own.
When you filter through all the requests, you will find a request named TIS, which contains information on the current price. The Request URL of this request can be used as a direct API for getting the current price.
Hitting the API directly has the following effects:
- The response is fast
The server is asked to serve just 1 request out of the many we see in the console. With less load on the server, it can respond much faster.
- Saves Rendering Time
You do not have to wait for those hundreds of requests to load and for the HTML to render.
- Response manipulation is easy
The page HTML is around 3 MB, while the API response is only 1.7 KB. Hence, loading this response in Python code saves a lot of time.
HIT API AND LOAD RESPONSE
Let's scrape the data using the API response.
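A minimal sketch of this approach, using only the standard library, might look like the following. The endpoint here is a placeholder and not the real one: you would paste in the Request URL of the TIS request copied from the browser. The JSON layout (`data` containing a `pricecurrent` field) is also an assumption for illustration, so check the actual response before relying on it.

```python
import json
import time
import urllib.request

# Placeholder only: replace with the Request URL of the "TIS" request
# copied from the browser. This example.com address is NOT the real endpoint.
API_URL = "https://example.com/pricefeed/TIS"

# Assumed field name inside an assumed "data" object; verify against the real JSON.
PRICE_FIELD = "pricecurrent"


def fetch_json(url):
    """Call the API and decode its JSON body."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_price_from_json(payload):
    """Pull the price out of a decoded payload, or None if it is absent."""
    return payload.get("data", {}).get(PRICE_FIELD)


def scrape_price_from_api(url=API_URL):
    """Hit the API, read the price, and return (price, elapsed seconds)."""
    start = time.time()
    price = extract_price_from_json(fetch_json(url))
    elapsed = time.time() - start
    return price, elapsed
```

There is no HTML tree to build or search here: the price is a single dictionary lookup on a small payload.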
As expected, the output of the code is:
Current Price of TATA STEEL is 902.10
Time for scraping using API Call 0.19942188262939453 sec
A massive upgrade from 1.04 seconds to 0.2 seconds. This is the power of API calls: they can dramatically reduce the time it takes to load the response. The response is also clean, since you do not have to parse hundreds of tags to get the data.
Does the API call always work?
The straight answer is NO!
HTML is Rendered at Server
If the HTML gets rendered at the server, then there will be no API calls for different segments of the page. There will be just one call that returns the entire HTML to you.
In this case, you have to wait for the full HTML to load and then parse it using tools like Beautiful Soup.
Authentication in Navigation
Some websites need cookies and authentication before they return an API response. This is a matter of security. It is also one of the ways websites prevent a bot from scraping their data.
In this case, you can automate a few clicks using Selenium so that, once the cookies are loaded in the background, you can hit the API millions of times. Rendering the page then becomes a one-time effort, which is affordable!
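As a rough sketch of the cookie hand-off, the helper below turns Selenium-style cookie dictionaries into a `Cookie` header and reuses it for plain API calls. It assumes the site accepts cookies sent this way; real sites may also check other headers or expire the cookies, so this will not work everywhere. The function names are made up for this example.

```python
import urllib.request


def build_cookie_header(cookies):
    """Turn a list of cookie dicts (each with 'name' and 'value' keys,
    the shape Selenium's driver.get_cookies() returns) into one
    Cookie header value."""
    return "; ".join(f"{cookie['name']}={cookie['value']}" for cookie in cookies)


def fetch_with_cookies(url, cookies):
    """Call the API while sending the cookies collected from the browser."""
    request = urllib.request.Request(
        url,
        headers={
            "Cookie": build_cookie_header(cookies),
            "User-Agent": "Mozilla/5.0",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

With Selenium, you would load the page once with `driver.get(...)`, read `driver.get_cookies()`, and pass that list to `fetch_with_cookies` for every later API call.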
- Before scraping any website, you should understand the structure of the website.
- The first preference should be to simulate API calls, obviously!!
I hope you find this article useful and value-adding. Thanks for reading!