Advantages of web scraping
Collecting data in an automated manner has numerous advantages, including:
orders of magnitude shorter data acquisition time - a human stands no chance against the speed and precision of a machine,
scalability - machines allow data to be acquired on a large scale,
low cost of obtaining information - depending on the scale, even the smallest VPS costing a few zlotys may be enough,
data integrity - a human can make mistakes when transferring data, while a machine will do it flawlessly, provided it has been properly prepared for the process,
flexibility - a scraper can be adapted to your needs in a virtually unlimited way,
low maintenance costs - the structure of the website or its API may change over time, but this usually requires only minor changes to the bot's logic or rules,
structured data - thanks to the imposed structure, the data can be processed by other software, e.g. data analysis tools.
Disadvantages of web scraping
Unfortunately, web scraping also has its drawbacks:
if our needs require a highly customized solution, building a bot ourselves in a short time, or finding a ready-made one, may exceed our willingness or abilities,
data collection bots usually need to be monitored and verified (especially in the initial stage of acquisition) for the correctness of the collected data. Early versions of a bot rarely cover every case, so sooner or later it is likely to encounter a structure it cannot handle correctly. Monitoring mainly concerns projects in which data is collected over many days, weeks, months or even years,
some websites "don't like" being scraped and try to block suspicious activity - this is one of the most problematic factors when building web scrapers, but new ways of bypassing such protections keep appearing. A pool of proxy servers is quite commonly used nowadays: intermediary servers take part in fetching the data, so by design the requests should not arouse the suspicion of the server hosting the data we care about. A minimal example of routing requests through such a pool is sketched below.
Is web scraping legal?
The wide availability of data on the Internet creates the temptation to use it commercially. It is easy to imagine that data obtained this way could be used to build all kinds of price comparison websites. Even though such data is publicly available, the question arises whether its reuse (often for commercial purposes) is legal.
Is it completely legal to collect and process data downloaded from the Internet for internal or external purposes? There are certain regulations that should be taken into account before we start collecting and storing data. Although web scraping appears to be entirely legal, some legal risk remains. If a database is protected under the sui generis right, i.e. its owner has made a significant investment (and is able to prove it) in creating or obtaining the contents of the database, then we may not extract all or a substantial part of the contents of such a database. In any case, before starting automated data collection it is worth familiarizing yourself with the following (Polish) legal acts:
Act on Access to Public Information
Open Data and Reuse of Public Sector Information Act
Copyright and Related Rights Act
Database Protection Act
Chapter XXXIII of the Penal Code
It may also turn out that the entire undertaking of building a web scraper is pointless, because the data we are interested in may already be available elsewhere, e.g. on government websites such as dane.gov.pl and danepubliczne.imgw.pl.
Another issue is the terms of service, which are individual to each website. For example, the terms of service of opineo.pl clearly state that Opineo prohibits the use of automated IT solutions that allow the service's content to be downloaded and aggregated automatically. The terms also prohibit destabilizing the service with bots and similar tools. At this point it is worth moving smoothly on to how to obtain data without exposing services to harm or negatively affecting their operation.
Is web scraping harmful?
Web scraping can unfortunately be harmful to the service being scraped. In order not to harm the server where the data is located, it is worth following a few important rules:
Bots should follow the rules in the robots.txt file - in this file, the creators of a website can specify which resources should not be downloaded by robots. It is also worth respecting the Crawl-delay directive, whose value specifies how long to wait between consecutive requests - sending requests too frequently most often results in a server response informing about too many queries (response code 429, Too Many Requests). A minimal sketch of a "polite" scraper that respects these rules is shown after this list.
If the service provides an API, it is definitely better to use this method of obtaining data, as it is much less of a burden on the server than downloading the entire HTML structure. It may also turn out that the API offers more possibilities (e.g. downloading more products at once) than the website interface offers to the user.
It is best to scrape during the hours when the server is least busy - usually at night. The server will be serving significantly fewer real users at that time.
Do not use more resources than is reasonable - if some resources are downloaded from a CDN server, the risk of harming the service is much lower.
Servers are much more efficient these days and have more resources in the form of RAM and CPU power, but it is still good practice to keep the HTTP(S) connection alive, because establishing a connection takes a relatively long time. Keeping alive a single connection over which many different resources are downloaded is the preferred option, because it is much faster - see the sketch after this list.
Only obtain publicly available information and do not process personal data. More information on this can be found in the legal acts listed above.
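Below is a minimal sketch, in Python, of a "polite" scraper that follows the rules above: it reads robots.txt, respects Disallow rules and the Crawl-delay directive, reuses a single keep-alive connection, and backs off on a 429 response. The domain, paths and user agent are hypothetical and only illustrate the idea.

import time
import urllib.robotparser

import requests

BASE = "https://example.com"                              # hypothetical target site
USER_AGENT = "my-research-bot/1.0 (contact@example.com)"  # identify your bot

# Read robots.txt and respect the Disallow rules and the Crawl-delay directive.
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{BASE}/robots.txt")
robots.read()
delay = robots.crawl_delay(USER_AGENT) or 1               # default to 1 s between requests

# Reuse one HTTP(S) connection (keep-alive) instead of reconnecting for every
# resource - requests.Session pools connections automatically.
session = requests.Session()
session.headers["User-Agent"] = USER_AGENT

# If the site exposes a JSON API, prefer it over downloading full HTML, e.g.:
# products = session.get(f"{BASE}/api/products", timeout=10).json()

urls = [f"{BASE}/products?page={page}" for page in range(1, 4)]  # example URLs
for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        continue                                          # robots.txt forbids this path
    response = session.get(url, timeout=10)
    if response.status_code == 429:                       # Too Many Requests
        time.sleep(60)                                    # back off before continuing
        continue
    # ... parse response.text here ...
    time.sleep(delay)                                     # respect the Crawl-delay value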
The rest of the article should be of interest to those who would like to learn more about technical issues.
A quick overview of web scraping tools
Web scraping using Google Sheets
[Figure: Web scraping in Google Sheets - using Google Sheets as a tool for scraping data from websites; here, downloading the home-page titles of popular news portals using the XPath language]
Using Google Sheets - a popular cloud tool - we can create a simple web scraper that extracts the data we are interested in from the web. The tool provides several functions for downloading data located at a given URL. These are:
IMPORTHTML,
IMPORTFEED,
IMPORTDATA,
IMPORTXML - the most flexible of the listed functions.
To make full use of these functions - IMPORTXML in particular - you need to know the basics of XPath, which lets you indicate which data to extract from a given XML or HTML structure. Here are some additional functions that are useful when building web scrapers in Google Sheets: TRANSPOSE, ISURL, HYPERLINK, DETECTLANGUAGE, INDEX, REGEXREPLACE.
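For example, a single IMPORTXML formula entered in a cell can pull a page's main heading into the sheet; the URL below is only an example:

=IMPORTXML("https://en.wikipedia.org/wiki/Web_scraping", "//h1")

Depending on your spreadsheet locale, the arguments are separated with a comma or a semicolon.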
Google Sheets offers a surprising amount of functionality for occasional, small-scale web scraping. However, it's not something I would recommend for intermediate or advanced operations. Google Sheets typically allows for a few dozen to a few hundred URL fetch function calls per hour.
Manual web scraping using Chrome DevTools
I use this tool in my daily work because it not only allows me to quickly collect the data I need, but also lets me analyze the page and prepare a set of selectors that I can later use to build web scrapers. It is worth familiarizing yourself with the Elements, Console and Network tabs, especially if you want to find out whether the page uses an API that could also be used for data acquisition.
[Figure: Web scraping in Chrome DevTools - a preview of the Network tab, showing requests made by the web browser and parsed server responses]
In the Console tab you can run JavaScript code that retrieves information much faster than copying individual pieces of data by hand or copying a whole block and cleaning it up manually - for example, a single document.querySelectorAll call can collect all elements matching a given CSS selector at once.
Custom Extraction in Screaming Frog
The popular SEO tool Screaming Frog allows you not only to obtain typical data from websites, such as headers, meta tags or links to graphics, but also to "extract" precisely specified resources from a page. This can be done in the Configuration > Custom > Extraction window using one of three methods:
XPath selectors - for example, the //meta[@name='viewport']/@content selector extracts the value of the content attribute of a meta element whose name attribute is "viewport".
CSS selectors - for example, the #content > h1 selector extracts the content of an H1 heading whose direct parent in the page structure is the element with the id "content". Using CSS selectors you can also extract, among other things, attribute values.
Regular expressions - useful for extracting information that cannot be reached with CSS or XPath selectors, for example data embedded in JS scripts placed in the page code. Using the regular expression ["'](GTM-.*?)["'] we can extract the Google Tag Manager container ID.
Examples of using this module are described on the tool vendor's website. Unfortunately, this very useful functionality is not available in the free version of the software. Screaming Frog works quite well as a web scraper, but for specific cases we often have to reach for solutions that do not limit us in any way - web scrapers written by programmers in various programming languages.