Web crawling is a fundamental technique used by search engines and various data aggregation platforms to retrieve, index, and analyze information from the vast expanse of the World Wide Web.
Web crawlers offer several advantages. They collect data at scale by systematically exploring diverse internet sources, including text, images, videos, and documents, which yields a comprehensive dataset. They keep search engine indexes current with the latest online content, so search results stay timely, and they handle large data volumes efficiently. Through structured indexing, they organize web content for easy searchability, forming the backbone of search engines and improving search and discovery for users. Web crawling also enables data analysis, insight generation, and customized content delivery, and it supports research and monitoring activities such as market analysis, competitor tracking, and trend monitoring.
Overall, web crawling plays a crucial role in gathering, indexing, and analyzing the vast amount of information available on the web, empowering users with access to relevant and timely content and driving insights and innovation across various domains.
At Linknovate, we use web crawling as a core tool for finding and retrieving web documents about products or patents, enabling businesses, researchers, and innovators to access valuable information, track industry trends, and make informed decisions. Our main goal in this project was to improve our crawler’s Harvest Rate (HR), defined as the ratio of positive pages to the total number of crawled pages. Here, a positive page is one that contains information about products and/or patents. With this in mind, the rest of this post explains how we improved the HR with slight changes to the crawling strategy.
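As a minimal sketch of the HR definition above, the ratio can be computed as follows. The function name and the numbers in the usage line are illustrative assumptions, not values from our experiment.

```python
def harvest_rate(positive_pages: int, crawled_pages: int) -> float:
    """Return the share of crawled pages that are positive (product/patent related)."""
    if crawled_pages <= 0:
        raise ValueError("crawled_pages must be greater than zero")
    if not 0 <= positive_pages <= crawled_pages:
        raise ValueError("positive_pages must be between 0 and crawled_pages")
    return positive_pages / crawled_pages


# Illustrative numbers only: 30 positive pages out of 200 crawled pages.
example_hr = harvest_rate(30, 200)
print(f"HR = {example_hr:.2%}")
```

An HR improvement "in points" then refers to the difference between two such percentages.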
Build Datasets
From the total number of organizations indexed in Linknovate, we selected a subset of those organizations that met the following conditions:
- The organization has a URL.
- The URL is correct (well-formed, does not redirect to another site, etc.).
- The organization has at least one document in LKN that talks about products (recent releases, product pages, patents, etc.).
- The document must be in English.
The selected URLs were divided into three splits (training, testing, and validation) in order to carry out experimental tests and draw conclusions.
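The selection and splitting steps could be sketched roughly as below. This is only an illustration: the record fields (`url`, `english_product_docs`), the 80/10/10 fractions, and the fixed seed are assumptions made for the example, and the redirect check is left out because it would require a network request.

```python
import random
from urllib.parse import urlparse


def is_well_formed(url: str) -> bool:
    """Syntactic check only; detecting redirects would need an HTTP request."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def select(orgs: list) -> list:
    """Keep organizations that meet the selection conditions (hypothetical schema)."""
    return [
        org for org in orgs
        if org.get("url")
        and is_well_formed(org["url"])
        # Assumed field: number of English documents about products.
        and org.get("english_product_docs", 0) >= 1
    ]


def split(items: list, fractions=(0.8, 0.1, 0.1), seed: int = 42) -> dict:
    """Shuffle deterministically, then cut into training/testing/validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_test = int(len(shuffled) * fractions[1])
    return {
        "training": shuffled[:n_train],
        "testing": shuffled[n_train:n_train + n_test],
        "validation": shuffled[n_train + n_test:],  # remainder
    }
```

Shuffling with a fixed seed keeps the splits reproducible between runs, which helps when comparing crawling strategies on the same data.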
Extract statistics from data
Using the data selected above, we crawled the different websites (bear in mind that not every page in a website is related to products or patents). The statistics we studied are:
- Total number of websites per data split.
- Total URLs per website.
- Mean URLs per website.
- Total product-related URLs per website.
- Mean product-related URLs per website.
- Mean depth of product-related URLs.
- Standard deviation of the depth of product-related URLs.
The depth of a page is the number of “jumps” (clicks) that a user or crawler needs to perform from the seed URL to reach that page. We are interested in better identifying which pages talk about product launches or patents, which are good indicators of innovation. With these statistics, we studied how to adjust the crawling strategy so that a higher percentage of the retrieved web documents are positive (i.e., related to products).
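To make these statistics concrete, here is a small sketch of how they could be computed from crawl records. The record schema (`site`, `depth`, `is_product`) is a hypothetical one chosen for the example, and the population standard deviation is an assumption, since the sample version would be equally reasonable.

```python
from collections import defaultdict
from statistics import mean, pstdev


def crawl_statistics(pages: list) -> dict:
    """Summarize crawled pages; each page is a dict with site, depth, is_product."""
    urls_per_site = defaultdict(int)
    product_urls_per_site = defaultdict(int)
    product_depths = []

    for page in pages:
        urls_per_site[page["site"]] += 1
        if page["is_product"]:
            product_urls_per_site[page["site"]] += 1
            product_depths.append(page["depth"])

    n_sites = len(urls_per_site)
    return {
        "websites": n_sites,
        "urls_per_website": dict(urls_per_site),
        "mean_urls_per_website": mean(urls_per_site.values()) if n_sites else None,
        "product_urls_per_website": dict(product_urls_per_site),
        # Sites without product pages count as zero in this mean.
        "mean_product_urls_per_website": (
            sum(product_urls_per_site.values()) / n_sites if n_sites else None
        ),
        "mean_product_depth": mean(product_depths) if product_depths else None,
        "std_product_depth": pstdev(product_depths) if product_depths else None,
    }
```

The mean and standard deviation of the depth together indicate how far from the seed URL a crawler typically has to go before product-related pages become rare.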
Results
The work consisted of adapting the crawling strategy to the average depth at which product-related pages are usually found. In the future we will continue to iterate on the crawler, based on this or on new criteria that emerge from experimentation.
Linknovate’s main crawling framework is Scrapy, an open source and collaborative framework for extracting data from websites. For this experiment, we replaced Scrapy’s default crawl order, which is depth-first, with the classic breadth-first search (BFS) algorithm.
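For readers who want to try something similar, the Scrapy documentation describes switching a project to breadth-first order through the settings below. This is a sketch of that documented approach, not necessarily the exact configuration used in our experiment; the commented-out depth cap and its value are purely hypothetical.

```python
# settings.py (fragment): crawl in breadth-first order instead of Scrapy's
# default depth-first order.

# Positive values lower the priority of deeper requests, so shallow pages go first.
DEPTH_PRIORITY = 1

# Use FIFO queues for pending requests (the defaults are LIFO).
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

# Hypothetical: cap the crawl depth, for example near the depth where
# product-related pages stop appearing. The value 3 is a placeholder.
# DEPTH_LIMIT = 3
```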
The advantages of this approach are as follows:
- Systematic Coverage: BFS systematically explores the web starting from a seed URL and gradually expands outward to neighboring pages in a level-by-level fashion. This systematic approach ensures comprehensive coverage of the web, allowing crawlers to discover a wide range of product-related web pages efficiently.
- Faster Discovery of Product Pages: BFS prioritizes the exploration of pages closest to the seed URL before delving deeper into the web graph. This strategy increases the likelihood of encountering product-related pages early in the crawling process, enabling crawlers to find and retrieve product pages sooner compared to other crawling strategies.
- Early Identification of Product Categories: BFS enables crawlers to identify product categories and subcategories early in the crawling process by exploring links from the seed URL in a breadth-first manner. This early identification of product categories allows crawlers to focus their efforts on relevant areas of the web, leading to faster discovery of product-related content.
- Optimized Resource Utilization: BFS avoids deep exploration of individual branches of the web graph until the immediate neighborhood has been thoroughly explored. By prioritizing breadth over depth, BFS optimizes resource utilization and reduces the likelihood of getting stuck in less relevant areas of the web, thus maximizing the efficiency of the crawling process.
- Improved User Experience: By finding product-related web pages sooner, BFS enhances the user experience for consumers searching for products online. Search engines and e-commerce platforms can deliver more relevant and timely search results, leading to improved satisfaction and engagement among users.
- Enhanced SEO Performance: BFS can improve the search engine optimization (SEO) performance of product-related web pages by ensuring their timely discovery and indexing. Websites selling products can benefit from increased visibility in search engine results pages, driving organic traffic and potential conversions.
- Scalability: BFS is scalable and can be adapted to crawl large volumes of web pages efficiently. Whether crawling a specific website or the entire internet, BFS can handle the increasing complexity and scale of web crawling tasks, making it suitable for both small-scale and large-scale crawling operations.
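The level-by-level behavior described in the list above can be illustrated with a toy, framework-free sketch. The link graph, the page budget, and the rule that marks a page as positive are invented for the example; a real crawler would fetch pages and classify their content instead.

```python
from collections import deque


def bfs_crawl(seed: str, links: dict, is_positive, max_pages: int) -> list:
    """Visit pages level by level from the seed; return (url, depth, positive) tuples."""
    queue = deque([(seed, 0)])
    seen = {seed}
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()  # FIFO: shallowest pending page first
        visited.append((url, depth, is_positive(url)))
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Toy site: product pages sit at depth 1 and 2, the blog branch goes deeper.
toy_links = {
    "/": ["/products", "/blog"],
    "/products": ["/products/a", "/products/b"],
    "/blog": ["/blog/1", "/blog/2"],
    "/blog/1": ["/blog/1/comments"],
}

crawl = bfs_crawl("/", toy_links, lambda u: u.startswith("/products"), max_pages=5)
for url, depth, positive in crawl:
    print(depth, url, "positive" if positive else "-")
```

With a budget of five pages, this toy crawl reaches all three product pages before descending into the deeper blog branch, which is the effect a breadth-first order is meant to have when positive pages cluster near the seed URL.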
The obtained results reflect an improvement of 2 points in the HR, with a minimal algorithmic cost.
In summary, by refining our web crawling strategy with the breadth-first search (BFS) algorithm, we’ve increased the Harvest Rate (HR) for product and patent-related pages. This approach offers systematic coverage, faster discovery of relevant content, and optimized resource utilization. With a 2-point improvement in HR at minimal cost, our efforts underscore the importance of innovative crawling techniques in enhancing data accessibility and driving informed decision-making.
This study was carried out under the project “Intelligent text mining methods for radical search improvement in Technology Watch” (2021/C005/00150574), funded by the Ministry of Science (MCIN/AEI/10.13039/501100011033) and the European Union (NextGenerationEU/PRTR).