Building a Web Crawler for Scraping Index Files from the Common Crawl Server
In this article, we'll go into detail about how to build a crawling controller specifically designed to scrape index files from the Common Crawl server. We will examine the essential ideas and techniques needed to construct a robust web crawler whose purpose is to retrieve OpenAPI definitions from the Common Crawl index.
Let's begin by examining the current folder structure of the project:
```
├── CODE_OF_CONDUCT.md
├── CODEOWNERS
├── LICENSE
├── README.md
└── src
    └── server
        ├── api
        │   ├── constants
        │   │   └── Constants.js
        │   ├── controllers
        │   │   └── CrawlingController.js
        │   ├── drivers
        │   │   ├── CommonCrawlDriver.js
        │   │   ├── GithubCrawlerDriver.js
        │   │   └── GoogleBigQueryDriver.js
        │   ├── helpers
        │   ├── keywords.json
        │   ├── models
        │   ├── policies
        │   ├── services
        │   │   ├── DownloadService.js
        │   │   ├── ParserService.js
        │   │   └── ValidateService.js
        │   └── utils
        │       ├── CommonCrawlDriverUtil.js
        │       ├── processDirectoriesWithExponentialRetry.js
        │       └── SelectDataSources.js
        ├── app.js
        ├── config
        ├── data
        ├── package.json
        ├── package-lock.json
        └── README.md
```
We are planning to use three datasets for our OpenAPI web search project. I explained the project details in a previous blog post, but here's a brief introduction: our main goal is to search for validated OpenAPI definitions by crawling datasets from various sources. To achieve this, I need to start by building a crawling component, which is what I'm working on this week. At the moment, my focus is on the Common Crawl dataset. To crawl the datasets, I'm creating a driver for each one; these drivers handle fetching data from their respective sources. Currently, I'm working on a file called `CommonCrawlDriver.js`, which crawls the Common Crawl dataset. This file is the most important piece of the project in its current state.
Let's discuss the `CommonCrawlDriver.js` file, which serves as the driver for the Common Crawl dataset. This is where most of the code is currently being written, and it plays a crucial role at the moment. It has two functions that I want to talk about in detail.
The function `retrieveDirectoriesUrlsFromCCServer(CC_SERVER_URL, latest)` is used to extract URLs from the Common Crawl index server. These URLs provide access to files that store information about the paths of Common Crawl index files. By using these URLs and performing some additional processing, we can download the index files and obtain the data we need from the Common Crawl project.

What are Common Crawl index files, by the way?
Common Crawl index files are a collection of data that serve as an index or directory for the content stored in the Common Crawl dataset. Common Crawl is a project that periodically crawls the internet and makes the crawled data freely available to the public. The index files provide information about the content and metadata of web pages that have been crawled. They include details such as the URL of the webpage, its timestamp, the size of the page, and other relevant information. The index files help researchers, developers, and data scientists locate specific web pages or types of content within the vast Common Crawl dataset efficiently.
This function takes two input arguments. The first argument is `CC_SERVER_URL`, the URL https://index.commoncrawl.org/, which points to the Common Crawl index server. The second argument is `latest`, a flag that determines whether we want to retrieve the latest data or historical data from the server.

The function `retrieveDirectoriesUrlsFromCCServer(CC_SERVER_URL, latest)` returns an array of URLs. These URLs reference files hosted on the internet that contain the paths of Common Crawl index files. With this array of URLs, we can access and work with the specific index files within the Common Crawl dataset.
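To make this concrete, here is a minimal sketch of what such a function might look like. It assumes the index server's public `collinfo.json` listing and the standard `cc-index.paths.gz` layout on the data server, and uses `axios` for HTTP; the project's actual implementation may differ.

```javascript
const axios = require('axios');

async function retrieveDirectoriesUrlsFromCCServer(CC_SERVER_URL, latest) {
  // collinfo.json lists every crawl collection, newest first,
  // e.g. { id: 'CC-MAIN-2023-23', name: '...', 'cdx-api': '...' }
  const { data: collections } = await axios.get(`${CC_SERVER_URL}collinfo.json`);

  // Keep only the newest crawl when `latest` is set, otherwise keep them all.
  const selected = latest ? collections.slice(0, 1) : collections;

  // Each crawl publishes a gzipped file listing the paths of its index files.
  return selected.map(
    (c) => `https://data.commoncrawl.org/crawl-data/${c.id}/cc-index.paths.gz`
  );
}
```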
The function `processDirectoriesWithExponentialRetry(crawledDirectories)` takes one parameter, `crawledDirectories`, which holds the results obtained from the first function, `retrieveDirectoriesUrlsFromCCServer`. In this function, we iterate over that array of URLs, and within the loop we call another function, `retrieveIndexFilesUrlsFromDirs`, which processes the data fetched from each URL and transforms it into endpoints suitable for further processing.

In the `retrieveIndexFilesUrlsFromDirs` function, we perform the following steps (see the sketch after this list):

1. Download the file that contains the paths of the Common Crawl index files.
2. Unzip the downloaded file using the `zlib` library.
3. Append each extracted path to the end of the URL "https://data.commoncrawl.org".
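A rough sketch of these steps is shown below; the HTTP client and exact variable names are assumptions, but the download, gunzip, and prefix flow matches the list above.

```javascript
const axios = require('axios');
const zlib = require('zlib');

async function retrieveIndexFilesUrlsFromDirs(directoryUrl) {
  // 1. Download the gzipped file that lists the index-file paths.
  const response = await axios.get(directoryUrl, { responseType: 'arraybuffer' });

  // 2. Unzip it with zlib and split it into individual path lines, e.g.
  //    "cc-index/collections/CC-MAIN-2023-23/indexes/cdx-00000.gz".
  const paths = zlib
    .gunzipSync(Buffer.from(response.data))
    .toString('utf-8')
    .split('\n')
    .filter(Boolean);

  // 3. Append each path to the Common Crawl data server URL.
  return paths.map((p) => `https://data.commoncrawl.org/${p}`);
}
```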
The purpose of the `processDirectoriesWithExponentialRetry` function is to handle the retry logic in case of any errors or failures during the processing of the directories. While this function may seem simple, it plays a crucial role in ensuring that the Common Crawl index files are retrieved and processed successfully.
This function performs a series of asynchronous operations using Promises. It starts by declaring a constant called `backOffResults`. `await Promise.all()` is used to wait for all the Promises created from the `crawledDirectories` array to resolve or reject, and `crawledDirectories.map()` iterates over each element of the array to create a new array. Inside `map()`, an `async` function defines the asynchronous operation to be performed for each element of `crawledDirectories`.

Within this `async` function, `backOff()` is called with the callback `() => retrieveIndexFilesUrlsFromDirs(r)` as its first argument. The purpose of `backOff()` is to retry the provided callback using a backoff strategy if it encounters any errors. It is configured with an options object containing `timeMultiple`, `maxDelay`, `numOfAttempts`, `delayFirstAttempt`, `jitter`, and `retry`. These options control the behavior of the backoff strategy, including factors like the maximum delay between attempts and the number of attempts to make.

If an error occurs while executing the callback passed to `backOff()`, the `retry()` function is called with the error and the number of attempts made so far. In this code, `retry()` logs the error message and the attempt count to the console and returns `true`, indicating that the backoff strategy should keep retrying.

The result of `backOff()` is stored in the `innerResults` variable. The `resolvedInnerResults` variable is then assigned the result of `Promise.all(innerResults)`, which waits for all Promises in the `innerResults` array to resolve. If an error occurs during `Promise.all(innerResults)`, it is thrown; otherwise, `resolvedInnerResults` is returned from the `async` function.

The array of results from each iteration of `map()` is assigned to the `backOffResults` constant, which holds the final results of all the asynchronous operations performed on the elements of the `crawledDirectories` array.
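Putting the walkthrough together, a sketch of `processDirectoriesWithExponentialRetry` might look like this, assuming `backOff()` comes from the `exponential-backoff` npm package; the option values shown are placeholders rather than the project's real configuration.

```javascript
const { backOff } = require('exponential-backoff');
// retrieveIndexFilesUrlsFromDirs is the function sketched earlier.

async function processDirectoriesWithExponentialRetry(crawledDirectories) {
  const backOffResults = await Promise.all(
    crawledDirectories.map(async (r) => {
      // Retry retrieveIndexFilesUrlsFromDirs with an exponential backoff
      // strategy if it throws (e.g. on transient network errors).
      const innerResults = await backOff(() => retrieveIndexFilesUrlsFromDirs(r), {
        timeMultiple: 2,          // multiply the delay after each failed attempt
        maxDelay: 60000,          // never wait more than 60s between attempts
        numOfAttempts: 10,        // give up after 10 tries
        delayFirstAttempt: false, // run the first attempt immediately
        jitter: 'full',           // randomize delays to spread out retries
        retry: (error, attemptNumber) => {
          console.error(`Attempt ${attemptNumber} failed: ${error.message}`);
          return true;            // keep retrying until numOfAttempts is reached
        },
      });

      // Wait for anything still pending in the per-directory results.
      const resolvedInnerResults = await Promise.all(innerResults);
      return resolvedInnerResults;
    })
  );

  return backOffResults;
}
```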
The other files in the current state of the project are relatively simple and easy to understand: they contain straightforward code with clear functionality and purpose, and they don't involve the kind of intricate logic found in the driver, which makes them far more approachable.
Understanding the Data Flow in the Current State of the Project:
The user sends a POST request to the server, including the `dataSource` in the request body and the `latest` flag as a query parameter. The `dataSource` value tells the server which data source to use and can take one of three values: `commonCrawl`, `github`, or `googleBigQuery`. The `latest` flag, on the other hand, is a boolean with two possible values, true or false, indicating whether the user wants the latest data or historical data from the server.
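For illustration, such a request could be sent like this; the route path and port are placeholders, since the article doesn't specify the actual endpoint.

```javascript
const axios = require('axios');

// `latest` goes in the query string, `dataSource` in the JSON body.
axios
  .post('http://localhost:3000/crawl?latest=true', { dataSource: 'commonCrawl' })
  .then((res) => console.log(res.data)); // expected: an array of index-file URLs
```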
Once the request passes validation, it proceeds to the `selectDataSources(dataSource, latest)` function. This function determines the appropriate data source by evaluating the value of the `dataSource` variable inside a switch statement: each possible value of `dataSource` has its own case, which executes the corresponding logic and retrieves the data accordingly. In this way, the switch statement lets `selectDataSources` decide which data source to use based solely on the value of `dataSource`.
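A condensed sketch of that dispatch could look like the following; the function signatures and the `github`/`googleBigQuery` driver names are assumptions for illustration only.

```javascript
const CC_SERVER_URL = 'https://index.commoncrawl.org/';

async function selectDataSources(dataSource, latest) {
  switch (dataSource) {
    case 'commonCrawl':
      // Delegate to the Common Crawl driver described earlier; the exact
      // signature of retrieveDefinitionsFromCC is an assumption.
      return retrieveDefinitionsFromCC(CC_SERVER_URL, latest);
    case 'github':
      return retrieveDefinitionsFromGithub(latest); // hypothetical driver call
    case 'googleBigQuery':
      return retrieveDefinitionsFromBigQuery(latest); // hypothetical driver call
    default:
      throw new Error(`Unsupported data source: ${dataSource}`);
  }
}
```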
After the data source is selected, let's say we choose the Common Crawl dataset. We then use the `retrieveDefinitionsFromCC` function, which is the driver we discussed earlier in detail. Invoking it starts the process of fetching the definitions specific to the Common Crawl dataset, using the functionality and logic described above. `retrieveDefinitionsFromCC` ultimately returns the array of URLs representing Common Crawl index files to the `selectDataSources` function. From there, the array of URLs received from `retrieveDefinitionsFromCC` is passed on to the controller, which handles the remaining logic and returns the array to the user as the response.
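As a final illustration, an Express-style controller handing the result back to the user might look roughly like this; the handler name and response shape are assumptions, not the project's actual `CrawlingController.js`.

```javascript
module.exports = {
  async crawl(req, res) {
    try {
      const { dataSource } = req.body;             // 'commonCrawl' | 'github' | 'googleBigQuery'
      const latest = req.query.latest === 'true';  // query parameters arrive as strings

      // Delegate to the dispatcher sketched above and send its result back.
      const indexFileUrls = await selectDataSources(dataSource, latest);
      return res.status(200).json(indexFileUrls);
    } catch (error) {
      return res.status(500).json({ error: error.message });
    }
  },
};
```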
Conclusion:
When building the web crawler, the primary focus is on writing code that scrapes the Common Crawl website and extracts the URLs. These URLs are then processed and organized into an array that serves as a collection of endpoints leading to index files, which are crucial for further processing and analysis. The length of the resulting array depends on whether we fetch the latest data or historical data: the latest crawl yields a list of around 300 items, while historical data expands the list to approximately 30,000 items. With the array of URLs in hand, we can proceed to the next steps of the web crawling process, which involve downloading, parsing, and validating the content accessible through these endpoints, so that we can access and work with the data contained within the Common Crawl index files.