Building a Web Crawler for Scraping Index Files from Common Crawl Server.

In this article, we'll go into detail about how to build a crawling controller designed specifically to scrape index files from the Common Crawl server. We will examine the essential ideas and techniques needed to construct a robust web crawler whose job is to retrieve OpenAPI definitions from the Common Crawl index.

Let's begin by examining the current folder structure of the Project:

├── CODE_OF_CONDUCT.md
├── CODEOWNERS
├── LICENSE
├── README.md
└── src
    └── server
        ├── api
        │   ├── constants
        │   │   └── Constants.js
        │   ├── controllers
        │   │   └── CrawlingController.js
        │   ├── drivers
        │   │   ├── CommonCrawlDriver.js
        │   │   ├── GithubCrawlerDriver.js
        │   │   └── GoogleBigQueryDriver.js
        │   ├── helpers
        │   ├── keywords.json
        │   ├── models
        │   ├── policies
        │   ├── services
        │   │   ├── DownloadService.js
        │   │   ├── ParserService.js
        │   │   └── ValidateService.js
        │   └── utils
        │       ├── CommonCrawlDriverUtil.js
        │       ├── processDirectoriesWithExponentialRetry.js
        │       └── SelectDataSources.js
        ├── app.js
        ├── config
        ├── data
        ├── package.json
        ├── package-lock.json
        └── README.md

We are planning to use three datasets for our OpenAPI web search project. I explained the project details in a previous blog post, but here is a quick introduction: our main goal is to search for validated OpenAPI definitions by crawling datasets from various sources. To achieve this, I need to start by building a crawling component, which is what I'm working on this week. At the moment, my focus is on the Common Crawl dataset. To crawl the datasets, I'm creating a driver for each one; these drivers fetch data from their respective datasets. Currently, I'm working on a file called CommonCrawlDriver.js, which crawls the Common Crawl dataset. This file is the most important piece of the project in its current state.

Let's discuss the CommonCrawlDriver.js file, which serves as a driver for the Common Crawl dataset:

This file is where most of the code is currently being written, and it plays a crucial role at the moment. It has two functions that I want to talk about in detail.

  1. The function retrieveDirectoriesUrlsFromCCServer(CC_SERVER_URL, latest) is used to extract URLs from the Common Crawl index server. These URLs provide access to files that store information about the paths of Common Crawl index files. By utilizing these URLs and performing some additional processing, we can download the index files. This allows us to obtain the necessary information and data from the Common Crawl project.

    What are Common Crawl index files, by the way?

    Common Crawl index files are a collection of data that serve as an index or directory for the content stored in the Common Crawl dataset. Common Crawl is a project that periodically crawls the internet and makes the crawled data freely available to the public. The index files provide information about the content and metadata of web pages that have been crawled. They include details such as the URL of the webpage, its timestamp, the size of the page, and other relevant information. The index files help researchers, developers, and data scientists locate specific web pages or types of content within the vast Common Crawl dataset efficiently.

    This function has two input arguments. The first argument is CC_SERVER_URL, the URL of the Common Crawl index server, https://index.commoncrawl.org/. The second argument is latest, a flag that determines whether we want to retrieve the latest data or historical data from the server. By setting the latest flag accordingly, we can fetch the desired type of data from the Common Crawl index server.

    The function retrieveDirectoriesUrlsFromCCServer(CC_SERVER_URL, latest) returns an array of URLs. These URLs point to files hosted on the internet that contain information about the paths of Common Crawl index files. By retrieving this array of URLs, we can access and work with the specific index files within the Common Crawl dataset. (A sketch of this function appears right after this list.)

  2. The function processDirectoriesWithExponentialRetry(crawledDirectories) takes one parameter called crawledDirectories, which represents the results obtained from the first function, retrieveDirectoriesUrlsFromCCServer.

    In this function, we iterate over the array of URLs obtained from retrieveDirectoriesUrlsFromCCServer. Within the loop, we call another function called retrieveIndexFilesUrlsFromDirs. This function processes the fetched data from the URLs and transforms it into appropriate endpoints for further processing.

    In the retrieveIndexFilesUrlsFromDirs function, we perform the following steps (both helpers are sketched right after this list):

    1. Download the file that contains the paths of the Common Crawl index files.

    2. After downloading the file, we use the zlib library to decompress (gunzip) it.

    3. Once the file is unzipped, we append the extracted paths to the end of the URL "https://data.commoncrawl.org".
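
To make the first helper more concrete, here is a minimal sketch of how retrieveDirectoriesUrlsFromCCServer could be implemented. This is not the project's actual code: it assumes the axios HTTP client, assumes the crawl list is read from the index server's collinfo.json listing, and assumes each crawl's paths file follows the crawl-data/<crawl id>/cc-index.paths.gz convention on data.commoncrawl.org.

// Sketch of retrieveDirectoriesUrlsFromCCServer -- illustrative, not the project's actual code.
const axios = require('axios');

async function retrieveDirectoriesUrlsFromCCServer(CC_SERVER_URL, latest) {
  // collinfo.json lists every crawl the index server knows about
  // (id, name, cdx-api), typically with the most recent crawl first.
  const { data: crawls } = await axios.get(`${CC_SERVER_URL}collinfo.json`);

  // Keep only the newest crawl, or the full history, depending on the flag.
  const selectedCrawls = latest ? crawls.slice(0, 1) : crawls;

  // One "paths file" URL per crawl, e.g.
  // https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/cc-index.paths.gz
  return selectedCrawls.map(
    (crawl) => `https://data.commoncrawl.org/crawl-data/${crawl.id}/cc-index.paths.gz`
  );
}

module.exports = { retrieveDirectoriesUrlsFromCCServer };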
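
Along the same lines, a sketch of retrieveIndexFilesUrlsFromDirs that follows the three steps above might look like this, assuming axios for the download and Node's built-in zlib for decompression; again, the names and details are illustrative rather than the project's actual code.

// Sketch of retrieveIndexFilesUrlsFromDirs -- illustrative only.
const axios = require('axios');
const zlib = require('zlib');

const CC_DATA_URL = 'https://data.commoncrawl.org';

async function retrieveIndexFilesUrlsFromDirs(pathsFileUrl) {
  // 1. Download the gzipped file that contains the paths of the index files.
  const response = await axios.get(pathsFileUrl, { responseType: 'arraybuffer' });

  // 2. Decompress it with zlib; the result is plain text with one path per line, e.g.
  //    cc-index/collections/CC-MAIN-2023-23/indexes/cdx-00000.gz
  const pathsText = zlib.gunzipSync(Buffer.from(response.data)).toString('utf-8');

  // 3. Append each extracted path to https://data.commoncrawl.org to get a
  //    downloadable URL for the index file.
  return pathsText
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((path) => `${CC_DATA_URL}/${path}`);
}

module.exports = { retrieveIndexFilesUrlsFromDirs };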

The purpose of the processDirectoriesWithExponentialRetry function is to handle the retry logic in case of any errors or failures that may occur while processing the directories. While this function may seem simple, it plays a crucial role in ensuring the successful retrieval and processing of the Common Crawl index files.

This function performs a series of asynchronous operations using Promises. It starts by declaring a constant variable called backOffResults.

await Promise.all() is used to wait for all the Promises produced from the crawledDirectories array to resolve (or for the first of them to reject). crawledDirectories.map() iterates over each element in the crawledDirectories array and creates that new array of Promises.

Inside the map() function, an async function is defined that represents an asynchronous operation to be performed for each element in crawledDirectories.

Within this async function, the backOff() function is called with a callback function () => retrieveIndexFilesUrlsFromDirs(r) as its first argument. The purpose of the backOff() function is to retry the provided callback function using a backoff strategy if it encounters any errors.

The backOff() function is configured using an object with various options such as timeMultiple, maxDelay, numOfAttempts, delayFirstAttempt, jitter, and retry. These options control the behavior of the backoff strategy, including factors like the maximum delay between attempts and the number of attempts to make.

If an error occurs during the execution of the callback function passed to backOff(), the retry() function is called with the error and the number of attempts made so far. In this code, the retry() function logs the error message and the number of attempts to the console and returns true, indicating that the backoff strategy should continue retrying.

The result of the backOff() function is stored in the innerResults variable. Then, the resolvedInnerResults variable is assigned the result of Promise.all(innerResults), which waits for all Promises in the innerResults array to resolve.

If an error occurs during the execution of Promise.all(innerResults), it is thrown. Finally, the resolvedInnerResults are returned from the async function.

The array of results from each iteration of the map() function is assigned to the backOffResults constant variable, which holds the final results of all the asynchronous operations performed on the elements of the crawledDirectories array.
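
Putting the description above into code, processDirectoriesWithExponentialRetry could look roughly like the sketch below. It assumes the exponential-backoff npm package, which provides backOff() along with the timeMultiple, maxDelay, numOfAttempts, delayFirstAttempt, jitter, and retry options mentioned above; the concrete option values and the require path for the helper are only illustrative.

// Sketch of processDirectoriesWithExponentialRetry.js -- illustrative only.
const { backOff } = require('exponential-backoff');
// The helper sketched earlier; require it from wherever it actually lives in the project.
const { retrieveIndexFilesUrlsFromDirs } = require('./CommonCrawlDriverUtil');

async function processDirectoriesWithExponentialRetry(crawledDirectories) {
  const backOffResults = await Promise.all(
    crawledDirectories.map(async (r) => {
      // Retry retrieveIndexFilesUrlsFromDirs with an exponential backoff strategy.
      const innerResults = await backOff(() => retrieveIndexFilesUrlsFromDirs(r), {
        timeMultiple: 2,          // double the delay after every failed attempt
        maxDelay: 60 * 1000,      // cap the delay between attempts at one minute
        numOfAttempts: 10,        // give up after ten attempts
        delayFirstAttempt: false, // run the first attempt immediately
        jitter: 'full',           // randomize delays so retries don't hammer the server
        retry: (error, attemptNumber) => {
          console.error(`Attempt ${attemptNumber} failed: ${error.message}`);
          return true;            // keep retrying until numOfAttempts is exhausted
        },
      });

      // If the helper produces an array of Promises, wait for all of them here.
      const resolvedInnerResults = await Promise.all(innerResults);
      return resolvedInnerResults;
    })
  );

  return backOffResults;
}

module.exports = { processDirectoriesWithExponentialRetry };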

The other files in the current state of the project are relatively simple and easy to understand: they have clear, self-contained responsibilities and don't involve the more intricate logic found in the driver, which makes them far more approachable than the code discussed above.

Understanding the Data Flow in the Current State of the Project:

  1. The user sends a POST request to the server, including the dataSource in the request body and the latest flag as a query parameter.

    The dataSource is a variable that informs the server about the desired data source. It can take one of three values: commonCrawl, github, or googleBigQuery. Each value represents a specific type of data source.

    The latest flag, on the other hand, is a boolean with two possible values: true or false. It indicates whether the user wants to retrieve the latest data or historical data from the server. (A sketch of this request handling appears right after this list.)

  2. Once the request undergoes validation, it proceeds to the selectDataSources(dataSource, latest) function. This function determines the appropriate data source to use by checking the value of the dataSource variable within a switch statement.

    The selectDataSources function evaluates the value of dataSource and performs different actions based on the specified data source. It uses a switch statement to handle each possible value of dataSource and execute the corresponding logic or retrieve data accordingly.

    By utilizing the switch statement, the selectDataSources function effectively decides which data source to use based on the value of the dataSource variable. (This switch is also sketched right after this list.)

  3. After the data source is selected, let's say we choose the Common Crawl dataset. We use the retrieveDefinitionsFromCC function, which serves as the driver we discussed earlier in detail. This function is responsible for retrieving the necessary definitions or information related to the Common Crawl dataset.

    By invoking the retrieveDefinitionsFromCC function, we initiate the process of fetching the required definitions specific to the Common Crawl dataset. This function utilizes the functionalities and logic we previously discussed to obtain the relevant data and perform any additional processing required.

  4. The retrieveDefinitionsFromCC function, as mentioned earlier, is a driver function that retrieves the array of URLs representing Common Crawl index files. After obtaining these URLs, the function returns them to the selectDataSources function.

    In the selectDataSources function, the array of URLs received from retrieveDefinitionsFromCC is passed on to the controller. The controller, which handles the logic and processing of the data, takes this array of URLs as input and returns it to the user as a response.
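
To tie the flow together, here is a hedged, Express-style sketch of what the controller entry point might look like. The route shape, status codes, and error handling are assumptions made for illustration; the real CrawlingController.js may be wired quite differently.

// Sketch of CrawlingController.js -- illustrative only.
const { selectDataSources } = require('../utils/SelectDataSources');

async function crawl(req, res) {
  // dataSource arrives in the request body: commonCrawl, github, or googleBigQuery.
  const { dataSource } = req.body;

  // latest arrives as a query parameter: ?latest=true or ?latest=false.
  const latest = req.query.latest === 'true';

  try {
    // The selected driver returns the array of index file URLs.
    const indexFileUrls = await selectDataSources(dataSource, latest);
    return res.status(200).json(indexFileUrls);
  } catch (error) {
    return res.status(500).json({ error: error.message });
  }
}

module.exports = { crawl };

A request against such a handler might look like curl -X POST 'http://localhost:3000/crawl?latest=true' -H 'Content-Type: application/json' -d '{"dataSource": "commonCrawl"}', where the port and route are, again, placeholders.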
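
A minimal sketch of the selectDataSources switch from step 2 could look like the following. The github and googleBigQuery branches are left as stubs, since their drivers are outside the scope of this post, and the exact signature of retrieveDefinitionsFromCC is an assumption.

// Sketch of SelectDataSources.js -- illustrative only.
const { retrieveDefinitionsFromCC } = require('../drivers/CommonCrawlDriver');

async function selectDataSources(dataSource, latest) {
  switch (dataSource) {
    case 'commonCrawl':
      // The Common Crawl driver returns the array of index file URLs.
      return retrieveDefinitionsFromCC(latest);
    case 'github':
      throw new Error('GitHub driver is not covered in this post');
    case 'googleBigQuery':
      throw new Error('Google BigQuery driver is not covered in this post');
    default:
      throw new Error(`Unknown data source: ${dataSource}`);
  }
}

module.exports = { selectDataSources };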

Conclusion:

When building the web crawler, your primary focus is on writing code to scrape the Common Crawl website and extract the URLs. These URLs are then processed and organized into an array, which serves as a collection of endpoints leading to index files. These index files are crucial for further processing and analysis.

The resulting array of URLs can vary in length depending on whether you fetch the latest data or historical data. If you fetch the latest data, the list may contain around 300 items. However, if you retrieve historical data, the list can expand to approximately 30,000 items.

With the array of URLs in hand, you can proceed to the next steps of the web crawling process, which involve downloading, parsing, and validating the content accessible through these endpoints. This allows you to access and work with the desired data contained within the Common Crawl index files.