Sequentum Cloud Activity

Most websites nowadays use web frameworks that separate layout from data, with JavaScript generating the final web page displayed in the web browser. Such a website often loads some layout and JavaScript first, then loads the data asynchronously and inserts it into the layout.

Simple web scraping tools that don’t execute JavaScript cannot extract data from these websites at all. Even advanced tools will have difficulties with many of these websites unless the web scraping bots are carefully designed.

Advanced web scraping tools can use embedded web browsers to load websites and execute JavaScript, which handles most of these websites. However, web browsers are very slow and are known to crash occasionally, so they should be avoided whenever possible. Furthermore, many websites load data asynchronously, sometimes only when the user scrolls down the page. High-end web scraping tools can deal with these scenarios, but it can be very difficult to create reliable bots for such websites, and those bots are certain to be very slow.

The solution lies in the asynchronous calls modern websites make to load the data. The web server functionality that provides the data is often called a Web API, so the asynchronous calls are often referred to as Web API requests. A Web API normally returns structured data in JSON format, which is very easy to work with, and Web API requests are very fast compared to loading a full web page. Sequentum Cloud's Activity feature lets users monitor API calls and other asynchronous network requests to the server, which helps in developing more robust agents.
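To illustrate why a JSON Web API response is easier to work with than a rendered page, here is a minimal Python sketch. The endpoint URL, the field names, and the sample body are all hypothetical and stand in for whatever a real site returns. It is not part of Sequentum Cloud.

```python
import json
from urllib.request import Request

# Hypothetical endpoint, used only for illustration.
API_URL = "https://example.com/api/products?page=1"

# Hypothetical response body, shaped like a typical JSON Web API reply.
SAMPLE_BODY = '{"items": [{"name": "Widget", "price": 9.5, "sku": "W-1"}]}'


def build_api_request(url: str) -> Request:
    """Build (but do not send) a GET request that asks for JSON."""
    return Request(url, headers={"Accept": "application/json"})


def parse_products(body: str) -> list[dict]:
    """Pull the name and price of each item out of a JSON response body."""
    data = json.loads(body)
    return [{"name": item["name"], "price": item["price"]} for item in data["items"]]


if __name__ == "__main__":
    request = build_api_request(API_URL)
    print(request.full_url)
    print(parse_products(SAMPLE_BODY))
```

Passing the request object to `urllib.request.urlopen` would send it. The sketch stops short of that so it runs without network access.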

Sequentum Cloud's Activity feature provides real-time insights into network activity and backend processes, allowing users to monitor and analyze client-server interactions for better understanding during web scraping or data extraction.

The Activity tab in Sequentum Cloud, located in the bottom pane next to the 'Schema' tab, displays columns such as 'URL', 'Content', and 'Timing (ms)'. 'URL' shows the requested URL, 'Content' shows the data fetched, and 'Timing (ms)' shows how long the request took to complete, in milliseconds.

The Activity tab also lets you build cURL requests, which can be used in agents to call URLs or capture responses for later use. There are two ways to do this.

One way is to right-click the desired request URL and select either Copy URL or Copy cURL Request, as shown in the following snapshot.

image-20241212-081549.png

The other way is to use the CURL button, located in the Request tab at the top-right corner of the Activity window's right pane, as shown in the snapshot.

Clicking the ‘CURL’ button generates the request URL in CURL format, ready for use in the agent. Additionally, the ‘Plain’ option allows converting the CURL format back into plain text.
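For readers who want to reuse a copied cURL request outside the agent, the following Python sketch splits a simple single-line cURL command into its parts. The exact text Sequentum Cloud copies is not shown in this article, so the sample command is an assumption. The parser handles only a few common cURL flags (`-H`, `-X`, `-d`, `--data`, `--data-raw`).

```python
import shlex


def parse_curl(command: str) -> dict:
    """Split a simple single-line cURL command into method, URL, headers, and body."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "curl":
        raise ValueError("not a cURL command")

    result = {"method": None, "url": None, "headers": {}, "data": None}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            result["headers"][name.strip()] = value.strip()
            i += 2
        elif tok in ("-X", "--request"):
            result["method"] = tokens[i + 1].upper()
            i += 2
        elif tok in ("-d", "--data", "--data-raw"):
            result["data"] = tokens[i + 1]
            i += 2
        elif tok.startswith("-"):
            # Assumption: any other flag takes no value and can be skipped.
            i += 1
        else:
            result["url"] = tok
            i += 1

    # cURL defaults to GET, or POST when a body is supplied.
    if result["method"] is None:
        result["method"] = "POST" if result["data"] is not None else "GET"
    return result


# Hypothetical command, standing in for a request copied from the Activity tab.
SAMPLE_CURL = (
    "curl 'https://example.com/api/items?page=2' "
    "-H 'Accept: application/json' "
    "-H 'X-Requested-With: XMLHttpRequest'"
)

if __name__ == "__main__":
    print(parse_curl(SAMPLE_CURL))
```

Using `shlex.split` keeps quoted header values intact, which a plain split on spaces would break apart.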

We can also visualize the content received from a request in the Activity tab by selecting the Visualize button, located in the Content tab on the top-right corner of the Activity window’s right pane, as shown in the following snapshot:

A simpler way is to right-click the request URL and select View JSON or View Raw JSON.

image-20241212-082240.png

Note that the options depend on the request URL: they may instead be View Raw HTML or View HTML Page, or View Content when the request URL contains .js content.

The snapshot below shows the structure of the content returned by the request, which is visible even before the request is executed in the agent. This helps determine whether the requested URL returns the necessary content and how feasible it is to extract it.
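The same kind of structural check can be done on a copied response body. The Python sketch below walks a parsed JSON value and lists each path with its type. The sample body and the `describe_structure` helper are hypothetical illustrations, not a Sequentum Cloud feature.

```python
import json


def describe_structure(value, path="$"):
    """Return 'path: type' lines describing a parsed JSON value."""
    lines = []
    if isinstance(value, dict):
        lines.append(f"{path}: object")
        for key, child in value.items():
            lines.extend(describe_structure(child, f"{path}.{key}"))
    elif isinstance(value, list):
        lines.append(f"{path}: array[{len(value)}]")
        if value:
            # Assumption: list items share one shape, so only the first is inspected.
            lines.extend(describe_structure(value[0], f"{path}[0]"))
    else:
        lines.append(f"{path}: {type(value).__name__}")
    return lines


# Hypothetical response body copied from a request's content view.
SAMPLE_JSON = '{"total": 2, "items": [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]}'

if __name__ == "__main__":
    for line in describe_structure(json.loads(SAMPLE_JSON)):
        print(line)
```

A listing like this makes it quick to confirm that the fields an agent needs are present in the response.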

We can also filter requests or view specific request types using the Show options at the top of the Activity tab.
By default, the filter shows Frames, XHR, and Scripts.


Below are the definitions of each option:

  • All: Displays all requests sent and received between the client and server.

  • Frames: Shows requests originating from frames within the main page.

  • XHR: Displays asynchronous requests, which often return data in JSON format. This is especially useful when a website provides an API, allowing for quick identification of relevant requests by selecting this option.

  • Scripts: Shows script requests, which may contain embedded JSON that can be leveraged to optimize the agent.

  • External: Displays requests made to domains outside the main domain. These often come from social media platforms, ad servers, or analytics plugins.

  • Loading: Shows URLs that are currently in the loading state.

  • Any Content: Filters to show only requests that have response content.

  • JSON Content: Displays only the requests that return JSON response content.
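As a rough illustration of how some of these filters behave, the Python sketch below applies comparable rules to a list of captured requests. The `CapturedRequest` record, its field names, and the matching rules are assumptions made for this example. How Sequentum Cloud classifies requests internally is not described here.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class CapturedRequest:
    url: str
    resource_type: str  # hypothetical labels: "xhr", "script", "frame"
    content_type: str   # response Content-Type, "" if none
    body: str           # response body, "" if none or still loading


def is_external(req: CapturedRequest, main_host: str) -> bool:
    """True when the request goes to a host outside the main domain."""
    host = urlsplit(req.url).hostname or ""
    return not (host == main_host or host.endswith("." + main_host))


def is_json_content(req: CapturedRequest) -> bool:
    """True when the response has a body and a JSON content type."""
    return bool(req.body) and "json" in req.content_type.lower()


# Each filter takes a request and the main host and returns True to keep it.
FILTERS = {
    "All": lambda r, host: True,
    "XHR": lambda r, host: r.resource_type == "xhr",
    "Scripts": lambda r, host: r.resource_type == "script",
    "External": is_external,
    "Any Content": lambda r, host: bool(r.body),
    "JSON Content": lambda r, host: is_json_content(r),
}


def apply_filter(requests, name, main_host):
    """Return the requests that pass the named filter."""
    check = FILTERS[name]
    return [r for r in requests if check(r, main_host)]


# Hypothetical captured traffic for a page on shop.example.com.
SAMPLE_REQUESTS = [
    CapturedRequest("https://shop.example.com/api/items", "xhr", "application/json", '{"items": []}'),
    CapturedRequest("https://shop.example.com/app.js", "script", "text/javascript", "console.log(1)"),
    CapturedRequest("https://ads.example.net/pixel", "xhr", "", ""),
]

if __name__ == "__main__":
    for req in apply_filter(SAMPLE_REQUESTS, "JSON Content", "shop.example.com"):
        print(req.url)
```

In this sample, the JSON Content rule narrows three requests to the single API call, which is the request most worth inspecting when building an agent.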
