
Parse All Content in Sequentum Cloud

The Parse All Content feature in Sequentum Cloud plays a crucial role in ensuring the accuracy and completeness of data extraction from web pages, especially those with complex or dynamically loaded content. Sometimes, when a URL is loaded in the browser, certain frames or elements on the web page may not fully load due to high response times, network delays, or the presence of dynamic content that takes longer to render. This can result in incomplete or missing data during the scraping process.

The Parse All Content option addresses this issue by allowing the agent to reload and reprocess all the frames and elements on the page, ensuring that the entire content is properly captured. Here's a more detailed explanation of how this functionality works:

  1. Handling Slow or Incomplete Page Loads: When web pages take longer to load due to network latency or server-side processing, some elements, such as images, scripts, or embedded frames, may not fully appear in the initial page load. The Parse All Content command forces the browser to reload all the content, ensuring that every element, including dynamically generated content, is fully rendered before data extraction begins. This is particularly important for pages that rely heavily on JavaScript to display content after the initial load.

  2. Reparsing Dynamic Frames: Many modern websites utilize frames or iframes to load additional content from external sources. These frames may not load correctly during the first pass, causing incomplete data capture. By using the Parse All Content feature, the agent refreshes these frames, ensuring that all the content is fully loaded and available for scraping. This is crucial for extracting data from web pages with multiple layers of embedded content, such as advertisements, maps, or video players.

  3. Improving Data Accuracy: In scenarios where incomplete page loads would result in partial data extraction, the Parse All Content functionality lets users significantly improve the accuracy and reliability of the extracted data. By ensuring that all content is fully reloaded, the agent can capture the entire data set, preventing issues like missing fields or incomplete records in the output.

  4. Error Recovery: If the agent encounters a page where not all elements have loaded correctly, the Parse All Content command acts as an automatic recovery mechanism. Instead of allowing the agent to proceed with incomplete data or logging an error, this command reattempts the loading of all content on the page, providing a second chance to fully capture the data without manual intervention.

  5. Optimization for Complex Pages: For websites with highly dynamic content—such as those using AJAX, asynchronous JavaScript, or complex CSS rendering—the initial page load may only include the basic structure, with additional elements being loaded in the background. The Parse All Content functionality ensures that even after the main content is loaded, the agent will reprocess and capture any additional elements that may have appeared later.

  6. Better Handling of Web Scraping in Real-Time: The Parse All Content feature is also useful for real-time web scraping scenarios where the page content changes rapidly, such as news feeds, stock tickers, or social media pages. Forcing the agent to reload all the content ensures that even late-arriving data is captured, maintaining the timeliness and completeness of the scraped data.

  7. User-Friendly Status Bar Integration: The Parse All Content option is easily accessible via the Sequentum Cloud status bar, making it convenient for users to manually trigger a full content reload if they notice incomplete content during agent execution. This seamless integration allows users to take corrective action in real time, without needing to restart the entire scraping session.

  8. Enhancing Agent Robustness: By incorporating the Parse All Content command into an agent’s workflow, users can create more robust scraping agents that are capable of handling unpredictable web behavior, such as slow server responses or dynamic content rendering. This ensures that even in less-than-optimal network conditions, the agent can still extract complete and accurate data sets.
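Conceptually, the retry-on-incomplete behavior described above amounts to a reload loop: load the page, check whether the expected content has been parsed, and reload if it has not. The sketch below illustrates that idea in plain Python. It is a minimal, hypothetical analogy, not the Sequentum Cloud API; `fetch_page` and `is_complete` are assumed caller-supplied callables standing in for the platform's internal page load and parse checks.

```python
import time


def parse_all_content(fetch_page, is_complete, max_attempts=3, delay=1.0):
    """Re-fetch a page until its content parses completely.

    fetch_page  -- hypothetical callable returning the page content.
    is_complete -- hypothetical callable deciding whether everything
                   (frames, dynamic elements) has finished rendering.
    """
    for attempt in range(1, max_attempts + 1):
        page = fetch_page()
        if is_complete(page):
            return page
        if attempt < max_attempts:
            # Give slow frames and background scripts time to render
            # before trying again.
            time.sleep(delay)
    raise TimeoutError("content never fully loaded")


# Usage: simulate a page whose dynamic content appears on the second load.
loads = iter(["<div>loading...</div>", "<div>price: 42</div>"])
page = parse_all_content(lambda: next(loads), lambda p: "price" in p, delay=0.0)
```

The loop mirrors the feature's behavior of giving the page "a second chance" rather than proceeding with partial data: the agent only moves on to extraction once the completeness check passes or the retry budget is exhausted.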

Example

  1. As shown in the screenshot below, we opened a website that requires additional frames to be loaded. On the initial load, the content is not parsed completely and appears in red in the editor.

  2. Whenever we try to extract content that is not parsed properly or completely, it is highlighted in red in the editor during selection.

  3. Then, we clicked the “Parse All Content” button in the status bar (refer to the screenshot below).

  4. After reparsing, as you can see, we are able to extract the required content properly.

Conclusion

The Parse All Content functionality in Sequentum Cloud is an essential tool for ensuring the reliability and accuracy of web scraping agents, especially when dealing with slow-loading or dynamic web pages. By allowing the agent to reprocess and reload frames and content that may not have been fully captured during the initial page load, this command helps prevent incomplete data extraction, improves agent performance, and ensures that all relevant information is available for output. Whether dealing with dynamic websites, embedded frames, or delayed content, this feature provides a powerful solution for handling complex web scraping challenges.
