Web Data Extraction Techniques

Sequentum Cloud is designed to work seamlessly with a variety of websites, regardless of complexity. Still, understanding a few key data extraction techniques can help you get better results.

Key Concepts:

HTML Content: Understanding HTML is fundamental because it forms the backbone of most web pages. HTML tags define the structure and content of a page, and Sequentum Cloud uses them to extract data efficiently.

Many websites offer HTML tutorials. Here is one example: http://www.w3schools.com/html/html_intro.asp
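
To illustrate how tags mark out the data on a page, here is a minimal Python sketch using only the standard library. It is not how Sequentum Cloud works internally. The sample page, the "product" class name, and the ProductParser and extract_products names are all invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for illustration only.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Product List</h1>
    <ul>
      <li class="product">Widget</li>
      <li class="product">Gadget</li>
      <li class="note">Prices exclude tax</li>
    </ul>
  </body>
</html>
"""


class ProductParser(HTMLParser):
    """Collects the text of every <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.in_product = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "li" and dict(attrs).get("class") == "product":
            self.in_product = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_product = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a matching <li>.
        if self.in_product and data.strip():
            self.products.append(data.strip())


def extract_products(html):
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


print(extract_products(SAMPLE_HTML))  # ['Widget', 'Gadget']
```

The tag name and the class attribute together identify which text is data and which is surrounding page content, which is why knowing a page's HTML structure helps when configuring an extraction.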

Dynamic Websites: Dynamic websites rely on JavaScript to load content in real time. Sequentum Cloud can detect these changes, ensuring data is extracted correctly, even as new elements load.

When Sequentum Cloud is unable to configure an agent automatically, familiarity with JavaScript can make it much easier to configure a web data extraction agent for a dynamic website yourself. You can learn more from the many JavaScript tutorials available on the web, such as: https://www.w3schools.com/js/default.asp

XPath: This selection syntax allows precise targeting of specific elements within a web page’s HTML structure, improving data extraction accuracy. We recommend this as a good place to learn more about XPath: https://www.w3schools.com/xml/xpath_syntax.asp

We also recommend this reference of common XPath patterns, which is popular among Selenium users:

https://www.red-gate.com/simple-talk/wp-content/uploads/imported/1269-Locators_table_1_0_2.pdf?file=4937
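
As a rough illustration of XPath-style selection, the following Python sketch uses the standard library's xml.etree.ElementTree. That module supports only a limited subset of XPath and needs well-formed markup, which many real pages are not. The sample markup, the ids and class names, and the extract_items function are assumptions made for this example and do not reflect Sequentum Cloud's own XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for illustration only.
SAMPLE_MARKUP = """
<html>
  <body>
    <div id="catalog">
      <div class="item"><span class="name">Widget</span><span class="price">9.99</span></div>
      <div class="item"><span class="name">Gadget</span><span class="price">24.50</span></div>
    </div>
    <div id="footer"><span class="name">Not a product</span></div>
  </body>
</html>
"""


def extract_items(markup):
    root = ET.fromstring(markup.strip())
    items = []
    # Select only the item <div>s that sit directly inside the catalog <div>.
    for item in root.findall(".//div[@id='catalog']/div[@class='item']"):
        # Paths below are relative to the current item element.
        name = item.find("span[@class='name']").text
        price = float(item.find("span[@class='price']").text)
        items.append((name, price))
    return items


print(extract_items(SAMPLE_MARKUP))  # [('Widget', 9.99), ('Gadget', 24.5)]
```

The footer span shares the "name" class but is skipped because the path requires it to be inside the catalog element. This is the kind of precision that a path-based selector gives over matching on a tag or class name alone.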

Regular Expressions: When dealing with complex or embedded content, regular expressions help isolate specific data points from larger blocks of text.

For more information and resources on regular expressions, visit: http://www.regular-expressions.info/reference.html
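
The short Python sketch below shows the idea of isolating data points from a larger block of text with the standard re module. The sample text, the patterns, and the extract_fields function are invented for illustration. Regular expression syntax varies slightly between tools, so treat these patterns as a starting point and not as exact Sequentum Cloud syntax.

```python
import re

# Hypothetical block of extracted text, invented for illustration only.
SAMPLE_TEXT = (
    "Order #10482 shipped on 2024-03-15. "
    "Contact: sales@example.com or support@example.com. "
    "Total: $1,249.50"
)

ORDER_RE = re.compile(r"Order #(\d+)")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PRICE_RE = re.compile(r"\$([\d,]+\.\d{2})")


def extract_fields(text):
    order = ORDER_RE.search(text)
    date = DATE_RE.search(text)
    price = PRICE_RE.search(text)
    return {
        # group(1) is the captured part inside the parentheses.
        "order_id": order.group(1) if order else None,
        "date": date.group(0) if date else None,
        "emails": EMAIL_RE.findall(text),
        # Remove thousands separators before converting to a number.
        "total": float(price.group(1).replace(",", "")) if price else None,
    }


print(extract_fields(SAMPLE_TEXT))
```

Each pattern targets one data point, and the capture groups return only the value itself without the surrounding label, such as "Order #" or "$".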

With Sequentum Cloud’s user-friendly interface, users can efficiently manage web data extraction tasks. For those with advanced technical knowledge, tools like XPath and regular expressions are available for more customized configurations.