Building Your First Agent
The following diagram shows the key steps for building, maintaining, scheduling, and monitoring Sequentum Cloud agents. We provide links to other topics that explain each of these steps in further detail.
Basic web data extraction agent creation process
In this section, we will identify the data elements on your target website and create your first Sequentum Cloud agent. We'll work through each of the above steps with examples that match common web data extraction use cases, so you can get comfortable building agents on your own and at your own pace.
Choosing a Start URL
The Start URL is the web page where data collection begins; it is the starting point of your web data extraction agent.
In the following sections, we'll use the Cruise Direct website for our example.
Note: In this example, we start from the Cruise Direct home page. However, if the data you require is not located on the home page, you can start the agent from a sub-page instead. Starting closer to the data makes the agent more efficient, so it's worth taking the time to choose a more specific Start URL.
We start by pasting the start web page URL from the target website (http://www.cruisedirect.com) into the Sequentum Cloud Address Bar.
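Before building anything in the editor, it can help to confirm that the Start URL is reachable. The following is a minimal sketch, run outside of Sequentum Cloud, using Python's requests library; the timeout value and User-Agent header are illustrative assumptions.

    import requests

    START_URL = "http://www.cruisedirect.com"

    # Quick sanity check that the Start URL responds before we build the agent.
    # The header and timeout below are arbitrary choices for this sketch.
    response = requests.get(
        START_URL,
        headers={"User-Agent": "Mozilla/5.0 (agent-build sanity check)"},
        timeout=30,
    )
    print(response.status_code)                   # expect 200
    print(response.headers.get("Content-Type"))   # expect text/html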
In the next section, Select the Content to Capture, we will continue to use the Cruise Direct website data for our example.
Select the Content to Capture
In the previous section, we selected our Start URL and loaded the web page into the Sequentum Cloud web editor. Next, you can select the data you want to capture and start building your first web data extraction agent. In our Cruise Direct example, we plan to search for available cruise vacations and then extract details about each cruise.
First, we need to perform a search to retrieve the data for the available cruises. For this we will enter a destination and a month of departure. Click once on the "Select Destination" field on the web page, then click the "Selection" tab at the bottom of the screen. Then click the "Select" button to open the drop-down options. Select "Bahamas" and click "Add Command".
Now rename your command to something intuitive like “Destinations” by clicking on the command name and typing a new name, then hitting enter.
Now click on the “Options” button in the top right corner to configure some additional settings for the “Destinations” command.
Click on the tab called “Browser” and scroll down to the category called “Dynamic browser settings”.
From here, we'll change the wait settings to wait for asynchronously loaded content. You can read what each property does in the help text that appears in the right pane, but for now, just know that we need this change because new content is loaded asynchronously every time we interact with the website.
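To see why a wait setting matters, here is a minimal, generic polling sketch in Python (not Sequentum Cloud functionality): instead of pausing for a fixed delay, the agent should keep checking until the asynchronously loaded content actually appears. The timeout and interval values are arbitrary assumptions.

    import time

    def wait_for(condition, timeout=10.0, interval=0.25):
        """Poll `condition` until it returns a truthy value or the timeout expires."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            result = condition()
            if result:
                return result
            time.sleep(interval)
        raise TimeoutError("async content did not appear in time")

    # Example usage with a trivial condition that is immediately true:
    print(wait_for(lambda: time.monotonic() > 0))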
After pressing save, click back on the “Browser” button in the top right corner to go back to the interactive browser screen.
Next, hover over the "Destinations" command and press the lightning bolt icon to execute the newly generated command.
We will now do the same for the "Select Month" form field. Select the latest date, rename the command to "Months", and make the same changes in the "Options" settings as we did for the Destinations command. Press the lightning bolt again to execute the command.
Now we'll perform the search to get the cruise details. Click the orange "Search" button, and from the "Extracted Content" tab below, click the "Action" button. Rename the command to "Search".
While hovering over the "Search" command, you'll notice a lightning bolt icon. Click it to manually execute the new search command.
We are now redirected to the results page showing cruises that match the filters we previously set. To capture any information from this page, we simply point and click on the elements we want to extract. At the bottom of the screen, switch from the default "Extracted Content" tab to the "Selection" tab. The "Selection" view lets us select page elements even more easily than clicking on them directly in the page.
Using your mouse, try clicking the object on the page that contains all of the cruise information. As you click, the object is highlighted and surrounded by a bounding box. On a site like this, it can be tricky to click the whole object you want because it runs off screen, so an easier method in this case is to select from the underlying page elements instead. To do so, go to the "Selection" tab and select a div that contains all of the information we want to extract (e.g. cruise name, ratings, duration, description, prices). In our case, this is the div with a class of "views-row".
Once we have selected the <div class="views-row">, scroll down the page, hold down the Shift key, and click on the same div in the second list item. If you performed this step correctly, the selection box will show "5 Selected" elements.
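The point-and-click selection above corresponds to a CSS-style selection of the repeated "views-row" containers. Here is a minimal sketch of the same idea in Python with BeautifulSoup, run against a small hypothetical HTML fragment (the markup below is illustrative, not the actual Cruise Direct page source).

    from bs4 import BeautifulSoup

    # Hypothetical, simplified markup standing in for the results page.
    html = """
    <div class="views-row"><h3>Bahamas Getaway</h3><span>Departing From: Miami</span></div>
    <div class="views-row"><h3>Island Hopper</h3><span>Departing From: Port Canaveral</span></div>
    """

    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("div.views-row")   # the same containers the agent selects visually
    print(len(rows))                      # 2 in this fragment; 5 on the real results page
    for row in rows:
        print(row.h3.get_text(strip=True))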
Click the "Extract" button and you will notice that a new "Selection List" command is generated along with a "Section" command. Rename the command to "Name", click on the name of the cruise while the command is still in edit mode, and press "Save Selection" in the bottom right corner to save the changes.
Now we will add more commands to capture additional information. Click once on the departure location. In the "Selection" tab, click the "Extract" button to generate the command, and rename it to "Departure".
After the Departure command has been generated, click back to the "Extracted Content" tab in the bottom left corner. From here, we can highlight the text we want to keep (e.g. Miami, Port Canaveral) and parse out the string "Departing From". After highlighting the text, click the "Extract selected" button to parse out the data. A regular expression is automatically generated that strips the string "Departing From" and keeps only the port name. [Note: the Firefox browser is not currently supported for this action.]
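For reference, an equivalent hand-written regular expression in Python might look like the sketch below; the exact pattern Sequentum Cloud generates may differ, and the sample strings are illustrative.

    import re

    samples = ["Departing From: Miami", "Departing From: Port Canaveral"]

    # Capture whatever follows the fixed "Departing From" label.
    pattern = re.compile(r"Departing\s+From:?\s*(.+)", re.IGNORECASE)

    for text in samples:
        match = pattern.search(text)
        if match:
            print(match.group(1))   # "Miami", "Port Canaveral"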
Repeat the Select, Extract, and Extract Selected steps above for the fields "Duration", "Ship", "Port of Call", and "Best For".
Now we want to extract pricing information for each cruise. Since there are multiple dates with different prices, we need to generate a list just as we did for the list of cruises. Find a list item on the page with multiple rows of data, click on the first object that contains a cruise date, hold down the Shift key, and select the second page object that contains a cruise date. Now click the "Extract" button and a new sub-list is generated underneath the earlier commands. Rename the "Section" command to "Departure Date", click on the departure date column, and press "Save Selection".
Click on the other columns and capture the interior, ocean view, balcony, and suite prices. Use the same method as before to generate an automatic regular expression that parses out the "USD" part of the prices.
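As with the departure field, a hand-written equivalent of such a price-cleaning regular expression might look like this Python sketch; the sample price strings and the decision to return a float are assumptions for illustration.

    import re

    samples = ["USD 499", "USD 1,299", "from USD 649 pp"]

    # Pull the numeric amount and drop the "USD" label and any thousands separators.
    price_pattern = re.compile(r"USD\s*([\d,]+(?:\.\d+)?)")

    for text in samples:
        match = price_pattern.search(text)
        if match:
            print(float(match.group(1).replace(",", "")))   # 499.0, 1299.0, 649.0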
We will now try to extract some information from the "Bonus" column. Since the values are icons rather than text, clicking on the Bonus column returns no data in the "Extracted Content" section.
To work around this, we simply change what we extract from the column. On the "Extracted Content" tab, click the drop-down list labeled "Text" to see which other attribute types you can extract. Change the selection to "Styled HTML" to pull more information out of the icons, such as the icon names or any other data present in the underlying markup.
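Once the raw markup is extracted, the icon names usually live in attributes such as class or title. A minimal BeautifulSoup sketch of that idea is shown below; the markup and attribute names are hypothetical, not the actual Cruise Direct HTML.

    from bs4 import BeautifulSoup

    # Hypothetical "Styled HTML" captured from a Bonus cell.
    bonus_html = '<i class="icon icon-drink-package" title="Free drink package"></i>'

    icon = BeautifulSoup(bonus_html, "html.parser").find("i")
    print(icon.get("title"))            # "Free drink package"
    print(" ".join(icon.get("class")))  # "icon icon-drink-package"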
We have now extracted all of the fields we want and are almost done creating our first agent.
Next let’s add some pagination. To do this, we first need to determine where to add the pagination command.
In this case, we want to repeat the "Section List" command and everything under it, since every page contains new information that needs to be extracted. To do this, click on the parent of the "Section List" command in the agent explorer, which is the "Search" command.
From here, click on "Action" in the left navigation menu to expand the list of action commands. Hover over the commands in this section to display their names and click the one labeled "Action Repeater":
This creates the "Action Repeater" command. Rename it to "Pagination", scroll down the page, select the element that drives the pagination (in this case, the "Next" button at the bottom of the page), and press "Save Selection".
Press the lightning bolt icon next to the Pagination command to manually execute it. If everything is set up correctly, the pagination will work and you will see the second page.
Now that we have confirmed the Pagination command works, we need to move it to the correct position. Drag the Pagination command up so that it sits above the "Section List" command and just below the "Search" command, and drop it there:
Then drag the "Section List" command onto the Pagination command so that it becomes nested beneath the Pagination command.
If done correctly, your Pagination command should look like the following image.
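Conceptually, nesting the "Section List" under the Pagination command produces the loop sketched below in Python. This is illustrative only: the results URL, the rel="next" selector, and the use of requests/BeautifulSoup are assumptions, and Sequentum Cloud handles all of this for you.

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    url = "http://www.cruisedirect.com/search-results"   # hypothetical results URL
    all_rows = []

    while url:
        page = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

        # "Section List": extract every views-row container on the current page.
        for row in page.select("div.views-row"):
            all_rows.append(row.get_text(" ", strip=True))

        # "Pagination": follow the Next link if there is one, otherwise stop.
        next_link = page.select_one('a[rel="next"]')      # hypothetical selector
        url = urljoin(url, next_link["href"]) if next_link else None

    print(len(all_rows))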
We will now use the Sequentum Cloud visual debugger to check our agent for errors. First, let's publish the agent before we debug it. Press the "Publish" button at the top, name the agent "Cruisedirect", add any comments (optional), and press "Publish".
Now let's debug our agent! Press the "Clear Storage" button next to the "Publish" button at the top to clear all storage before we start. Next, click the "Play" button to start debugging.
Congratulations! You have built your first agent using Sequentum Cloud! Let's stop debugging and move on to the next steps: refining your data, formatting the output, and eventually running your agent and delivering the data to your endpoint.
Data Validation
Creating an agent to extract data is only the first step in your ETL process! Next we will apply data validation to our agent, in the form of validation rules and success criteria, to ensure that the data extracted on every run is actually usable. Click on the "Export" command in the agent explorer to open the command's settings.
Sequentum Cloud exports in CSV format by default. You can also export in JSON format by expanding the "Export" section on the left-hand side of the agent explorer and selecting JSON.
From this page, we can view all of the fields in our schema along with their data type requirements, null-value handling, format styles, format types, and time zones. Feel free to make any changes that apply to your data.
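To make the schema concrete, the sketch below shows how one extracted record might look in each export format. It is written in Python with the standard csv and json modules; the field names mirror the commands we created, and the sample values are invented.

    import csv, io, json

    record = {
        "Name": "Bahamas Getaway",
        "Departure": "Miami",
        "Duration": "4 Nights",
        "Departure Date": "2024-05-06",   # invented sample value
        "Interior Price": 499.0,
    }

    # CSV export (the default format).
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(record)
    print(buffer.getvalue())

    # The same record as JSON.
    print(json.dumps(record, indent=2))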
Next, we'll move on to our success criteria. Click on the main agent command in the agent explorer, and then click on "Success Criteria". Success criteria define a successful run based on predetermined thresholds, such as a minimum number of page loads, a minimum data count, a minimum export count, and/or a maximum number of errors. These can also be expressed relative to the previous run by entering a number between 0 and 100 in the allowed page load variation column.
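The logic behind these thresholds is simple to illustrate. Below is a rough Python sketch of how a run might be judged against such criteria; the metric names, threshold values, and the way the variation percentage is applied are assumptions, not Sequentum Cloud's exact implementation.

    def run_succeeded(metrics, previous, min_export_count=50, max_errors=0,
                      allowed_page_load_variation_pct=20):
        """Return True if the run metrics satisfy the (hypothetical) success criteria."""
        if metrics["export_count"] < min_export_count:
            return False
        if metrics["errors"] > max_errors:
            return False
        # Page loads may not drop more than the allowed percentage below the previous run.
        floor = previous["page_loads"] * (1 - allowed_page_load_variation_pct / 100)
        return metrics["page_loads"] >= floor

    current = {"export_count": 120, "errors": 0, "page_loads": 25}
    last = {"page_loads": 26}
    print(run_succeeded(current, last))   # True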
Configure any set of success criteria and we'll move on to the next section.
Data Delivery
Now that we've set up data validation to ensure we're extracting high-quality data, we will configure a data delivery target so that the data arrives at our endpoint for later processing. The default export format is CSV. Click on the "Export" icon on the left-hand side of the agent explorer to expand the full list of export commands.
As of version 1.16, this list consists of CSV, JSON, S3, Google Sheets, Google Drive, FTP/SFTP, Snowflake, and custom script. In this example, we will export a CSV file to an internal S3 bucket. The S3 destination must first be configured in the main Sequentum Cloud Control Center and then used in the agent.
Go back to the Control Center, click on "Organization" in the bottom left corner, then open the "Destinations" tab and click "New Destination".
Enter a name and description for the new destination. For the destination type, keep the default, S3 bucket. Then enter the bucket name and folder path to which you want to deliver the data. A policy will be generated for you that looks something like the screenshot below.
Scroll down the page and fill out the Role ARN.
Copy the generated policy and trust relationship policy to your external IAM role and press "Test Connection" to upload a test file to your S3 bucket. After this is finished, head back to the agent, add an S3 export, and set the destination to the newly created destination.
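The "Test Connection" step is roughly equivalent to the following Python/boto3 sketch, which uploads a small test file under the configured folder path. The bucket name, folder path, and locally configured AWS credentials are assumptions; this is only a way to verify access outside Sequentum Cloud.

    import boto3

    BUCKET = "my-cruise-data-bucket"        # hypothetical bucket name
    FOLDER = "sequentum/cruisedirect/"      # hypothetical folder path

    # Uses whatever AWS credentials/role are configured in your environment.
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=BUCKET,
        Key=FOLDER + "connection_test.txt",
        Body=b"Sequentum Cloud destination test",
    )
    print("Uploaded test object to s3://%s/%sconnection_test.txt" % (BUCKET, FOLDER))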
Running, Scheduling, and Monitoring Your Agent
Once the export targets are configured, we can run, schedule, and monitor our agent through the Sequentum Cloud Control Center. Head back to the Control Center by clicking the Control Center icon in the top right corner. Once there, click on your personal directory (if that's where you saved the agent); from that dashboard, you'll be able to view the status of all of your agents.
The name, version, updated date, activity date, and run status can be viewed for all agents directly on this page. To edit an agent, simply click on the agent name; you'll be redirected to the runs page, where you can click "Edit Agent" in the right corner.
Note that while the agent is running, you can see the same visual playback you saw when you pressed “Play” to test your agent. Click the “View Live Information” button:
And this will show you the agent run in progress:
Toggle between the Visual icon and the Log icon to switch between the visual playback and the logs. This can be viewed for as many sessions as you have running in parallel.
From this runs page, we can also click the "Info" tab to add descriptions, documentation, and icons, and to view the schema of the run.
We can also click the "Run History" tab to view all of our previous runs. (Note that the "Runs" tab only shows the status of the latest run; all earlier runs are located in the "Run History" tab.)
The Run History page also provides detailed information on each run: actions performed, data points extracted, errors (including page load errors and data validation errors), pages loaded, dynamic pages loaded, request rate, number of requests, and traffic.
From here, we’ll now schedule our agent to run on a daily basis! Click on the “Setup Run” icon on the right.
From here, we can set up the proxies to use for the run. Click the "Schedule" checkbox to display more options for scheduling the agent.
From here, we can set the start date and time, along with the schedule type (run once, run every day, or a custom CRON expression). Press the "Save Task" button when you are done setting up your schedule. Alternatively, press the "Run Now" button if you simply want to run the agent once instead of setting up a schedule. You have now set up your first schedule and will be redirected to the "Tasks" tab, where you can monitor your schedules.
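If you choose a custom CRON expression, a standard five-field expression for a daily run might look like the small Python sketch below. The 06:00 time is an arbitrary example, and whether the scheduler interprets it as UTC or local time should be checked in the scheduler settings.

    # Field order in a standard five-field CRON expression:
    #   minute  hour  day-of-month  month  day-of-week
    daily_at_6am = "0 6 * * *"         # every day at 06:00
    weekdays_at_noon = "0 12 * * 1-5"  # Monday through Friday at 12:00
    print(daily_at_6am, weekdays_at_noon)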
Congratulations, you have fully completed your first agent! In this guide, we built our first agent, set up data validation to ensure we're extracting high-quality data, configured an export target so that data is delivered to our S3 bucket, and finally scheduled the agent to run on a daily basis. If you have any additional questions, please reach out to us at support@sequentum.com.