How to Scrape PDF Data

Extracting text from PDFs is an essential step in data extraction and automation workflows. This guide outlines the process using built-in commands and configurations.

Overview

This process involves reading the PDF, converting its content into an HTML page, and then extracting the content for further processing. Follow the steps below to extract and process text from PDFs:

Loading and reading the PDF:

There are two ways to load a PDF in an agent:
- The PDF file is located in the agent
  - Upload the file into the agent. You can either drag and drop the file or use the ‘Upload File’ button at the top right (refer to the attached snapshot).
  - Input Parameters:
    FileName is the name given to the Input Parameter, and its Value holds the name of the PDF file, including the extension.
  - Add an Open Page command: To locate the PDF file path, select the Script option from the HTML dropdown menu in the Page section and add the following C# code snippet:
```
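#r System.IO.dll

using System.IO;

// Returns the path of the PDF file uploaded into the agent.
public string GetData(RunContext context)
{
    // Read the file name (with extension) from the FileName input parameter.
    var fileName = context.GlobalData.GetString("FileName");
    // Look the file up among the files stored in the agent itself.
    var filePath = context.PrivateFiles.GetFile(fileName);
    return filePath;
}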
```
- The PDF file is stored in a space as a shared file
  - In the agent, add Input Parameters:
    FileName is the name given to the Input Parameter, and its Value holds the name of the PDF file, including the extension.
  - Add an Open Page command: To locate the PDF file path, select the Script option from the HTML dropdown menu in the Page section and add the following C# code snippet:
```
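#r System.IO.dll

using System.IO;

// Returns the path of the PDF file shared in the agent's space.
public string GetData(RunContext context)
{
    // Read the file name (with extension) from the FileName input parameter.
    var fileName = context.GlobalData.GetString("FileName");
    // Look the file up among the shared files of the agent's space.
    var filePath = context.SharedFiles.GetFile(fileName);
    return filePath;
}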
```
    NOTE: In the code above, the call to context.SharedFiles.GetFile uses SharedFiles to reference files stored in the agent's space rather than within the agent itself.

Converting content to an HTML page

- Test the script in the Open Page command. It should return the path of the file.
- Execute the Open Page command to automatically convert the content of the file into an HTML page.
- Then you can use the Content command to select the HTML text and capture the desired content.
  NOTE: Most file formats, including PDF, are not designed for easy conversion to HTML, so the converted output is considerably harder to work with than standard HTML. In such cases, select the entire HTML page and then use Regular Expressions to extract the target content.
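
The regular-expression step mentioned in the note can be sketched in plain C#. This is a standalone illustration, not part of the agent's scripting API: the class and method names (PdfHtmlRegexSketch, ToPlainText, ExtractFirst), the sample HTML, and the patterns are all hypothetical, and the markup your converted PDF produces will likely differ. The idea is to strip the converter-generated tags first, so the patterns only have to match the visible text.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PdfHtmlRegexSketch
{
    // Replaces tags with spaces, decodes HTML entities, and collapses whitespace,
    // so patterns do not have to account for converter-generated markup.
    public static string ToPlainText(string html)
    {
        var withoutTags = Regex.Replace(html, "<[^>]+>", " ");
        var decoded = WebUtility.HtmlDecode(withoutTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    // Returns the first capture group of the pattern, or an empty string if nothing matches.
    public static string ExtractFirst(string html, string pattern)
    {
        var match = Regex.Match(ToPlainText(html), pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        // Made-up sample of converted output: one positioned span per text run.
        var html = "<div><span style=\"left:10px\">Invoice No:</span>" +
                   "<span style=\"left:90px\">INV-00123</span></div>" +
                   "<div><span>Total Due:</span> <span>$1,250.00</span></div>";

        Console.WriteLine(ExtractFirst(html, @"Invoice No:\s*([A-Z]+-\d+)"));      // INV-00123
        Console.WriteLine(ExtractFirst(html, @"Total\s+Due:\s*\$([\d,]+\.\d{2})")); // 1,250.00
    }
}
```

Anchoring each pattern on a label that appears in the PDF text (here "Invoice No:" and "Total Due:") tends to be more robust than matching on the generated tags or positions, which can change from one document to the next.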