Extract

Extract structured data from PDFs using JSON schema definitions

Introduction

The extract endpoint returns structured data from a PDF document according to a JSON schema that you supply.

Instead of writing custom PDF parsing logic, you describe the fields you need in standard JSON Schema format and reuse the same schema across thousands of documents. The API performs the extraction and returns JSON that follows your schema, ready for use in your application.

How It Works

1. Define Your Schema

Create a JSON schema that describes what data you want to extract from your PDF.

2. Upload Your PDF

Send your PDF file along with the schema to the extract endpoint.

3. Get Structured Data

Receive a structured JSON response that matches your schema definition.

4. Integrate & Scale

Use the extracted data in your application and process thousands of documents with the same schema.
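The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names build_extract_request and send are hypothetical, the header and form field names are the ones listed under Endpoint Details, and sending application/pdf as the part type of the file field is an assumption.

```python
import json
import urllib.request
import uuid

EXTRACT_URL = "https://pdf-toolkit-apis.p.rapidapi.com/extract"
RAPIDAPI_HOST = "pdf-toolkit-apis.p.rapidapi.com"


def build_extract_request(pdf_bytes, filename, schema, api_key, auth_token):
    """Build (without sending) a multipart POST request for the extract endpoint."""
    boundary = uuid.uuid4().hex
    crlf = "\r\n"

    # Step 1: the schema travels as a JSON string in the "schema" form field.
    schema_part = (
        f"--{boundary}{crlf}"
        f'Content-Disposition: form-data; name="schema"{crlf}{crlf}'
        f"{json.dumps(schema)}{crlf}"
    ).encode("utf-8")

    # Step 2: the PDF travels as raw bytes in the "file" form field.
    # The application/pdf part type is an assumption.
    file_header = (
        f"--{boundary}{crlf}"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"{crlf}'
        f"Content-Type: application/pdf{crlf}{crlf}"
    ).encode("utf-8")
    closing = f"{crlf}--{boundary}--{crlf}".encode("utf-8")

    body = schema_part + file_header + pdf_bytes + closing
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "x-rapidapi-host": RAPIDAPI_HOST,
        "x-rapidapi-key": api_key,
        "Authorization": f"Bearer {auth_token}",
    }
    return urllib.request.Request(EXTRACT_URL, data=body, headers=headers, method="POST")


def send(request):
    """Step 3: send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the request needs no network access; only send() does.
request = build_extract_request(
    b"%PDF-1.4 placeholder bytes",
    "invoice.pdf",
    {"type": "object", "properties": {"total": {"type": "number"}}},
    "YOUR_RAPIDAPI_KEY",
    "YOUR_AUTH_TOKEN",
)
```

Separating request construction from sending keeps the same schema reusable across many files (step 4): build one request per PDF in a loop and pass each to send().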

Endpoint Details

POST https://pdf-toolkit-apis.p.rapidapi.com/extract

Headers

Key              Value
Content-Type     multipart/form-data
x-rapidapi-host  pdf-toolkit-apis.p.rapidapi.com
x-rapidapi-key   YOUR_RAPIDAPI_KEY
Authorization    Bearer YOUR_AUTH_TOKEN

Request Body

Parameter   Type     Required  Description                        Constraints
file        File     Yes       The PDF file to extract data from  Size: 0-10240 KB
schema      String   Yes       The JSON schema definition         Valid JSON
start_page  Integer  No        Page to start extraction from      Min: 0
end_page    Integer  No        Page to end extraction at          Min: 0
language    String   No        The language for extraction        Values: 'en' or 'es', Default: 'en'
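The constraints above can be checked on the client before a request is sent. The following is a sketch under two assumptions: the function name check_extract_params is hypothetical, and 1 KB is taken to mean 1024 bytes. The API's own validation remains authoritative.

```python
import json

MAX_FILE_KB = 10240                # upper bound listed for "file"
ALLOWED_LANGUAGES = {"en", "es"}   # values listed for "language"


def check_extract_params(file_size_bytes, schema_text,
                         start_page=None, end_page=None, language="en"):
    """Return a list of constraint violations; an empty list means none were found."""
    problems = []

    # Assumption: 1 KB = 1024 bytes.
    if file_size_bytes > MAX_FILE_KB * 1024:
        problems.append(f"file exceeds {MAX_FILE_KB} KB")

    # "schema" is sent as a string and must parse as JSON.
    try:
        json.loads(schema_text)
    except json.JSONDecodeError as exc:
        problems.append(f"schema is not valid JSON: {exc.msg}")

    # Both page parameters are optional, with a minimum of 0 when present.
    for name, value in (("start_page", start_page), ("end_page", end_page)):
        if value is not None and value < 0:
            problems.append(f"{name} must be >= 0")

    if language not in ALLOWED_LANGUAGES:
        problems.append("language must be 'en' or 'es'")

    return problems


ok = check_extract_params(2048, '{"type": "object"}')
bad = check_extract_params(11 * 1024 * 1024, "{not json", start_page=-1, language="fr")
```

Here ok is an empty list, while bad holds one message for each of the four violated constraints.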

JSON Schema Definition

The JSON schema defines what data to extract from your PDF document. It uses the standard JSON Schema format (draft-07) to specify the structure of data you expect to receive.

Schema Format
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "property_name": {
      "type": "string | number | object | array",
      "description": "Optional description of the property"
    }
  },
  "required": ["property_name"]
}

You can nest objects and arrays and give each field a data type so that the extracted data matches your requirements. The "type" value in the format above lists alternatives; use one of them for each property.
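Schemas in this format can also be assembled programmatically and serialized into the string expected by the schema parameter. The sketch below assumes a hypothetical helper named make_schema; the property names and the description text are illustrative only.

```python
import json

DRAFT_07 = "http://json-schema.org/draft-07/schema#"


def make_schema(properties, required=()):
    """Wrap property definitions in a draft-07 object schema."""
    # Catch typos early: every required name must be a declared property.
    unknown = [name for name in required if name not in properties]
    if unknown:
        raise ValueError(f"required names not in properties: {unknown}")
    return {
        "$schema": DRAFT_07,
        "type": "object",
        "properties": properties,
        "required": list(required),
    }


# A nested object used as the element type of an array.
line_item = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["name", "total"],
}

schema = make_schema(
    {
        "items": {"type": "array", "items": line_item},
        "total": {"type": "number", "description": "Grand total of the invoice"},
    },
    required=["items", "total"],
)

# The "schema" form field takes the schema as a JSON string.
schema_text = json.dumps(schema)
```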

Example Usage

The following request extracts the line items and the total from an invoice PDF:

cURL Example
curl --request POST \
  --url https://pdf-toolkit-apis.p.rapidapi.com/extract \
  --header 'Content-Type: multipart/form-data' \
  --header 'x-rapidapi-host: pdf-toolkit-apis.p.rapidapi.com' \
  --header 'x-rapidapi-key: YOUR_RAPIDAPI_KEY' \
  --header 'Authorization: Bearer YOUR_AUTH_TOKEN' \
  --form 'file=@/path/to/invoice.pdf' \
  --form 'schema={
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
      "invoice": {
        "type": "object",
        "properties": {
          "items": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "total": { "type": "number" },
                "name": { "type": "string" }
              },
              "required": ["total", "name"]
            }
          },
          "total": { "type": "number" }
        },
        "required": ["items", "total"]
      }
    },
    "required": ["invoice"]
  }'

Response (JSON)
{
  "invoice": {
    "items": [
      {
        "name": "Web Design Services",
        "total": 2500.00
      },
      {
        "name": "Hosting (Annual)",
        "total": 1200.00
      },
      {
        "name": "SEO Package",
        "total": 4200.00
      },
      {
        "name": "Content Creation",
        "total": 1800.00
      },
      {
        "name": "Arabic Ceramic Vase - Arabic Ceramic Vase",
        "total": 3200.00
      }
    ],
    "total": 12900.00
  }
}
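Extracted numbers are worth a sanity check before they are used downstream. As a small sketch, the following parses the response shown above and confirms that the line-item totals add up to the invoice total. The function name items_match_total and the tolerance value are assumptions.

```python
import json
import math

# Response body copied from the example above.
response_text = """
{
  "invoice": {
    "items": [
      {"name": "Web Design Services", "total": 2500.00},
      {"name": "Hosting (Annual)", "total": 1200.00},
      {"name": "SEO Package", "total": 4200.00},
      {"name": "Content Creation", "total": 1800.00},
      {"name": "Arabic Ceramic Vase - Arabic Ceramic Vase", "total": 3200.00}
    ],
    "total": 12900.00
  }
}
"""


def items_match_total(invoice, tolerance=0.01):
    """Return True when the line-item totals add up to the invoice total."""
    items_sum = math.fsum(item["total"] for item in invoice["items"])
    return math.isclose(items_sum, invoice["total"], abs_tol=tolerance)


invoice = json.loads(response_text)["invoice"]
print(items_match_total(invoice))  # True: 2500 + 1200 + 4200 + 1800 + 3200 = 12900
```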

Use Cases

Financial Documents

Extract data from invoices, receipts, and financial reports with precise JSON schema definitions.

  • Invoice data including items, prices, and totals
  • Financial statements with structured data fields
  • Purchase orders and payment information

Invoice Schema Example
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "invoice": {
      "type": "object",
      "properties": {
        "number": { "type": "string" },
        "date": { "type": "string" },
        "due_date": { "type": "string" },
        "items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "description": { "type": "string" },
              "quantity": { "type": "number" },
              "price": { "type": "number" },
              "amount": { "type": "number" }
            }
          }
        },
        "subtotal": { "type": "number" },
        "tax": { "type": "number" },
        "total": { "type": "number" }
      }
    }
  }
}
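A response that follows this schema can be mapped onto typed objects. Because the schema declares no "required" fields, any value may be absent, so the sketch below treats every field as optional. The class and function names are hypothetical, the sample payload is made up for illustration, and the check that amount equals quantity times price is an assumption about how these fields relate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LineItem:
    description: Optional[str]
    quantity: Optional[float]
    price: Optional[float]
    amount: Optional[float]


def parse_items(payload) -> List[LineItem]:
    """Convert the "items" array of a response into LineItem objects."""
    invoice = payload.get("invoice", {})
    return [
        LineItem(
            item.get("description"),
            item.get("quantity"),
            item.get("price"),
            item.get("amount"),
        )
        for item in invoice.get("items", [])
    ]


def inconsistent_items(items, tolerance=0.01):
    """Return items whose amount differs from quantity * price (assumed relation)."""
    flagged = []
    for item in items:
        if None in (item.quantity, item.price, item.amount):
            continue  # incomplete rows cannot be checked
        if abs(item.quantity * item.price - item.amount) > tolerance:
            flagged.append(item)
    return flagged


# Made-up payload for illustration; not real API output.
sample = {
    "invoice": {
        "items": [
            {"description": "Widget", "quantity": 2, "price": 10.0, "amount": 20.0},
            {"description": "Gadget", "quantity": 3, "price": 5.0, "amount": 16.0},
            {"description": "Shipping", "amount": 4.5},
        ]
    }
}

items = parse_items(sample)
flagged = inconsistent_items(items)
```

With this sample, flagged contains only the "Gadget" row, because 3 × 5.0 does not equal 16.0; the "Shipping" row is skipped because it has no quantity or price.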

Data Visualization

Extract categories, labels, and data from charts and graphs in PDF documents.

  • Chart categories and labels
  • Numerical data from visualizations
  • Data series and trends

Chart Data Schema Example
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "categories": {
      "type": "array",
      "items": {
        "type": "string",
        "description": "Unique label of the data, without percentages"
      }
    }
  },
  "required": [
    "categories"
  ]
}
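The description in this schema asks for unique labels without percentages. As a safety net, the same rule can be applied to the returned list on the client. The sketch below is one possible approach: the function name clean_categories is hypothetical, the input labels are made up, and treating labels as duplicates regardless of letter case is a design choice rather than documented API behavior.

```python
import re

# Matches fragments such as " 45%", " (30%)" or " 25 %".
_PERCENT = re.compile(r"\s*\(?\d+(?:[.,]\d+)?\s*%\)?")


def clean_categories(categories):
    """Strip percentage fragments and drop duplicate or empty labels, keeping order."""
    seen = set()
    cleaned = []
    for label in categories:
        text = _PERCENT.sub("", label).strip(" -:")
        key = text.casefold()  # compare labels case-insensitively
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


# Made-up labels for illustration.
labels = clean_categories(["Sales 45%", "Marketing (30%)", "sales", "R&D: 25 %", ""])
print(labels)  # ['Sales', 'Marketing', 'R&D']
```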

Text Documents

Extract titles, sections, and paragraphs from text-heavy documents.

  • Document title and subtitle
  • Section headers and content
  • Paragraphs and text blocks

Document Schema Example
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "title": {
      "type": "string"
    },
    "subtitle": {
      "type": "string"
    },
    "paragraphs": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "title",
    "subtitle",
    "paragraphs"
  ]
}
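Since this schema marks all three fields as required, a client may want to confirm they are present before using a response. The sketch below walks a schema and reports missing required keys. The function name missing_required is hypothetical and the sample data is made up. It checks key presence only; a full validator such as the third-party jsonschema package would also check types.

```python
def missing_required(data, schema, path=""):
    """Recursively list required keys that are absent from data."""
    missing = []
    if schema.get("type") == "object" and isinstance(data, dict):
        for key in schema.get("required", []):
            if key not in data:
                missing.append(f"{path}{key}")
        # Descend into the properties that are present.
        for key, subschema in schema.get("properties", {}).items():
            if key in data:
                missing.extend(missing_required(data[key], subschema, f"{path}{key}."))
    elif schema.get("type") == "array" and isinstance(data, list):
        # Check every element against the array's item schema.
        for index, element in enumerate(data):
            missing.extend(
                missing_required(element, schema.get("items", {}), f"{path}{index}.")
            )
    return missing


document_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "subtitle": {"type": "string"},
        "paragraphs": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "subtitle", "paragraphs"],
}

# Made-up response for illustration; "subtitle" is deliberately left out.
extracted = {"title": "Annual Report", "paragraphs": ["First paragraph."]}
print(missing_required(extracted, document_schema))  # ['subtitle']
```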