The extract endpoint lets you extract structured data from any PDF document using JSON schema definitions.
Instead of building complex PDF parsing logic, you can define your data extraction needs using standard JSON schema format and process thousands of documents consistently. The API handles all the complex extraction work, delivering clean, structured JSON data ready for your application.
1. Create a JSON schema that describes the data you want to extract from your PDF.
2. Send your PDF file along with the schema to the extract endpoint.
3. Receive a structured JSON response that matches your schema definition.
4. Use the extracted data in your application, reusing the same schema across thousands of documents.
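
The workflow above can be sketched in Python using only the standard library. This is a minimal sketch, not an official client. The helper names `build_extract_request` and `send_extract_request` are hypothetical. The URL, header names, and form field names are copied from the curl example in this document. The `application/pdf` part type and the assumption that a successful response body is JSON are not confirmed by the source.

```python
import json
import uuid
import urllib.request

# URL taken from the curl example in this document.
EXTRACT_URL = "https://pdf-toolkit-apis.p.rapidapi.com/extract"


def build_extract_request(pdf_bytes, schema, api_key, auth_token, filename="document.pdf"):
    """Build (but do not send) a multipart/form-data POST for the extract endpoint."""
    boundary = uuid.uuid4().hex

    # Form field "schema": the JSON schema serialized to a string.
    schema_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="schema"\r\n\r\n'
        f"{json.dumps(schema)}\r\n"
    ).encode("utf-8")

    # Form field "file": the raw PDF bytes (part type is an assumption).
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8") + pdf_bytes + b"\r\n"

    closing = f"--{boundary}--\r\n".encode("utf-8")
    body = schema_part + file_part + closing

    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "x-rapidapi-host": "pdf-toolkit-apis.p.rapidapi.com",
        "x-rapidapi-key": api_key,
        "Authorization": f"Bearer {auth_token}",
    }
    return urllib.request.Request(EXTRACT_URL, data=body, headers=headers, method="POST")


def send_extract_request(request, timeout=60):
    """Send the request and decode the response body as JSON (assumed format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Building the request separately from sending it keeps the schema reusable: the same `schema` dictionary can be passed to `build_extract_request` for every document in a batch.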

Send the following headers with each request:

Key | Value |
---|---|
Content-Type | multipart/form-data |
x-rapidapi-host | pdf-toolkit-apis.p.rapidapi.com |
x-rapidapi-key | YOUR_RAPIDAPI_KEY |
Authorization | Bearer YOUR_AUTH_TOKEN |

The request body accepts the following form parameters:

Parameter | Type | Description | Constraints |
---|---|---|---|
file | File | The PDF file to extract data from | Size: 0-10240 KB, Required |
schema | String | The JSON schema definition | Valid JSON, Required |
start_page | Integer | Page to start extraction from | Min: 0, Optional |
end_page | Integer | Page to end extraction at | Min: 0, Optional |
language | String | The language for extraction | Values: 'en' or 'es', Default: 'en', Optional |
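
Checking these constraints on the client before uploading can save a round trip. The sketch below is one possible pre-flight check. The function name `validate_extract_params` is hypothetical, and treating 1 KB as 1024 bytes is an assumption, since the table does not say which convention the API uses. It does not check that `start_page` comes before `end_page`, because the table does not state that rule.

```python
import json

MAX_FILE_KB = 10240              # upper bound from the parameter table
ALLOWED_LANGUAGES = {"en", "es"}  # values from the parameter table


def validate_extract_params(file_size_bytes, schema, start_page=None, end_page=None, language="en"):
    """Return a list of problems; an empty list means the parameters look valid."""
    errors = []

    # Assumption: 1 KB = 1024 bytes.
    if file_size_bytes > MAX_FILE_KB * 1024:
        errors.append(f"file must be at most {MAX_FILE_KB} KB")

    # The schema parameter is sent as a string and must parse as JSON.
    try:
        json.loads(schema)
    except (TypeError, ValueError):
        errors.append("schema must be a valid JSON string")

    # Both page parameters are optional; when given they must be integers >= 0.
    for name, value in (("start_page", start_page), ("end_page", end_page)):
        if value is not None and (not isinstance(value, int) or value < 0):
            errors.append(f"{name} must be an integer >= 0")

    if language not in ALLOWED_LANGUAGES:
        errors.append("language must be 'en' or 'es'")

    return errors
```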
The JSON schema defines what data to extract from your PDF document. It uses the standard JSON Schema format (draft-07) to specify the structure of data you expect to receive.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"property_name": {
"type": "string | number | object | array",
"description": "Optional description of the property"
}
},
"required": ["property_name"]
}
You can define complex nested objects, arrays, and specify data types for each field to ensure the extracted data matches your requirements.
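
Nested schemas are easier to maintain when they are assembled from small pieces. The following sketch builds the invoice schema used in this document's request example from three hypothetical helpers (`field`, `arr`, and `obj` are illustrative names, not part of the API) and serializes it into the string expected by the `schema` parameter.

```python
import json


def field(type_name, description=None):
    """Schema for a scalar property, with an optional description."""
    schema = {"type": type_name}
    if description:
        schema["description"] = description
    return schema


def arr(items):
    """Schema for an array whose elements all follow `items`."""
    return {"type": "array", "items": items}


def obj(properties, required=None):
    """Schema for an object with the given properties."""
    schema = {"type": "object", "properties": properties}
    if required:
        schema["required"] = list(required)
    return schema


# One line item: a name and its total.
line_item = obj(
    {"total": field("number"), "name": field("string")},
    required=["total", "name"],
)

# Top-level schema: an invoice holding a list of line items and a grand total.
invoice_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    **obj(
        {"invoice": obj({"items": arr(line_item), "total": field("number")},
                        required=["items", "total"])},
        required=["invoice"],
    ),
}

# The API expects the schema as a JSON string.
schema_param = json.dumps(invoice_schema)
```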
Here's how to use the extract endpoint to extract data from an invoice PDF:
curl --request POST \
--url https://pdf-toolkit-apis.p.rapidapi.com/extract \
--header 'Content-Type: multipart/form-data' \
--header 'x-rapidapi-host: pdf-toolkit-apis.p.rapidapi.com' \
--header 'x-rapidapi-key: YOUR_RAPIDAPI_KEY' \
--header 'Authorization: Bearer YOUR_AUTH_TOKEN' \
--form 'file=@/path/to/invoice.pdf' \
--form 'schema={
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"invoice": {
"type": "object",
"properties": {
"items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"total": { "type": "number" },
"name": { "type": "string" }
},
"required": ["total", "name"]
}
},
"total": { "type": "number" }
},
"required": ["items", "total"]
}
},
"required": ["invoice"]
}'

The response is a JSON object that matches the schema:

{
"invoice": {
"items": [
{
"name": "Web Design Services",
"total": 2500.00
},
{
"name": "Hosting (Annual)",
"total": 1200.00
},
{
"name": "SEO Package",
"total": 4200.00
},
{
"name": "Content Creation",
"total": 1800.00
},
{
"name": "Arabic Ceramic Vase - Arabic Ceramic Vase",
"total": 3200.00
}
],
"total": 12900.00
}
}
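
Once a response like the one above arrives, a quick consistency check can catch extraction mistakes before the data reaches your application. The sketch below verifies that the line item totals add up to the invoice total. The function name `check_invoice_total` is hypothetical, and parsing numbers as `Decimal` is a design choice made here to avoid float rounding on monetary values, not something the API prescribes.

```python
import json
from decimal import Decimal

# Sample body copied from the example response in this document.
SAMPLE_RESPONSE = """
{"invoice": {"items": [
  {"name": "Web Design Services", "total": 2500.00},
  {"name": "Hosting (Annual)", "total": 1200.00},
  {"name": "SEO Package", "total": 4200.00},
  {"name": "Content Creation", "total": 1800.00},
  {"name": "Arabic Ceramic Vase - Arabic Ceramic Vase", "total": 3200.00}
], "total": 12900.00}}
"""


def check_invoice_total(response_text):
    """Return (matches, items_sum) for an invoice-shaped response body."""
    # parse_float=Decimal keeps decimal amounts exact instead of using binary floats.
    data = json.loads(response_text, parse_float=Decimal)
    invoice = data["invoice"]
    items_sum = sum((item["total"] for item in invoice["items"]), Decimal("0"))
    return items_sum == invoice["total"], items_sum


matches, items_sum = check_invoice_total(SAMPLE_RESPONSE)
print(matches, items_sum)
```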
Extract data from invoices, receipts, and financial reports with precise JSON schema definitions.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"invoice": {
"type": "object",
"properties": {
"number": { "type": "string" },
"date": { "type": "string" },
"due_date": { "type": "string" },
"items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": { "type": "string" },
"quantity": { "type": "number" },
"price": { "type": "number" },
"amount": { "type": "number" }
}
}
},
"subtotal": { "type": "number" },
"tax": { "type": "number" },
"total": { "type": "number" }
}
}
}
}
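
The schema above declares no `required` properties, so it is reasonable to assume that some fields may be absent when a document does not contain them. The sketch below is a small structural check for responses. It covers only the subset of JSON Schema used in this document (`type`, `properties`, `required`, `items`) and is not a full draft-07 validator. The function name `find_mismatches` is hypothetical.

```python
def find_mismatches(value, schema, path="$"):
    """Return a list of messages describing where value deviates from schema."""
    problems = []
    expected = schema.get("type")

    if expected == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing required property")
        # Optional properties are checked only when they are present.
        for key, subschema in schema.get("properties", {}).items():
            if key in value:
                problems.extend(find_mismatches(value[key], subschema, f"{path}.{key}"))
    elif expected == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        item_schema = schema.get("items", {})
        for index, item in enumerate(value):
            problems.extend(find_mismatches(item, item_schema, f"{path}[{index}]"))
    elif expected == "string":
        if not isinstance(value, str):
            problems.append(f"{path}: expected string")
    elif expected == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{path}: expected number")

    return problems
```

An empty list means the response has the expected shape. Each message starts with a path such as `$.invoice.items[0].quantity`, which makes it easy to locate the offending field.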
Extract categories, labels, and data from charts and graphs in PDF documents.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"categories": {
"type": "array",
"items": {
"type": "string",
"description": "Unique label of the data, without percentages"
}
}
},
"required": [
"categories"
]
}
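
The `description` in this schema asks for unique labels without percentages. As a client-side safety net, the returned list can be normalized again. The sketch below is one way to do that. The function name `clean_categories` is hypothetical, and the regular expression encodes an assumption about how percentages might appear in chart labels (for example `35%` or `(20.5%)`).

```python
import re

# Matches a percentage such as " 35%", "(20.5%)" or "12,5 %" (assumed formats).
_PERCENT = re.compile(r"\s*\(?\d+(?:[.,]\d+)?\s*%\)?")


def clean_categories(categories):
    """Strip percentages and drop case-insensitive duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for label in categories:
        # Remove the percentage, then trim leftover separators and spaces.
        text = _PERCENT.sub("", label).strip(" -:")
        key = text.casefold()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned
```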
Extract titles, sections, and paragraphs from text-heavy documents.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"title": {
"type": "string"
},
"subtitle": {
"type": "string"
},
"paragraphs": {
"type": "array",
"items": {
"type": "string"
}
}
},
"required": [
"title",
"subtitle",
"paragraphs"
]
}
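
Because this schema marks all three properties as required, a response can be loaded into a typed container that fails early when a key is missing. The sketch below shows one way to do this with a dataclass. The `TextDocument` class and its methods are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

# Keys listed under "required" in the schema above.
REQUIRED_KEYS = ("title", "subtitle", "paragraphs")


@dataclass
class TextDocument:
    title: str
    subtitle: str
    paragraphs: List[str]

    @classmethod
    def from_response(cls, data):
        """Build a TextDocument from a parsed response, checking required keys first."""
        missing = [key for key in REQUIRED_KEYS if key not in data]
        if missing:
            raise KeyError(f"response is missing required keys: {missing}")
        return cls(data["title"], data["subtitle"], list(data["paragraphs"]))

    def word_count(self):
        """Count whitespace-separated words across all paragraphs."""
        return sum(len(paragraph.split()) for paragraph in self.paragraphs)
```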