MegaPDF - Free PDF Converter, Editor, OCR & Unlock PDF

The Text Extraction API uses Optical Character Recognition (OCR) to extract text from scanned documents, images, and non-searchable PDFs. It converts text that exists only as pixels into machine-readable text that can be searched, copied, and analyzed.

Endpoint

POST https://api.mega-pdf.com/api/pdf/extract-text

Authentication

Authenticate requests using an API key in the x-api-key header.

// Header example
x-api-key: your-api-key

Request Parameters

The API accepts multipart/form-data requests with the following parameters:

Parameter	Type	Description	Required
`file`	File	PDF file or image to extract text from (max 50MB)	Yes
`language`	String	OCR language code (e.g., 'eng' for English, 'fra' for French)	No (default: eng)
`outputFormat`	String	Output format: 'txt', 'json', 'xml', or 'html'	No (default: txt)
`enhanceImages`	Boolean	Preprocess images to improve OCR accuracy	No (default: false)
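
The parameter table can be turned into a small request builder. The sketch below assumes a runtime with global `FormData` and `Blob` (Node.js 18+ or a browser). `buildExtractTextForm` is a hypothetical helper name, and reading the 50MB limit as 50 * 1024 * 1024 bytes is an assumption, not something this page states.

```javascript
// Hypothetical helper: builds the multipart body from the documented parameters.
// Assumption: "50MB" means 50 * 1024 * 1024 bytes; the API may count differently.
const MAX_FILE_BYTES = 50 * 1024 * 1024;
const OUTPUT_FORMATS = ["txt", "json", "xml", "html"];

function buildExtractTextForm(file, filename, options = {}) {
  // Defaults mirror the "default" values in the parameter table.
  const { language = "eng", outputFormat = "txt", enhanceImages = false } = options;

  if (file.size > MAX_FILE_BYTES) {
    throw new RangeError(`File is ${file.size} bytes; the documented limit is 50MB`);
  }
  if (!OUTPUT_FORMATS.includes(outputFormat)) {
    throw new TypeError(`Unsupported outputFormat: ${outputFormat}`);
  }

  const form = new FormData();
  form.append("file", file, filename);
  form.append("language", language);
  form.append("outputFormat", outputFormat);
  // Multipart fields are strings, so the boolean is sent as "true" or "false".
  form.append("enhanceImages", String(enhanceImages));
  return form;
}
```

Validating on the client avoids uploading a large file only to have the request rejected.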

Example Request

Extract text from a scanned PDF using cURL:

curl -X POST https://api.mega-pdf.com/api/pdf/extract-text \
  -H "x-api-key: your-api-key" \
  -F "file=@/path/to/scanned-document.pdf" \
  -F "language=eng" \
  -F "outputFormat=txt" \
  -F "enhanceImages=true"

Response Format

Successful responses return metadata about the extraction, a `fileUrl` for downloading the extracted text, and a short `previewText`:

{
  "success": true,
  "message": "Text extracted successfully",
  "fileUrl": "/api/file?folder=extracted&filename=uuid-extracted.txt",
  "filename": "uuid-extracted.txt",
  "originalName": "scanned-document.pdf",
  "pageCount": 5,
  "characterCount": 15230,
  "detectedLanguage": "English",
  "previewText": "This is a preview of the extracted text content...",
  "billing": {
    "usedFreeOperation": true,
    "freeOperationsRemaining": 9,
    "currentBalance": 10.50,
    "operationCost": 0.00
  }
}
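
Because `fileUrl` in the sample is a relative path, a client has to resolve it before downloading the file. A minimal sketch, assuming the path is relative to the API host `https://api.mega-pdf.com` (this page does not say so explicitly) and using a hypothetical `summarizeExtraction` helper:

```javascript
// Assumption: fileUrl is relative to the same host as the extract-text endpoint.
const API_BASE = "https://api.mega-pdf.com";

// Hypothetical helper: picks the fields a caller typically needs from a success response.
function summarizeExtraction(body) {
  return {
    // The URL constructor resolves a relative path against a base URL.
    downloadUrl: new URL(body.fileUrl, API_BASE).href,
    pages: body.pageCount,
    characters: body.characterCount,
    // Billing fields are copied unchanged from the response.
    cost: body.billing.operationCost,
    freeOperationsLeft: body.billing.freeOperationsRemaining,
  };
}
```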

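When `outputFormat` is 'json', the response also includes a `data` object with the text of each page and its text blocks (the `//` comments in the sample mark elided entries and are not part of the actual response):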

{
  "success": true,
  "message": "Text extracted successfully",
  "fileUrl": "/api/file?folder=extracted&filename=uuid-extracted.json",
  "filename": "uuid-extracted.json",
  "originalName": "scanned-document.pdf",
  "pageCount": 5,
  "data": {
    "pages": [
      {
        "pageNumber": 1,
        "text": "Content of page 1...",
        "blocks": [
          {
            "text": "Block of text",
            "bbox": [100, 200, 300, 250],
            "confidence": 0.95
          },
          // more text blocks...
        ]
      },
      // more pages...
    ]
  },
  "billing": {
    "usedFreeOperation": true,
    "freeOperationsRemaining": 9,
    "currentBalance": 10.50,
    "operationCost": 0.00
  }
}
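
The per-block `confidence` values make it possible to flag text that may need manual review. The sketch below assumes `confidence` is a score between 0 and 1, as the sample value 0.95 suggests. `lowConfidenceBlocks`, `fullText`, and the 0.8 threshold are illustrative choices, not part of the API.

```javascript
// Hypothetical helper: lists blocks whose OCR confidence falls below a threshold.
function lowConfidenceBlocks(data, threshold = 0.8) {
  const flagged = [];
  for (const page of data.pages) {
    // Tolerate pages that carry no blocks array.
    for (const block of page.blocks ?? []) {
      if (block.confidence < threshold) {
        flagged.push({
          pageNumber: page.pageNumber,
          text: block.text,
          confidence: block.confidence,
        });
      }
    }
  }
  return flagged;
}

// Hypothetical helper: joins the page-level text in page order, separated by blank lines.
function fullText(data) {
  return data.pages.map((page) => page.text).join("\n\n");
}
```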

Error responses set `success` to false and describe the failure in `error`:

{
  "success": false,
  "error": "Failed to extract text: The document appears to be password protected"
}
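
A client can convert the `success` flag into an exception so that failures are not silently ignored. A minimal sketch: `ExtractionError` and `unwrapResult` are hypothetical names, and the fallback message covers the case where `error` is absent, which this page does not describe.

```javascript
// Hypothetical error type carrying the API's error string.
class ExtractionError extends Error {
  constructor(message) {
    super(message);
    this.name = "ExtractionError";
  }
}

// Returns the body unchanged on success and throws when success is not true.
function unwrapResult(body) {
  if (body.success !== true) {
    throw new ExtractionError(body.error ?? "Text extraction failed (no error message returned)");
  }
  return body;
}
```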

Code Examples

Using the Text Extraction API with JavaScript:
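
A minimal sketch, assuming Node.js 18 or later (global `fetch`, `FormData`, and `Blob`). The function name `extractText`, the injectable `fetchImpl` parameter, and the `MEGAPDF_API_KEY` environment variable are illustrative assumptions; only the endpoint, the `x-api-key` header, and the form fields come from this page.

```javascript
const ENDPOINT = "https://api.mega-pdf.com/api/pdf/extract-text";

// Hypothetical client function. fetchImpl is injectable so the call can be stubbed in tests.
async function extractText(file, filename, apiKey, options = {}, fetchImpl = fetch) {
  const form = new FormData();
  form.append("file", file, filename);
  form.append("language", options.language ?? "eng");
  form.append("outputFormat", options.outputFormat ?? "txt");
  form.append("enhanceImages", String(options.enhanceImages ?? false));

  // Content-Type is not set manually: fetch adds the multipart boundary itself.
  const response = await fetchImpl(ENDPOINT, {
    method: "POST",
    headers: { "x-api-key": apiKey },
    body: form,
  });

  const body = await response.json();
  if (!body.success) {
    throw new Error(body.error ?? `Request failed with HTTP status ${response.status}`);
  }
  return body;
}

// Example usage inside an async function (placeholder path, assumed variable name):
// const { readFile } = require("node:fs/promises");
// const pdf = new Blob([await readFile("/path/to/scanned-document.pdf")]);
// const result = await extractText(pdf, "scanned-document.pdf", process.env.MEGAPDF_API_KEY);
// console.log(result.previewText);
```

Reading the key from the environment keeps it out of source control, and passing a stub as `fetchImpl` lets the function be exercised without network access.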