Extract, edit, and update text content in PDF documents while preserving layout and images.
The Extract Content API returns the text blocks and images in a PDF document, each with its position on the page. With this data you can build an editor that keeps the original document layout and changes only the targeted text.
POST https://api.mega-pdf.com/api/pdf/extract-text
Authenticate requests using an API key in the x-api-key header.
// Header example
x-api-key: your-api-key
The API accepts multipart/form-data requests with the following parameters:
Parameter | Type | Description | Required |
---|---|---|---|
file | File | PDF file to extract content from (max 50MB) | Yes |
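Because the endpoint rejects files over the size limit, it can be worth checking a file locally before uploading it. The following is a minimal sketch, not part of the API: `checkPdfUpload` and `MAX_PDF_BYTES` are hypothetical names, and it assumes that "50MB" means 50 * 1024 * 1024 bytes and that the file starts with the usual `%PDF-` signature.

```javascript
// Assumption: "50MB" means 50 * 1024 * 1024 bytes; the docs do not say which definition applies.
const MAX_PDF_BYTES = 50 * 1024 * 1024;

// Hypothetical helper: takes the file contents as a Uint8Array (or Node.js Buffer)
// and returns an error message, or null if the file looks uploadable.
function checkPdfUpload(bytes) {
  if (bytes.length === 0) return 'File is empty';
  if (bytes.length > MAX_PDF_BYTES) return 'File exceeds the 50MB limit';
  // PDF files normally start with the ASCII signature "%PDF-"
  const signature = String.fromCharCode(...bytes.subarray(0, 5));
  if (signature !== '%PDF-') return 'File does not look like a PDF';
  return null;
}
```

Running this check first avoids spending an upload on a file the server would refuse anyway.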
Extract content from a PDF using cURL:
curl -X POST https://api.mega-pdf.com/api/pdf/extract-text \
-H "x-api-key: your-api-key" \
-F "file=@/path/to/document.pdf"
Successful responses include detailed text and image content with positioning:
{
  "success": true,
  "message": "Content extracted successfully from 3 pages with 125 text blocks and 4 images",
  "extractedData": {
    "pages": [
      {
        "page_number": 1,
        "width": 612,
        "height": 792,
        "texts": [
          {
            "text": "Sample document title",
            "x0": 100.5,
            "y0": 50.2,
            "x1": 400.8,
            "y1": 75.3,
            "font": "Helvetica-Bold",
            "size": 18.0,
            "color": 0
          },
          // More text blocks...
        ],
        "images": [
          {
            "x0": 50.0,
            "y0": 100.0,
            "x1": 250.0,
            "y1": 300.0,
            "width": 200.0,
            "height": 200.0,
            "image_data": "base64-encoded-image-data...",
            "format": "jpeg",
            "image_id": "session_id_page1_img0"
          },
          // More images...
        ]
      },
      // More pages...
    ],
    "metadata": {
      "total_pages": 3,
      "total_text_blocks": 125,
      "total_images": 4,
      "extraction_method": "PyMuPDF Enhanced with Images"
    }
  },
  "sessionId": "unique-session-identifier",
  "originalName": "document.pdf",
  "billing": {
    "usedFreeOperation": true,
    "freeOperationsRemaining": 9,
    "currentBalance": 10.50,
    "operationCost": 0.00
  }
}
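Text blocks are not guaranteed to arrive in reading order, so a common first step is to sort them by position. The sketch below uses a hypothetical helper, `pageToPlainText`, and assumes that y grows downward from the top of the page, which the top-left meaning of (x0, y0) suggests but the response does not state outright.

```javascript
// Hypothetical helper: joins a page's text blocks in top-to-bottom, left-to-right order.
// Assumption: y grows downward from the top of the page.
function pageToPlainText(page) {
  return [...page.texts] // copy so the original array is not reordered
    .sort((a, b) => a.y0 - b.y0 || a.x0 - b.x0)
    .map(block => block.text)
    .join('\n');
}

// Trimmed-down sample in the shape of one entry of extractedData.pages
const samplePage = {
  page_number: 1,
  texts: [
    { text: 'Body paragraph', x0: 100.5, y0: 120.0 },
    { text: 'Sample document title', x0: 100.5, y0: 50.2 },
  ],
};
console.log(pageToPlainText(samplePage));
```

Sorting on y0 alone is a simplification: blocks on the same visual line can differ slightly in y0, so a real editor may want to group lines with a small tolerance.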
Error responses:
{
  "success": false,
  "error": "No content found in the PDF. The PDF may be empty or password protected."
}
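Since both outcomes share the `success` flag, the check can live in one place. This is a minimal sketch with a hypothetical helper, `unwrapExtractResponse`, that relies only on the `success`, `error`, and `extractedData` fields shown above; the fallback message is an assumption for responses that carry no `error` field.

```javascript
// Hypothetical helper: returns extractedData on success and throws
// with the API's error message otherwise.
function unwrapExtractResponse(data) {
  if (!data || data.success !== true) {
    // Fallback text is an assumption for responses without an "error" field
    throw new Error(data?.error ?? 'Extraction failed without an error message');
  }
  return data.extractedData;
}
```

Callers can then wrap the call in a single try/catch instead of testing `data.success` at every use.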
The response includes detailed information about each text block and image.
Text block properties:
Property | Type | Description |
---|---|---|
text | String | The actual text content |
x0, y0 | Float | Top-left corner coordinates in points |
x1, y1 | Float | Bottom-right corner coordinates in points |
font | String | Font family name |
size | Float | Font size in points |
color | Integer | RGB color value packed into a single integer (0 is black) |
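To display a text block in an editor you typically need its color as a CSS value and its box size. The sketch below is illustrative only: `colorToHex` and `blockSize` are hypothetical helpers, and the 0xRRGGBB channel packing is an assumption, since the table says only that the color is an integer.

```javascript
// Assumption: the integer packs the channels as 0xRRGGBB; the docs do not confirm this.
function colorToHex(color) {
  return '#' + (color & 0xffffff).toString(16).padStart(6, '0');
}

// Width and height of a text block's bounding box, in the same units as its coordinates.
function blockSize(block) {
  return { width: block.x1 - block.x0, height: block.y1 - block.y0 };
}
```

Under that packing assumption, `colorToHex(0)` gives `#000000`, which matches the black title in the sample response.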
Image properties:
Property | Type | Description |
---|---|---|
x0, y0 | Float | Top-left corner coordinates in points |
x1, y1 | Float | Bottom-right corner coordinates in points |
width, height | Float | Image dimensions in points |
image_data | String | Base64-encoded image data |
format | String | Image format (jpeg, png, etc.) |
image_id | String | Unique identifier for the image |
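An image entry can be turned back into a file by decoding its `image_data`. This is a minimal Node.js sketch; `decodeImage` is a hypothetical helper, and using `format` directly as the file extension is an assumption.

```javascript
// Hypothetical helper: turns one entry of page.images into a filename and raw bytes.
function decodeImage(image) {
  // Buffer is a Node.js global; a browser would need atob() or similar instead.
  const bytes = Buffer.from(image.image_data, 'base64');
  // Assumption: "format" is usable as a file extension (e.g. "jpeg", "png").
  return { filename: `${image.image_id}.${image.format}`, bytes };
}
```

The returned `bytes` can be passed to `fs.writeFileSync(filename, bytes)` to save the image to disk.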
Using the Extract Content API with JavaScript:
// Requires Node.js 18+ for the global fetch, FormData, and Blob
const fs = require('node:fs');

const formData = new FormData();
// FormData expects a Blob rather than a read stream
formData.append(
  'file',
  new Blob([fs.readFileSync('document.pdf')], { type: 'application/pdf' }),
  'document.pdf'
);

fetch('https://api.mega-pdf.com/api/pdf/extract-text', {
  method: 'POST',
  headers: {
    'x-api-key': 'your-api-key'
  },
  body: formData
})
  .then(response => response.json())
  .then(data => {
    if (data.success) {
      console.log('Content extracted successfully');
      console.log('Total pages:', data.extractedData.metadata.total_pages);
      console.log('Total text blocks:', data.extractedData.metadata.total_text_blocks);
      console.log('Total images:', data.extractedData.metadata.total_images);
      // Store the session ID for later use when saving edits
      const sessionId = data.sessionId;
      // Process the extracted data
      data.extractedData.pages.forEach(page => {
        console.log(`Page ${page.page_number} has ${page.texts.length} text blocks and ${page.images?.length || 0} images`);
        // Access text blocks for editing
        page.texts.forEach(textBlock => {
          console.log(`Text: "${textBlock.text.substring(0, 50)}..."`);
          console.log(`Position: (${textBlock.x0}, ${textBlock.y0}) to (${textBlock.x1}, ${textBlock.y1})`);
          console.log(`Font: ${textBlock.font} at ${textBlock.size}pt`);
        });
      });
    } else {
      console.error('Failed to extract content:', data.error);
    }
  })
  .catch(error => console.error('Error:', error));
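For targeted edits you usually need to locate the blocks that contain a given phrase. The sketch below shows one way to do that over `extractedData`; `findTextBlocks` is a hypothetical helper, and the shape of its return value is an illustrative choice rather than anything the API defines.

```javascript
// Hypothetical helper: lists every text block whose text contains the query (case-insensitive).
function findTextBlocks(extractedData, query) {
  const needle = query.toLowerCase();
  const matches = [];
  for (const page of extractedData.pages) {
    page.texts.forEach((block, index) => {
      if (block.text.toLowerCase().includes(needle)) {
        // Keep the page number and block index so the block can be located again later
        matches.push({ page: page.page_number, index, block });
      }
    });
  }
  return matches;
}
```

For example, `findTextBlocks(data.extractedData, 'title')` returns one entry for the "Sample document title" block in the sample response.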