JSON Mode
In JSON mode, LlamaParse will return a data structure representing the parsed object.
Usage
For Json mode, you need to use loadJson
. The resultType
is automatically set with this method. Currently it can't be used with SimpleDirectoryReader
.
More information about indexing the results on the next page.
const reader = new LlamaParseReader();
async function main() {
// Load the file and return an array of json objects
const jsonObjs = await reader.loadJson("../data/uber_10q_march_2022.pdf");
// Access the first "pages" (=a single parsed file) object in the array
const jsonList = jsonObjs[0]["pages"];
// Further process the jsonList object as needed.
}
Output
The result format of the response, written to jsonObjs
in the example, follows this structure:
{
"pages": [
..page objects..
],
"job_metadata": {
"credits_used": int,
"credits_max": int,
"job_credits_usage": int,
"job_pages": int,
"job_is_cache_hit": boolean
},
"job_id": string ,
"file_path": string,
}
}
Page objects
Within page objects, the following keys may be present depending on your document.
page
: The page number of the document.text
: The text extracted from the page.md
: The markdown version of the extracted text.images
: Any images extracted from the page.items
: An array of heading, text and table objects in the order they appear on the page.
JSON Mode with SimpleDirectoryReader
All Readers share a loadData
method with SimpleDirectoryReader
that promises to return a uniform Document with Metadata. This makes JSON mode incompatible with SimpleDirectoryReader.
However, a simple work around is to create a new reader class that extends LlamaParseReader
and adds a new method or overrides loadData
, wrapping around JSON mode, extracting the required values, and returning a Document object.
import { LlamaParseReader, Document } from "llamaindex";
class LlamaParseReaderWithJson extends LlamaParseReader {
// Override the loadData method
override async loadData(filePath: string): Promise<Document[]> {
// Call loadJson method that was inherited by LlamaParseReader
const jsonObjs = await super.loadJson(filePath);
let documents: Document[] = [];
jsonObjs.forEach((jsonObj) => {
// Making sure it's an array before iterating over it
if (Array.isArray(jsonObj.pages)) {
}
const docs = jsonObj.pages.map(
(page: { text: string; page: number }) =>
new Document({ text: page.text, metadata: { page: page.page } }),
);
documents = documents.concat(docs);
});
return documents;
}
}
Now we have documents with page number as metadata. This new reader can be used like any other and be integrated with SimpleDirectoryReader. Since it extends LlamaParseReader
, you can use the same params.
You can assign any other values of the JSON response to the Document as needed.