DocAIParser

class langchain_google_community.docai.DocAIParser(*, client: DocumentProcessorServiceClient | None = None, location: str | None = None, gcs_output_path: str | None = None, processor_name: str | None = None)

Google Cloud Document AI parser.

For a detailed explanation of Document AI, refer to the product documentation. https://cloud.google.com/document-ai/docs/overview

Initializes the parser.

Parameters:
  • client (DocumentProcessorServiceClient | None) – a DocumentProcessorServiceClient to use

  • location (str | None) – a Google Cloud location where a Document AI processor is located

  • gcs_output_path (str | None) – a path on Google Cloud Storage to store parsing results

  • processor_name (str | None) – full resource name of a Document AI processor or processor version

Provide either a client or a location. If only a location is given, a client is instantiated for it.
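As a rough illustration of the location-based path, the sketch below builds a full processor resource name and passes only a location, so that the parser creates its own client. The helper names `processor_resource_name` and `make_parser` and the placeholder project and processor values are assumptions for illustration, not part of the library. The `projects/.../locations/.../processors/...` layout follows Document AI resource naming. The import is deferred so the helpers can be defined without the package or Google Cloud credentials.

```python
from typing import Optional


def processor_resource_name(project: str, location: str, processor_id: str) -> str:
    """Build the full resource name expected by `processor_name`."""
    return f"projects/{project}/locations/{location}/processors/{processor_id}"


def make_parser(project: str, location: str, processor_id: str,
                gcs_output_path: Optional[str] = None):
    """Create a DocAIParser from a location (hypothetical helper).

    Only `location` is passed, so the parser instantiates its own client.
    """
    # Imported lazily: needs langchain-google-community and credentials.
    from langchain_google_community.docai import DocAIParser

    return DocAIParser(
        location=location,
        processor_name=processor_resource_name(project, location, processor_id),
        gcs_output_path=gcs_output_path,
    )


# Placeholder identifiers, not real resources.
name = processor_resource_name("123456789", "us", "abc123")
print(name)
```

To reuse an existing `DocumentProcessorServiceClient`, pass it as `client` instead of `location`.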

Methods

  • __init__(*[, client, location, ...]) – Initializes the parser.

  • batch_parse(blobs[, gcs_output_path, ...]) – Parses a list of blobs lazily.

  • docai_parse(blobs, *[, gcs_output_path, ...]) – Runs Google Document AI PDF Batch Processing on a list of blobs.

  • get_results(operations)

  • is_running(operations)

  • lazy_parse(blob) – Parses a blob lazily.

  • online_process(blob[, ...]) – Parses a blob lazily using online processing.

  • operations_from_names(operation_names) – Initializes Long-Running Operations from their names.

  • parse(blob) – Eagerly parse the blob into a document or documents.

  • parse_from_results(results)

__init__(*, client: DocumentProcessorServiceClient | None = None, location: str | None = None, gcs_output_path: str | None = None, processor_name: str | None = None)

Initializes the parser.

Parameters:
  • client (DocumentProcessorServiceClient | None) – a DocumentProcessorServiceClient to use

  • location (str | None) – a Google Cloud location where a Document AI processor is located

  • gcs_output_path (str | None) – a path on Google Cloud Storage to store parsing results

  • processor_name (str | None) – full resource name of a Document AI processor or processor version

Provide either a client or a location. If only a location is given, a client is instantiated for it.

batch_parse(blobs: Sequence[Blob], gcs_output_path: str | None = None, timeout_sec: int = 3600, check_in_interval_sec: int = 60) -> Iterator[Document]

Parses a list of blobs lazily.

Parameters:
  • blobs (Sequence[Blob]) – a list of blobs to parse.

  • gcs_output_path (str | None) – a path on Google Cloud Storage to store parsing results.

  • timeout_sec (int) – a timeout to wait for Document AI to complete, in seconds.

  • check_in_interval_sec (int) – the interval, in seconds, between checks on whether the parsing operations have completed.

Return type:

Iterator[Document]

This is a long-running operation. The recommended approach is to decouple parsing from creating LangChain Documents:

    >>> operations = parser.docai_parse(blobs, gcs_output_path=gcs_path)
    >>> parser.is_running(operations)

You can get the operation names and save them:

    >>> names = [op.operation.name for op in operations]

When all operations have finished, you can use their results:

    >>> operations = parser.operations_from_names(names)
    >>> results = parser.get_results(operations)
    >>> docs = parser.parse_from_results(results)
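The waiting step between starting the operations and reading their results can be sketched as a simple polling loop. This is a minimal sketch, not the library's implementation: `wait_and_collect` is a hypothetical helper, the `sleep` parameter is added only to make the loop easy to test, and `parser` is assumed to expose the `is_running`, `get_results`, and `parse_from_results` methods documented on this page.

```python
import time
from typing import Any, Callable, List


def wait_and_collect(
    parser: Any,
    operations: List[Any],
    timeout_sec: int = 3600,
    check_in_interval_sec: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Poll until the operations finish, then build documents (hypothetical helper)."""
    waited = 0
    while parser.is_running(operations):
        if waited >= timeout_sec:
            raise TimeoutError(f"operations still running after {waited} seconds")
        # Wait before checking again, mirroring check_in_interval_sec.
        sleep(check_in_interval_sec)
        waited += check_in_interval_sec
    results = parser.get_results(operations)
    return list(parser.parse_from_results(results))
```

The defaults mirror those of `batch_parse` (3600 and 60 seconds); a shorter interval gives faster feedback at the cost of more status checks.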

docai_parse(blobs: Sequence[Blob], *, gcs_output_path: str | None = None, processor_name: str | None = None, batch_size: int = 1000, enable_native_pdf_parsing: bool = True, field_mask: str | None = None) -> List[Operation]

Runs Google Document AI PDF Batch Processing on a list of blobs.

Parameters:
  • blobs (Sequence[Blob]) – a list of blobs to be parsed

  • gcs_output_path (str | None) – a path (folder) on GCS to store results

  • processor_name (str | None) – name of a Document AI processor.

  • batch_size (int) – number of documents per batch

  • enable_native_pdf_parsing (bool) – whether to enable extraction of text embedded in PDFs

  • field_mask (str | None) – a comma-separated list of fields to include in the Document AI response. Suggested value: "text,pages.pageNumber,pages.layout"

Return type:

List[Operation]

Document AI has a limit of 1000 files per batch, so larger sets of blobs must be split into multiple requests. Batch processing is an asynchronous long-running operation, and results are stored in an output GCS bucket.
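The splitting described above can be illustrated with a small chunking function. This is a sketch of the general technique, not the library's own code; `split_into_batches` is a hypothetical name, and integers stand in for blobs.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], batch_size: int = 1000) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 2500 stand-in items produce two full batches and one partial batch.
sizes = [len(batch) for batch in split_into_batches(list(range(2500)))]
print(sizes)  # [1000, 1000, 500]
```

Each batch would correspond to one batch request, and therefore to one entry in the returned list of operations.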

get_results(operations: List[Operation]) -> List[DocAIParsingResults]

Parameters:

operations (List[Operation])

Return type:

List[DocAIParsingResults]

is_running(operations: List[Operation]) -> bool

Parameters:

operations (List[Operation])

Return type:

bool

lazy_parse(blob: Blob) -> Iterator[Document]

Parses a blob lazily.

Parameters:
  • blob (Blob) – a Blob to parse

Return type:

Iterator[Document]

This is a long-running operation. The recommended approach is to batch documents together and use the batch_parse() method.

online_process(blob: Blob, enable_native_pdf_parsing: bool = True, field_mask: str | None = None, page_range: List[int] | None = None) -> Iterator[Document]

Parses a blob lazily using online processing.

Parameters:
  • blob (Blob) – a blob to parse.

  • enable_native_pdf_parsing (bool) – whether to enable extraction of text embedded in PDFs

  • field_mask (str | None) – a comma-separated list of fields to include in the Document AI response. Suggested value: "text,pages.pageNumber,pages.layout"

  • page_range (List[int] | None) – list of page numbers to parse. If None, the entire document is parsed.

Return type:

Iterator[Document]
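A call that combines the suggested field mask with a page range might look like the sketch below. `online_page_texts` is a hypothetical helper, `parser` is assumed to be a configured DocAIParser, and the returned documents are assumed to carry their text in `page_content`, as LangChain Documents do.

```python
from typing import Any, List, Optional

# The value suggested in the parameter description above.
SUGGESTED_FIELD_MASK = "text,pages.pageNumber,pages.layout"


def online_page_texts(
    parser: Any, blob: Any, page_range: Optional[List[int]] = None
) -> List[str]:
    """Return the text of each parsed document (hypothetical helper)."""
    docs = parser.online_process(
        blob,
        field_mask=SUGGESTED_FIELD_MASK,
        page_range=page_range,  # None parses the entire document
    )
    return [doc.page_content for doc in docs]
```

Restricting the field mask and the page range keeps the response small when only the text of a few pages is needed.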

operations_from_names(operation_names: List[str]) -> List[Operation]

Initializes Long-Running Operations from their names.

Parameters:

operation_names (List[str])

Return type:

List[Operation]
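Because operations can be rebuilt from their names, the names can be saved and picked up later by another process. The sketch below assumes a JSON file as the storage format; `save_operation_names` and `load_operations` are hypothetical helpers, and each operation is assumed to expose its name as `op.operation.name`, as in the batch_parse example.

```python
import json
from pathlib import Path
from typing import Any, List


def save_operation_names(operations: List[Any], path: Path) -> List[str]:
    """Write the operation names to a JSON file and return them."""
    names = [op.operation.name for op in operations]
    path.write_text(json.dumps(names))
    return names


def load_operations(parser: Any, path: Path) -> List[Any]:
    """Rebuild operations from names saved by save_operation_names."""
    names = json.loads(path.read_text())
    return parser.operations_from_names(names)
```

Once reloaded, the operations can be passed to is_running and get_results in the usual way.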

parse(blob: Blob) -> List[Document]

Eagerly parse the blob into a document or documents.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

Parameters:

blob (Blob) – Blob instance

Returns:

List of documents

Return type:

List[Document]
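The difference between the eager and lazy methods shows up when only part of the output is needed. A minimal sketch, assuming `parser` exposes the lazy_parse method documented above; `first_documents` is a hypothetical helper.

```python
from itertools import islice
from typing import Any, List


def first_documents(parser: Any, blob: Any, limit: int) -> List[Any]:
    """Consume at most `limit` documents from lazy_parse (hypothetical helper)."""
    # islice stops pulling from the iterator once `limit` items are taken,
    # whereas parse(blob) would materialize every document first.
    return list(islice(parser.lazy_parse(blob), limit))
```

With this parser the underlying Document AI processing is still a long-running operation; laziness only affects how the resulting documents are consumed.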

parse_from_results(results: List[DocAIParsingResults]) -> Iterator[Document]

Parameters:

results (List[DocAIParsingResults])

Return type:

Iterator[Document]