HTMLHeaderTextSplitter
- class langchain_text_splitters.html.HTMLHeaderTextSplitter(headers_to_split_on: List[Tuple[str, str]], return_each_element: bool = False)
Splits HTML content into documents based on specified header tags. Requires the lxml package.
Create a new HTMLHeaderTextSplitter.
- Parameters:
headers_to_split_on (List[Tuple[str, str]]): list of tuples of headers to track, each mapped to an (arbitrary) key used in the metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
return_each_element (bool): return each element with its associated headers.
Methods
__init__(headers_to_split_on[, ...]): Create a new HTMLHeaderTextSplitter.
aggregate_elements_to_chunks(elements): Combine elements with common metadata into chunks.
split_text(text): Split an HTML text string.
split_text_from_file(file): Split an HTML file.
split_text_from_url(url, **kwargs): Split HTML from a web URL.
- __init__(headers_to_split_on: List[Tuple[str, str]], return_each_element: bool = False)
Create a new HTMLHeaderTextSplitter.
- Parameters:
headers_to_split_on (List[Tuple[str, str]]): list of tuples of headers to track, each mapped to an (arbitrary) key used in the metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
return_each_element (bool): return each element with its associated headers.
- aggregate_elements_to_chunks(elements: List[ElementType]) -> List[Document]
Combine elements with common metadata into chunks.
- Parameters:
elements (List[ElementType]): HTML element content with associated identifying info and metadata
- Return type:
List[Document]
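The aggregation idea can be illustrated with a small standard-library sketch. It is not the library's implementation. Element and Chunk are hypothetical stand-ins for ElementType and Document, and the sketch assumes that only adjacent elements with identical metadata are merged, joined with a newline.

```python
from dataclasses import dataclass, field
from typing import Dict, List, TypedDict


class Element(TypedDict):
    """Hypothetical stand-in for ElementType: text plus header metadata."""
    content: str
    metadata: Dict[str, str]


@dataclass
class Chunk:
    """Hypothetical stand-in for Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def aggregate(elements: List[Element]) -> List[Chunk]:
    """Merge adjacent elements whose metadata is identical (assumed rule)."""
    chunks: List[Chunk] = []
    for element in elements:
        if chunks and chunks[-1].metadata == element["metadata"]:
            # Same headers as the previous chunk: append the text to it.
            chunks[-1].page_content += "\n" + element["content"]
        else:
            # Different headers: start a new chunk with a copy of the metadata.
            chunks.append(Chunk(element["content"], dict(element["metadata"])))
    return chunks


elements: List[Element] = [
    {"content": "Intro text.", "metadata": {"Header 1": "Guide"}},
    {"content": "More intro.", "metadata": {"Header 1": "Guide"}},
    {"content": "Setup steps.", "metadata": {"Header 1": "Guide", "Header 2": "Setup"}},
]
chunks = aggregate(elements)
```

Here the first two elements share the same header metadata and end up in one chunk, while the third starts a new chunk because it sits under an additional header.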
- split_text(text: str) -> List[Document]
Split an HTML text string.
- Parameters:
text (str): HTML text
- Return type:
List[Document]