
How to split by character

This is the simplest method. It splits text on a given character sequence, which defaults to "\n\n", and measures chunk length in characters.

  1. How the text is split: on a single separator (a character sequence).
  2. How the chunk size is measured: by number of characters.
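
The two points above can be illustrated with a simplified sketch. This is not the library's implementation: the function name splitByCharacter is hypothetical, chunk overlap is left out, and the sketch assumes that a single piece longer than the chunk size is kept whole, because the text is only ever cut at the separator.

```typescript
// Hypothetical, simplified sketch of separator-based splitting.
// Not the CharacterTextSplitter source; chunk overlap is omitted.
function splitByCharacter(
  text: string,
  separator = "\n\n",
  chunkSize = 1000,
): string[] {
  // Step 1: cut the text at every occurrence of the separator, dropping empty pieces.
  const pieces = text.split(separator).filter((p) => p.length > 0);

  // Step 2: greedily merge neighbouring pieces while the merged
  // chunk stays within chunkSize characters.
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (current === "" || candidate.length <= chunkSize) {
      // Either the chunk is still empty (an oversized piece is kept whole)
      // or the merged text still fits.
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Size is counted in characters: "aaaa" + "\n\n" + "bbbb" is exactly 10.
console.log(splitByCharacter("aaaa\n\nbbbb\n\ncccc", "\n\n", 10));
```

With a chunk size of 10, the first two paragraphs fit together and the third starts a new chunk.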

To obtain the string content directly, use .splitText.

To create LangChain Document objects (e.g., for use in downstream tasks), use .createDocuments.

import { CharacterTextSplitter } from "@langchain/textsplitters";

// Load an example document
const stateOfTheUnion = await Deno.readTextFile(
  "../../../../examples/state_of_the_union.txt"
);

const textSplitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 1000,
  chunkOverlap: 200,
});
const texts = await textSplitter.createDocuments([stateOfTheUnion]);
console.log(texts[0]);

Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { loc: { lines: { from: 1, to: 17 } } }
}

Use .createDocuments to propagate metadata associated with each document to the output chunks:

const metadatas = [{ document: 1 }, { document: 2 }];
const documents = await textSplitter.createDocuments(
  [stateOfTheUnion, stateOfTheUnion],
  metadatas
);
console.log(documents[0]);

Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }
}
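
The pairing of metadata entries with input texts can be pictured with a small sketch. It is an illustration under stated assumptions, not the library's code: the Chunk interface and the toDocuments helper are hypothetical, the input is assumed to be already split into chunks, and the loc field shown in the output above is not computed here.

```typescript
// Hypothetical minimal shape; the real Document class is richer than this.
interface Chunk {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: chunkedTexts[i] holds the chunks produced from the
// i-th input text, and metadatas[i] is the metadata supplied for that text.
function toDocuments(
  chunkedTexts: string[][],
  metadatas: Record<string, unknown>[] = [],
): Chunk[] {
  const docs: Chunk[] = [];
  chunkedTexts.forEach((chunks, i) => {
    for (const pageContent of chunks) {
      // Copy the metadata so chunks do not share one mutable object.
      docs.push({ pageContent, metadata: { ...(metadatas[i] ?? {}) } });
    }
  });
  return docs;
}

const docs = toDocuments(
  [["first chunk", "second chunk"], ["other chunk"]],
  [{ document: 1 }, { document: 2 }],
);
console.log(docs);
```

Every chunk derived from the first text carries { document: 1 }, and every chunk from the second carries { document: 2 }.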

Use .splitText to obtain the string content directly:

(await textSplitter.splitText(stateOfTheUnion))[0];
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters
