Triplan · Fall 2025 · Completed

Hierarchical document classification

Researched and tested approaches to simplify Triplan's document labeling workflow: extracting text from PDFs and comparing LLM, classical ML, and encoder fine-tuning options.

Document classification · NLP · LLMs

In the fall of 2025, Triplan brought ReLU a problem: its clients have to label every document they put into storage. This labeling is confusing and time-consuming, because many labels have similar names and users must filter down through a large class tree that often has thousands of leaf nodes. ReLU was tasked with simplifying the process by using AI to reduce the number of candidate classes.
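To make the idea of "reducing the number of candidate classes" concrete, here is a minimal sketch of shortlisting leaf classes from a class tree. Everything in it is an illustrative assumption: the tree contents, the function names, and especially the keyword-overlap scorer, which is only a stand-in for whatever model supplies the scores (an LLM, a classical classifier, or a fine-tuned encoder). It is not the team's implementation.

```python
import re
from collections import Counter

# Hypothetical class tree: nested dicts, where each leaf label maps to a
# short keyword description. Real trees would have thousands of leaves.
CLASS_TREE = {
    "Finance": {
        "Invoices": {
            "Supplier invoice": "invoice supplier payment due amount",
            "Customer invoice": "invoice customer billing amount",
        },
        "Reports": {
            "Annual report": "annual report revenue fiscal year",
        },
    },
    "HR": {
        "Contracts": {
            "Employment contract": "employment contract salary employee",
        },
    },
}


def iter_leaves(tree, path=()):
    """Yield (path, description) for every leaf in the nested tree."""
    for name, child in tree.items():
        if isinstance(child, dict):
            yield from iter_leaves(child, path + (name,))
        else:
            yield path + (name,), child


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def shortlist(document_text, tree, k=3):
    """Return up to k leaf paths ranked by a placeholder relevance score.

    The score counts how often the leaf's keywords occur in the document.
    Leaves with a score of zero are dropped from the shortlist.
    """
    doc_tokens = Counter(tokenize(document_text))
    scored = []
    for path, description in iter_leaves(tree):
        score = sum(doc_tokens[t] for t in set(tokenize(description)))
        scored.append((score, path))
    # Highest score first; ties broken alphabetically by path.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for score, path in scored[:k] if score > 0]


candidates = shortlist("Invoice from supplier: payment due in 30 days", CLASS_TREE, k=2)
for path in candidates:
    print(" > ".join(path))
```

With a shortlist like this, a user would pick from a handful of full paths instead of navigating the whole tree. The ranking step stays the same if the placeholder scorer is swapped for a model that returns a score per leaf.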

Improving document classification workflow

Throughout the fall semester, Team Triplan focused on extracting text from PDFs without violating personal privacy policies. To identify the most promising option for classification, the team researched several widely used current approaches and then tested them in practice to obtain concrete results. The approaches explored included large language models, classical machine learning, and fine-tuning different types of encoders.

The spring semester focused on further testing and refinement of the method that proved to be the most promising.

Data: Customer documents
Methods: LLMs, classical machine learning, encoder fine-tuning, PDF text extraction
Handoff: approach comparison, prototype results, privacy-safe extraction pipeline