Building a RAG Pipeline to Reduce LLM Hallucination

Posted Apr 15, 2025 Updated Jan 10, 2026

By Junyoung Yang

Problem

At the PNU x Upstage DOCUMENT AI CHALLENGE, I developed “DocDoc,” an AI service for overseas medical teams. It is a chatbot that finds relevant papers and answers questions from medical professionals.

At first, I just passed questions directly to the LLM. But problems quickly emerged.

Hallucination:

The LLM would give plausible-sounding answers that were completely wrong.
Sometimes it even cited papers that didn’t exist.
In the medical field, wrong information can be fatal.

Solution: RAG (Retrieval-Augmented Generation)

The idea is to search for relevant documents first and feed them as context before the LLM generates a response.
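To make "feed them as context" concrete, here is a minimal sketch of how retrieved passages might be placed in front of the question. The `RetrievedChunk` shape, the `buildGroundedPrompt` name, and the instruction wording are assumptions for illustration, not the prompt DocDoc actually used.

```typescript
// A retrieved passage. The field names are assumptions for this sketch.
interface RetrievedChunk {
  title: string; // title of the source paper
  text: string;  // text of the retrieved chunk
}

// Build a prompt that lists numbered context passages before the question,
// and tells the model to stay within that context.
function buildGroundedPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.title}\n${c.text}`)
    .join("\n\n");
  return [
    "Answer using only the context below. Cite passages by their [number].",
    "If the context does not contain the answer, say that you do not know.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Placeholder data only; no real paper content.
const groundedPrompt = buildGroundedPrompt("Placeholder clinical question?", [
  { title: "Example Paper A", text: "Placeholder passage from the first chunk." },
  { title: "Example Paper B", text: "Placeholder passage from the second chunk." },
]);
console.log(groundedPrompt);
```

Numbering the passages gives the model something real to cite, which is one way to discourage invented references.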

Pipeline:

1. Convert the user’s question into an embedding vector.
2. Search for similar paper chunks in Pinecone (Vector DB).
3. Pass the retrieved content as context to the LLM.
4. The LLM generates an answer based on that context.
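The similarity search at the heart of this pipeline can be sketched in a few lines. In the real service Pinecone performs this step. The in-memory array below is a stand-in so the example runs without credentials, and the vectors are hand-written toy values rather than Upstage embeddings. `IndexedChunk`, `cosineSimilarity`, and `topK` are hypothetical names.

```typescript
interface IndexedChunk {
  id: string;       // hypothetical chunk identifier
  text: string;     // chunk text that would be passed to the LLM
  vector: number[]; // embedding of the chunk (toy values here)
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0; // zero vectors match nothing
  return dot / Math.sqrt(normA * normB);
}

// Return the k chunks most similar to the question vector, best first.
function topK(query: number[], chunks: IndexedChunk[], k: number): IndexedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

const paperIndex: IndexedChunk[] = [
  { id: "paper-a#0", text: "toy chunk A", vector: [1, 0, 0] },
  { id: "paper-b#0", text: "toy chunk B", vector: [0.6, 0.8, 0] },
  { id: "paper-c#0", text: "toy chunk C", vector: [0, 0, 1] },
];
const questionVector = [1, 0.05, 0]; // stands in for the embedded question
const hits = topK(questionVector, paperIndex, 2);
console.log(hits.map((h) => h.id).join(", ")); // prints "paper-a#0, paper-b#0"
```

The text of the returned chunks is what gets passed to the LLM as context in step 3.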

Tech Stack

Vector DB: Pinecone
Embedding: Upstage Embeddings
LLM: Upstage Solar Pro 2
Framework: LangChain
Backend: Node.js, Express

UX Improvement: Server-Sent Events

LLM response generation was slow. Users staring at a blank screen might leave.

Solution: Server-Sent Events (SSE), delivered as a text/event-stream response

Stream the answer in real time as it is generated.
Characters appear one by one, as in ChatGPT.
Users can see that a response is being generated.
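As a rough illustration of the wire format, the sketch below frames each piece of the answer as an SSE message and writes it to a response-like object. The `SseSink` interface, the function names, the JSON payload shape, and the `[DONE]` end marker are assumptions for this example. In an Express handler, `res` could serve as the sink once the `Content-Type: text/event-stream` header has been set.

```typescript
// Minimal shape of a writable HTTP response. Express's `res` and Node's
// http.ServerResponse both provide these methods.
interface SseSink {
  write(chunk: string): unknown;
  end(): unknown;
}

// Frame one SSE message: every payload line gets a "data: " prefix,
// and a blank line terminates the message.
function formatSseMessage(data: string): string {
  return data.split("\n").map((line) => `data: ${line}`).join("\n") + "\n\n";
}

// Send one piece of generated text. JSON encoding keeps newlines inside
// the text from breaking the SSE framing.
function sendChunk(sink: SseSink, text: string): void {
  sink.write(formatSseMessage(JSON.stringify({ text })));
}

// Tell the client the answer is complete, then close the stream.
function finishStream(sink: SseSink): void {
  sink.write(formatSseMessage("[DONE]")); // hypothetical end-of-stream marker
  sink.end();
}

// Demo with a fake sink that records what would go over the wire.
const written: string[] = [];
const fakeSink: SseSink = {
  write: (chunk: string) => written.push(chunk),
  end: () => undefined,
};
for (const piece of ["Hel", "lo"]) {
  sendChunk(fakeSink, piece); // in a real handler, call this per model chunk
}
finishStream(fakeSink);
console.log(written.join(""));
```

On the browser side, an `EventSource` can receive these messages and append each `text` field to the page as it arrives.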

Lessons Learned

LLMs aren’t magic. In specialized domains, hallucination is a serious issue.
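With RAG, the LLM answers from retrieved documents rather than from memory alone, which cuts down on fabricated answers and citations.
Streaming does not make generation faster, but it shortens the perceived wait because users see output right away.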

These notes come from developing “DocDoc,” which advanced to the finals of the PNU x Upstage DOCUMENT AI CHALLENGE 2025.
DAIC-docdoc

Backend, AI

This post is licensed under CC BY 4.0 by the author.