PDF to Markdown

Xijiang Yu

February 4, 2026

We may have some old documents that exist only on paper. The following procedure converts them into Markdown format, which can then be easily converted into easy-to-handle PDF files. If you have a graphics card like mine, a GeForce RTX 3060 Ti with 12 GiB VRAM, or better, you can follow the steps below.
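To check whether a machine meets the VRAM figure mentioned above, a minimal sketch along these lines may help. It assumes an NVIDIA card with the `nvidia-smi` tool installed. The 12288 MiB threshold is simply the 12 GiB figure from this note, not a documented requirement of marker, and the names `MIN_VRAM_MIB` and `has_enough_vram` are made up for illustration.

```shell
#!/bin/sh
# Threshold taken from the 12 GiB figure in the text (12 * 1024 MiB).
# This is an assumption based on the author's setup, not a marker requirement.
MIN_VRAM_MIB=12288

# Succeeds when the given VRAM size (in MiB) meets the threshold.
has_enough_vram() {
  [ "$1" -ge "$MIN_VRAM_MIB" ]
}

# Query each GPU only when the NVIDIA driver tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits |
  while read -r mib; do
    if has_enough_vram "$mib"; then
      echo "GPU with ${mib} MiB: OK"
    else
      echo "GPU with ${mib} MiB: below ${MIN_VRAM_MIB} MiB"
    fi
  done
fi
```

On a machine without `nvidia-smi` the script prints nothing, so it is safe to run anywhere.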

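The first step is to scan the papers into a PDF file. Then install the required packages, set up a Python project with uv, and run marker_single on the scanned file: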

sudo dnf install python3.11-devel \
    libtiff-devel \
    libjpeg-devel \
    openjpeg2-devel \
    zlib-devel \
    freetype-devel \
    lcms2-devel \
    libwebp-devel \
    tcl-devel \
    tk-devel \
    harfbuzz-devel \
    fribidi-devel \
    libraqm-devel \
    libimagequant-devel \
    libxcb-devel \
    tesseract \
    tesseract-langpack-eng \
    ghostscript
# make a folder somewhere
mkdir ocr
cd ocr
uv init --python 3.11
uv add marker-pdf
uv run marker_single your_scanned_file.pdf \
    --output_dir output/ \
    --force_ocr \
    --layout_batch_size 4 \
    --detection_batch_size 4 \
    --recognition_batch_size 4 \
    --equation_batch_size 4

The marker_single tool generates a Markdown file, a metadata file, and some figures in the output folder. The Markdown file may contain some $\LaTeX$ errors. I use Gemini CLI to repeatedly compile the file and correct the errors.
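Before any compile-and-fix round, a cheap first pass can flag lines whose `$` math delimiters do not pair up, which is one common kind of error in OCR-generated math. The sketch below is only a heuristic and not part of marker or Gemini CLI. The function name `find_unbalanced_dollars` is hypothetical, and the script assumes the Markdown file sits somewhere under the `output` folder used in the command above. Inline math that legitimately spans two lines will be reported as a false positive.

```shell
#!/bin/sh
# Print "file:line: text" for every line with an odd number of unescaped '$'.
find_unbalanced_dollars() {
  awk '{
    line = $0
    gsub(/\\\$/, "", line)     # drop escaped \$ so they are not counted
    n = gsub(/\$/, "", line)   # gsub returns the number of "$" removed
    if (n % 2 == 1) print FILENAME ":" FNR ": " $0
  }' "$1"
}

# Scan every Markdown file under output/, if that folder exists.
if [ -d output ]; then
  find output -name '*.md' | while IFS= read -r md; do
    find_unbalanced_dollars "$md"
  done
fi
```

Each reported line is a candidate to inspect by hand or to pass along for correction. Lines that pass this check can still contain other $\LaTeX$ errors.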