PDF to Markdown
Xijiang Yu
We may have some old documents that exist only on paper. The following procedure converts them into Markdown format, which can in turn be easily converted to easy-to-handle PDF files. If you have a graphics card like mine, a GeForce RTX 3060 Ti with 12 GiB of VRAM, or better, you can follow the steps below.
The first step is to scan the papers into a PDF file. Then install the required development libraries, Tesseract, and Ghostscript with dnf:
sudo dnf install python3.11-devel \
libtiff-devel \
libjpeg-devel \
openjpeg2-devel \
zlib-devel \
freetype-devel \
lcms2-devel \
libwebp-devel \
tcl-devel \
tk-devel \
harfbuzz-devel \
fribidi-devel \
libraqm-devel \
libimagequant-devel \
libxcb-devel \
tesseract \
tesseract-langpack-eng \
ghostscript
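Before moving on, it may be worth confirming that the command-line tools are actually on the PATH. The sketch below is only a sanity check and not part of the original procedure. The helper name missing_tools is made up for illustration, and the check assumes uv was installed separately, since the dnf command above does not install it. Ghostscript's executable is named gs.

```shell
# Hypothetical helper: print each given command that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# tesseract and gs come from the dnf step; uv is assumed to be installed separately.
missing=$(missing_tools tesseract gs uv)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All tools found."
fi
```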
# make a folder somewhere
mkdir ocr
cd ocr
uv init --python 3.11
uv add marker-pdf
uv run marker_single your_scanned_file.pdf \
--output_dir output/ \
--force_ocr \
--layout_batch_size 4 \
--detection_batch_size 4 \
--recognition_batch_size 4 \
--equation_batch_size 4
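If there are several scanned files, the same command can be wrapped in a loop. This is a minimal sketch under stated assumptions: the folder name scans/ and the helper output_dir_for are hypothetical, and only the --output_dir and --force_ocr flags from the command above are reused. The batch-size flags can be appended in the same way.

```shell
# Hypothetical helper: map scans/report.pdf to output/report.
output_dir_for() {
    base=$(basename "$1" .pdf)
    printf 'output/%s\n' "$base"
}

# Assumes the scanned PDFs sit in a folder named scans/ (not from the original).
for pdf in scans/*.pdf; do
    # Skip the literal pattern when the glob matches nothing.
    [ -e "$pdf" ] || continue
    uv run marker_single "$pdf" \
        --output_dir "$(output_dir_for "$pdf")" \
        --force_ocr
done
```

Giving each PDF its own output folder keeps the generated figures of different documents from mixing.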
The marker_single tool generates a Markdown file, a metadata file, and
some figures in the output folder. The Markdown file may contain some
$\LaTeX$ errors. I use Gemini CLI to repeatedly compile it and correct
the errors until none remain.
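A cheap check before that correction loop is to look for display-math blocks that were never closed. The sketch below is not the Gemini workflow described above, only a rough pre-check. The function name display_math_balanced is hypothetical, it assumes every $$ delimiter sits on a line of its own, and it assumes the Markdown files end up one level below output/. Neither assumption is guaranteed for every file.

```shell
# Hypothetical helper: succeed when the file has an even number of lines
# consisting only of "$$" (so every display-math block appears to be closed).
display_math_balanced() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    count=$(grep -c '^\$\$[[:space:]]*$' "$1" || true)
    [ $((count % 2)) -eq 0 ]
}

# Assumed layout: output/<name>/<name>.md
for md in output/*/*.md; do
    # Skip the literal pattern when the glob matches nothing.
    [ -e "$md" ] || continue
    display_math_balanced "$md" || echo "Possible unclosed \$\$ block in $md"
done
```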