A layout analysis pipeline, based on Fast R-CNN, that uses Detectron2 + Label Studio to extract and annotate sections such as abstract, table, figure, and references from academic PDFs

shallowManica/doc-layout-parser

doc-layout-parser

This project develops a layout parsing pipeline to extract key components (e.g., abstract, context, table, reference) from academic PDFs using a Detectron2-based model trained on annotations from Label Studio.

🔍 Purpose

The goal is to identify and segment document elements such as titles, authors, abstracts, tables, figures, and references using object detection, which improves downstream analysis and semantic classification with LLMs.

⚙️ Features

  • Fast R-CNN architecture (Detectron2) for layout detection
  • Layout categories: Abstract, Author, Context, Header, Image, Reference, Sub-title, Table, Title
  • Integration-ready with LLMs for content-based filtering or labeling
  • Configuration through config.yaml
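
The LLM hand-off mentioned in the features could look roughly like the sketch below. It assumes each detected region has already been reduced to a label and its extracted text. The `Region` class, its `text` field, the `build_llm_input` helper, and the choice of categories to keep are all hypothetical and not part of this repository.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One detected layout region (hypothetical structure, for illustration)."""
    label: str    # one of the layout categories listed above
    score: float  # detection confidence in [0, 1]
    text: str     # text extracted from the region, e.g. by OCR (assumed field)


# Assumed choice: only text-bearing categories are worth sending to an LLM.
LLM_CATEGORIES = {"Title", "Abstract", "Context"}


def build_llm_input(regions, keep=LLM_CATEGORIES):
    """Join the text of regions whose label is in `keep`, tagged by category."""
    parts = [
        f"[{r.label}] {r.text.strip()}"
        for r in regions
        if r.label in keep and r.text.strip()
    ]
    return "\n".join(parts)
```

Filtering by category before prompting keeps references, images, and other noise out of the LLM context, which is one plausible reading of "content-based filtering".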

🗃 File Structure

  • config.yaml - Detectron2 configuration for the layout model
  • result.json - Output annotations from model inference
  • parsing.ipynb - Sample notebook to run detection and visualize results

📦 Dependencies

Install via pip (the commands below are written for notebook cells; drop the leading ! to run them in a shell):

!pip install pycocotools
!pip install layoutparser
!pip install "layoutparser[effdet]"
!pip install layoutparser torchvision
!python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
!pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"
!pip install "layoutparser[paddledetection]"
!pip install "layoutparser[ocr]"

Install via Conda:

conda install detectron2 pytorch opencv omegaconf hydra-core -c conda-forge
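
After installing, a quick importability check can catch a missing package before the notebook fails halfway through. This is a minimal sketch using only the standard library. The `missing_modules` helper is a hypothetical name, and the module list assumes the usual import names for the packages above (for example, opencv is imported as `cv2`).

```python
import importlib.util

# Import names for the packages installed above (opencv is imported as cv2).
REQUIRED_MODULES = [
    "layoutparser",
    "detectron2",
    "pycocotools",
    "torch",
    "torchvision",
    "cv2",
]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```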

🚀 How to Run

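# Inside parsing.ipynb
from layoutparser.models import Detectron2LayoutModel

# Load the layout model described by config.yaml.
model = Detectron2LayoutModel(
    config_path='config.yaml',
    # Discard detections whose confidence score is below 0.8.
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    # Maps class ids to category names.
    label_map={0: "Abstract", 1: "Author"},  # ... (remaining categories elided)
)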
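
To make the effect of `MODEL.ROI_HEADS.SCORE_THRESH_TEST` concrete, here is a small, library-free sketch of the same idea applied after the fact: keep only detections scoring above a threshold, then sort the rest top to bottom. The `Detection` tuple and the `filter_and_order` helper are illustrative assumptions, not layoutparser or Detectron2 APIs, and the simple sort assumes a single-column page.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """A detected region: category, confidence, and box corners in pixels."""
    label: str
    score: float
    x1: float
    y1: float
    x2: float
    y2: float


def filter_and_order(detections, score_thresh=0.8):
    """Keep detections scoring above `score_thresh`, sorted top-to-bottom."""
    kept = [d for d in detections if d.score > score_thresh]
    # Sort by top edge, then left edge (assumes a single-column layout).
    return sorted(kept, key=lambda d: (d.y1, d.x1))


if __name__ == "__main__":
    page = [
        Detection("Table", 0.95, 50, 400, 500, 600),
        Detection("Title", 0.99, 50, 40, 500, 80),
        Detection("Image", 0.42, 50, 200, 500, 380),
    ]
    for d in filter_and_order(page):
        print(d.label, d.score)
```

A two-column paper would need a column-aware ordering step instead of the plain sort used here.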

📄 Annotation Categories

  • Abstract
  • Author
  • Context
  • Header
  • Image
  • Reference
  • Sub-title
  • Table
  • Title
