MatMMExtract
MatMMExtract is an open-source Python library for building multimodal materials science datasets from scientific literature.
It provides an end-to-end pipeline: finding papers through OpenAlex or Scopus exports, parsing Elsevier and Springer XML, extracting figures and captions, detecting scientific figure panels, generating fine-grained captions with large language models (LLMs) such as Google Gemini and Azure OpenAI, and constructing machine-learning-ready multimodal datasets.
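To picture how the later stages hand data to one another, here is a minimal sketch in plain Python. It is not MatMMExtract's API: every class and function name below (`Figure`, `Panel`, `detect_panels`, `caption_panels`, `to_records`) is hypothetical, the detection and captioning steps are trivial stand-ins for the real detector and LLM, and the sample paper ID and caption are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Panel:
    label: str         # panel label such as "a" or "b"
    caption: str = ""  # fine-grained caption, filled in by the captioning stage


@dataclass
class Figure:
    paper_id: str      # placeholder identifier, not a real paper
    caption: str       # full figure caption as extracted from the paper
    panels: List[Panel] = field(default_factory=list)


def detect_panels(figure: Figure, labels: List[str]) -> Figure:
    """Stand-in for panel detection: attach one Panel per given label."""
    figure.panels = [Panel(label=label) for label in labels]
    return figure


def caption_panels(figure: Figure) -> Figure:
    """Stand-in for LLM captioning: prefix the figure caption with the panel label."""
    for panel in figure.panels:
        panel.caption = f"({panel.label}) {figure.caption}"
    return figure


def to_records(figures: List[Figure]) -> List[Dict[str, str]]:
    """Flatten figures into one dataset record per panel."""
    return [
        {"paper_id": fig.paper_id, "panel": panel.label, "caption": panel.caption}
        for fig in figures
        for panel in fig.panels
    ]


# Hypothetical input: one extracted figure with two panels.
fig = Figure(paper_id="paper-001", caption="XRD patterns of the samples")
fig = caption_panels(detect_panels(fig, ["a", "b"]))
records = to_records([fig])
```

Each stage takes and returns the same `Figure` object, so stages can be swapped or skipped independently. The actual entry points and data structures are described in the Getting Started and API Reference pages.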
Key Features
- OpenAlex and Scopus paper retrieval
- Elsevier and Springer XML parsing
- Figure and caption extraction
- Scientific figure panel detection
- Google Gemini and Azure OpenAI caption generation
- Dataset construction for multimodal machine learning
Contents
Documentation
- Getting Started
- Examples
- Elsevier Full Pipeline (OpenAlex → Elsevier → Detection → Azure → Dataset)
- Elsevier from Scopus Export (Scopus → Elsevier)
- Springer Full Pipeline (OpenAlex → Springer → Detection, with model checkpoint from Hugging Face Hub → Gemini → Dataset)
- Springer from Scopus Export (Scopus → Springer)
- Cleanup Intermediate Files
- API Reference