Elsevier Extract
- build_image_map(soup: BeautifulSoup) -> dict[str, str]
Build {ref_key: url} from the <objects> section. Prefers category='high'; falls back to any other category.
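The preference rule above can be sketched as follows. This is a minimal illustration, not the module's implementation: it swaps in the standard library's xml.etree.ElementTree for BeautifulSoup so it runs without extra dependencies, and the function name build_image_map_sketch, the sample XML, the example.org URLs, and the assumption that each <object> carries ref and category attributes with the URL as its text are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified <objects> section (no namespaces).
SAMPLE = """<root>
  <objects>
    <object ref="gr1" category="thumbnail">https://example.org/gr1-thumb.jpg</object>
    <object ref="gr1" category="high">https://example.org/gr1-high.jpg</object>
    <object ref="gr2" category="standard">https://example.org/gr2-std.jpg</object>
  </objects>
</root>"""


def build_image_map_sketch(xml_text: str) -> dict[str, str]:
    """Map each ref key to one URL, preferring category='high'."""
    root = ET.fromstring(xml_text)
    image_map: dict[str, str] = {}
    for obj in root.findall("./objects/object"):
        ref = obj.get("ref")
        url = (obj.text or "").strip()
        if not ref or not url:
            continue  # skip entries without a key or a URL
        if obj.get("category") == "high":
            image_map[ref] = url  # 'high' always wins, even over an earlier entry
        elif ref not in image_map:
            image_map[ref] = url  # fallback: first non-'high' entry seen


    return image_map


image_map = build_image_map_sketch(SAMPLE)
```

In this sketch, gr1 resolves to the 'high' URL and gr2 falls back to its only available category.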
- extract_all(xml_dir: str | Path, output_csv: str | Path | None = None, verbose: bool = True) -> tuple[DataFrame, list[tuple[str, str]]]
Process every XML file in xml_dir and return a figures DataFrame.
- Parameters:
xml_dir – Directory containing Elsevier .xml files.
output_csv – If provided, write the DataFrame here.
verbose – Print per-file progress.
- Returns:
df (pd.DataFrame) – One row per figure.
errors (list of (filename, error_message))
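The batch pattern described here, processing every file and collecting failures as (filename, error_message) pairs instead of aborting, might look roughly like the sketch below. It is an assumption-laden illustration: extract_all_sketch is a hypothetical name, the <figure> element and its id attribute are placeholders, and rows are returned as a plain list of dicts rather than a DataFrame to keep the example dependency-free (pd.DataFrame(rows) would convert it).

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_all_sketch(xml_dir):
    """Return (rows, errors) for every .xml file in xml_dir."""
    rows, errors = [], []
    for path in sorted(Path(xml_dir).glob("*.xml")):
        try:
            root = ET.parse(path).getroot()
            for fig in root.iter("figure"):  # placeholder element name
                rows.append({"file": path.name, "fig_id": fig.get("id")})
        except Exception as exc:
            # Record the failure and continue with the next file.
            errors.append((path.name, str(exc)))
    return rows, errors


# Demo with one well-formed and one malformed file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.xml").write_text('<doc><figure id="f1"/></doc>', encoding="utf-8")
    Path(tmp, "bad.xml").write_text("<doc>", encoding="utf-8")
    rows, errors = extract_all_sketch(tmp)

print(rows)
print([name for name, _ in errors])
```

Keeping errors as data rather than raising lets a caller inspect failed files after a long run.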
- extract_figures(soup: BeautifulSoup, image_map: dict[str, str]) -> dict[str, dict]
Return {fig_id: {caption, image_url, fig_num}} for all figures.
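To make the returned shape concrete, here is a hedged sketch that builds the same kind of mapping from a small sample. It again uses xml.etree.ElementTree in place of BeautifulSoup, and it assumes a figure layout of ce:figure containing ce:label, ce:caption, and a ce:link whose locator attribute matches a key in image_map. That layout, the name extract_figures_sketch, and the choice to store fig_num as a string are assumptions, not details confirmed by this page.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URI assumed for the ce: prefix in this sample.
NS = {"ce": "http://www.elsevier.com/xml/common/dtd"}

SAMPLE = """<doc xmlns:ce="http://www.elsevier.com/xml/common/dtd">
  <ce:figure id="f0005">
    <ce:label>Fig. 1</ce:label>
    <ce:caption><ce:simple-para>Schematic of the reactor.</ce:simple-para></ce:caption>
    <ce:link locator="gr1"/>
  </ce:figure>
</doc>"""


def extract_figures_sketch(xml_text, image_map):
    """Return {fig_id: {caption, image_url, fig_num}} for each figure."""
    root = ET.fromstring(xml_text)
    figures = {}
    for fig in root.iterfind(".//ce:figure", NS):
        label = fig.findtext("ce:label", default="", namespaces=NS)
        caption_el = fig.find("ce:caption", NS)
        # Flatten nested caption text and collapse whitespace.
        caption = ""
        if caption_el is not None:
            caption = " ".join("".join(caption_el.itertext()).split())
        link = fig.find("ce:link", NS)
        locator = link.get("locator") if link is not None else None
        num = re.search(r"\d+", label)
        figures[fig.get("id")] = {
            "caption": caption,
            "image_url": image_map.get(locator),  # None if no image is known
            "fig_num": num.group() if num else None,
        }
    return figures


figures = extract_figures_sketch(SAMPLE, {"gr1": "https://example.org/gr1.jpg"})
```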
- extract_reference_sentences(soup: BeautifulSoup, figures: dict[str, dict]) -> dict[str, list[str]]
Find body sentences that cite each figure (structured + regex modes).
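The regex mode could work along the lines of the sketch below, which operates on plain body text rather than a parsed document. The name reference_sentences_sketch, both patterns, and the naive sentence splitter are illustrative assumptions. The structured mode is not shown, and ranges such as "Figs. 3 and 4" only register the first number.

```python
import re

# Matches "Fig. 1", "Fig 1", "Figs. 2", "Figure 3", "Figures 4" (case-insensitive).
FIG_MENTION = re.compile(r"\bFig(?:ure|\.)?s?\.?\s*(\d+)", re.IGNORECASE)
# Naive splitter: break after ., ! or ? when a capital letter follows.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def reference_sentences_sketch(text, figures):
    """Return {fig_id: [sentences that mention that figure's number]}."""
    by_num = {info["fig_num"]: fig_id for fig_id, info in figures.items()}
    refs = {fig_id: [] for fig_id in figures}
    for sentence in SENTENCE_BREAK.split(text.strip()):
        # A set avoids adding the same sentence twice for repeated mentions.
        for num in sorted(set(FIG_MENTION.findall(sentence))):
            fig_id = by_num.get(num)
            if fig_id is not None:
                refs[fig_id].append(sentence)
    return refs


text = (
    "The reactor is shown in Fig. 1. Yields rose sharply (Figure 2). "
    "No figure is cited here. See Fig. 1 for details."
)
figures = {
    "f0005": {"fig_num": "1"},
    "f0010": {"fig_num": "2"},
    "f0015": {"fig_num": "3"},
}
refs = reference_sentences_sketch(text, figures)
```

Figures that are never cited keep an empty list, so every fig_id appears in the result.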
- get_merged_paragraphs(soup: BeautifulSoup) -> list[tuple]
Merge sibling paragraphs split by <ce:float-anchor>.