Elsevier Extract

build_image_map(soup: BeautifulSoup) → dict[str, str]

Build {ref_key: url} from the <objects> section.

Prefers category='high'; falls back to any other category.
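The preference rule can be illustrated with a minimal sketch. It is not the module's implementation: it swaps in the standard library's xml.etree.ElementTree for BeautifulSoup so that it runs without extra packages, and the function name build_image_map_sketch, the sample XML, the example.org URLs, and the assumption that each <object> carries ref and category attributes with the URL as its text are all illustrative.

```python
import xml.etree.ElementTree as ET


def build_image_map_sketch(xml_text: str) -> dict[str, str]:
    """Return {ref_key: url}, preferring category='high' entries.

    Assumption: every <object> has 'ref' and 'category' attributes and
    carries its URL as element text.
    """
    root = ET.fromstring(xml_text.strip())
    image_map: dict[str, str] = {}
    has_high: set[str] = set()  # refs already resolved to a 'high' URL

    for obj in root.iter("object"):
        ref = obj.get("ref")
        url = (obj.text or "").strip()
        if not ref or not url:
            continue
        if obj.get("category") == "high":
            # A 'high' entry replaces any fallback stored earlier.
            if ref not in has_high:
                image_map[ref] = url
                has_high.add(ref)
        elif ref not in image_map:
            # Any other category is only a fallback.
            image_map[ref] = url
    return image_map


# Hypothetical sample data, not taken from a real article.
SAMPLE = """
<objects>
  <object ref="gr1" category="thumbnail">https://example.org/gr1.sml</object>
  <object ref="gr1" category="high">https://example.org/gr1_lrg.jpg</object>
  <object ref="gr2" category="standard">https://example.org/gr2.jpg</object>
</objects>
"""

image_map = build_image_map_sketch(SAMPLE)
```

In this sample, gr1 resolves to the 'high' URL even though a thumbnail entry precedes it, while gr2 keeps its only available category.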

extract_all(xml_dir: str | Path, output_csv: str | Path | None = None, verbose: bool = True) → tuple[DataFrame, list[tuple[str, str]]]

Process every XML file in xml_dir and return a figures DataFrame.

Parameters:

xml_dir – Directory containing Elsevier .xml files.
output_csv – If provided, write the DataFrame here.
verbose – Print per-file progress.

Returns:

df (pd.DataFrame) – One row per figure.
errors (list of (filename, error_message)) – One entry per file that could not be processed.
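The return shape suggests that a failure in one file is recorded and does not abort the whole run. The sketch below shows that pattern using only the standard library. The names collect_rows and fake_process are hypothetical, the per-file processor is passed in as a parameter, and rows are kept as a plain list of dicts where the documented function returns a DataFrame.

```python
import tempfile
from pathlib import Path
from typing import Callable


def collect_rows(xml_dir, process: Callable[[Path], list[dict]]):
    """Run `process` on every .xml file, collecting rows and errors."""
    rows: list[dict] = []
    errors: list[tuple[str, str]] = []
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        try:
            rows.extend(process(xml_path))
        except Exception as exc:
            # Record the failure and continue with the next file.
            errors.append((xml_path.name, str(exc)))
    return rows, errors


def fake_process(path: Path) -> list[dict]:
    """Hypothetical stand-in for a per-file extractor."""
    if path.stat().st_size == 0:
        raise ValueError("empty file")
    return [{"file": path.name, "fig_id": "fig1"}]


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.xml").write_text("<doc/>")
    Path(tmp, "b.xml").write_text("")  # triggers the error path
    rows, errors = collect_rows(tmp, fake_process)
```

If pandas is available, pd.DataFrame(rows) turns the collected rows into a table with one row per figure.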

extract_figures(soup: BeautifulSoup, image_map: dict[str, str]) → dict[str, dict]

Return {fig_id: {caption, image_url, fig_num}} for all figures.

extract_reference_sentences(soup: BeautifulSoup, figures: dict[str, dict]) → dict[str, list[str]]

Find body sentences that cite each figure (structured + regex modes).
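A rough idea of what a regex mode might look like is sketched below. The pattern, the naive sentence splitter, the function name find_reference_sentences, and the choice to key results by figure number are all assumptions made for illustration. The module's actual patterns and keys are not documented here.

```python
import re

# Assumed pattern: matches "Fig. 1", "Figure 2", "Figs. 3", "fig 4".
FIG_MENTION = re.compile(r"\bFig(?:ure)?s?\.?\s*(\d+)", re.IGNORECASE)

# Naive splitter: break after ., ! or ? when a capital letter follows.
# "Fig. 1" stays intact because a digit, not a capital, follows the dot.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def find_reference_sentences(text: str, fig_nums: list[str]) -> dict[str, list[str]]:
    """Map each figure number to the sentences that mention it."""
    refs: dict[str, list[str]] = {num: [] for num in fig_nums}
    for sentence in SENTENCE_BREAK.split(text.strip()):
        # set() avoids adding a sentence twice for a repeated mention.
        for num in set(FIG_MENTION.findall(sentence)):
            if num in refs:
                refs[num].append(sentence)
    return refs


body = (
    "The setup is shown in Fig. 1. "
    "Results improve with depth (Figure 2). "
    "No figure is cited here. "
    "Figure 1 also lists the parameters."
)
refs = find_reference_sentences(body, ["1", "2"])
```

Only the first number after each mention is captured, so a range such as "Figs. 1 and 2" would be credited to figure 1 alone in this sketch.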

get_merged_paragraphs(soup: BeautifulSoup) → list[tuple]

Merge sibling paragraphs split by <ce:float-anchor>.

process_file(xml_path: str | Path) → list[dict]

Extract all figure rows from a single Elsevier XML file.

resolve_image_url(fig_tag, image_map: dict[str, str]) → str | None

Resolve the image URL for a <ce:figure> element (3 strategies).