Springer Extract

extract_all(xml_dir: str | Path, output_csv: str | Path | None = None, verbose: bool = True) → tuple[DataFrame, list[tuple[str, str]]]

Process every XML file in xml_dir and return a figures DataFrame.

Parameters:
  • xml_dir – Directory containing Springer .xml files.

  • output_csv – If provided, write the DataFrame here.

  • verbose – Print per-file progress and summary.

Returns:

  • df (pd.DataFrame) – One row per figure.

  • errors (list of (filename, error_message)) – One entry per file that failed to process.
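The return contract above (rows plus a list of per-file errors) can be illustrated with a minimal sketch. The names collect_rows_sketch and count_figs are hypothetical and are not part of this module; a plain list of dicts stands in for the DataFrame so the example needs only the standard library, and the real extract_all may handle errors differently.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def collect_rows_sketch(xml_dir, process_one):
    """Hypothetical stand-in for the extract_all loop: gather rows and per-file errors."""
    rows, errors = [], []
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        try:
            rows.extend(process_one(xml_path))
        except Exception as exc:  # record the failure and continue with the next file
            errors.append((xml_path.name, str(exc)))
    return rows, errors


def count_figs(xml_path):
    """Toy per-file processor (assumption): one row per <fig> element."""
    root = ET.parse(xml_path).getroot()
    return [{"file": Path(xml_path).name, "fig_id": fig.get("id")} for fig in root.iter("fig")]


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.xml").write_text('<article><fig id="Fig1"/><fig id="Fig2"/></article>')
    Path(tmp, "bad.xml").write_text("<article><fig")  # malformed on purpose
    rows, errors = collect_rows_sketch(tmp, count_figs)
```

Here the malformed file ends up in errors while the well-formed file still contributes its rows, which is the behavior the (df, errors) return pair suggests.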

extract_figures(root) → dict[str, dict]

Return a dict mapping each figure ID to its metadata: {fig_id: {caption, image_url, fig_num, is_graphical_abstract}}.
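A minimal sketch of how such a mapping could be built, assuming JATS-style <fig> elements with <label> and <caption> children and using the standard-library ElementTree parser (the module itself may use a different parser). The function extract_figures_sketch and the sample XML are hypothetical; image_url and is_graphical_abstract are omitted because the source does not say how they are derived.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample document for illustration only.
SAMPLE = """<article>
  <body>
    <fig id="Fig1">
      <label>Fig. 1</label>
      <caption><p>Overview of the <italic>pipeline</italic>.</p></caption>
    </fig>
    <fig id="Fig2">
      <label>Fig. 2</label>
      <caption><p>Results.</p></caption>
    </fig>
  </body>
</article>"""


def extract_figures_sketch(root):
    """Hypothetical: map each <fig> id to its caption text and figure number."""
    figures = {}
    for fig in root.iter("fig"):
        caption_el = fig.find("caption")
        # Join all nested text (including inline markup) and normalize whitespace.
        caption = "" if caption_el is None else " ".join("".join(caption_el.itertext()).split())
        # Take the first run of digits in the label as the figure number.
        match = re.search(r"\d+", fig.findtext("label", default=""))
        figures[fig.get("id")] = {
            "caption": caption,
            "fig_num": int(match.group()) if match else None,
        }
    return figures


figures = extract_figures_sketch(ET.fromstring(SAMPLE))
```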

extract_reference_sentences(root, figures: dict[str, dict]) → dict[str, list[str]]

Find body sentences that cite each figure.
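One way this lookup might work, sketched under the assumption that figure citations appear as JATS <xref ref-type="fig" rid="..."> elements inside body paragraphs. The names flatten and reference_sentences_sketch are hypothetical, the sentence splitter is deliberately naive, and the actual implementation may differ.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample document for illustration only.
SAMPLE = """<article><body>
<p>Cells were imaged daily. Growth slowed after day 3 (<xref ref-type="fig" rid="Fig1">Fig. 1</xref>). No change was seen in controls.</p>
<p>As <xref ref-type="fig" rid="Fig2">Fig. 2</xref> shows, the effect is dose dependent.</p>
</body></article>"""

MARK = "\ue000"  # private-use sentinel, assumed absent from the article text


def flatten(elem):
    """Flatten an element to text, tagging each figure xref with MARK + rid + MARK."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "xref" and child.get("ref-type") == "fig":
            parts.append(f"{MARK}{child.get('rid')}{MARK}")
        parts.append(flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)


def reference_sentences_sketch(root, figures):
    """Hypothetical: map each known figure id to the sentences that cite it."""
    refs = {fig_id: [] for fig_id in figures}
    # A fuller version would skip <p> elements that sit inside <fig> captions.
    for para in root.iter("p"):
        # Naive split: sentence punctuation, whitespace, then a capital letter.
        for sentence in re.split(r"(?<=[.!?])\s+(?=[A-Z])", flatten(para)):
            rids = re.findall(f"{MARK}(.*?){MARK}", sentence)
            clean = " ".join(re.sub(f"{MARK}.*?{MARK}", "", sentence).split())
            for rid in dict.fromkeys(rids):  # de-duplicate, keep order
                if rid in refs:
                    refs[rid].append(clean)
    return refs


refs = reference_sentences_sketch(ET.fromstring(SAMPLE), {"Fig1": {}, "Fig2": {}})
```

Requiring a capital letter after the sentence break keeps "Fig. 1" from being split in two, though abbreviations followed by a capitalized word would still confuse this splitter.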

process_file(xml_path: str | Path) → list[dict]

Extract all figure rows from a single Springer JATS XML file.

resolve_image_url(fig_tag) → str

Resolve image location from JATS <graphic xlink:href="...">, etc.
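As a rough illustration of the <graphic xlink:href="..."> case only, here is a sketch using the standard-library ElementTree parser, where the xlink:href attribute must be looked up by its namespace-qualified name. The function resolve_image_url_sketch and the sample path are hypothetical; the fallbacks implied by "etc." are not covered, and the real function may use a different parser.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes as "{namespace-uri}local-name".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Made-up sample element; the href value is illustrative only.
SAMPLE = """<fig id="Fig1" xmlns:xlink="http://www.w3.org/1999/xlink">
  <label>Fig. 1</label>
  <graphic xlink:href="MediaObjects/example_Fig1_HTML.png"/>
</fig>"""


def resolve_image_url_sketch(fig_tag):
    """Hypothetical: return the first <graphic> xlink:href under a <fig>, or ""."""
    for graphic in fig_tag.iter("graphic"):
        href = graphic.get(XLINK_HREF)
        if href:
            return href
    return ""


url = resolve_image_url_sketch(ET.fromstring(SAMPLE))
missing = resolve_image_url_sketch(ET.fromstring("<fig/>"))
```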