Springer Fetch

matmmextract.springer.fetcher

Fetch full-text JATS XML from the Springer Nature API for a list of DOIs.

Ported from springer.py with the logic unchanged, as it was already clean. Minor additions:
  • fetch_all() is callable for programmatic use (not just from the CLI).

  • Imports the shared helpers doi_to_filename, load_set, and append_line.

Environment variables

  • SPRINGER_API_KEY – Springer API key.

  • SPRINGER_API_METRIC – Full-text TDM API metric (omit when using --oa).
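The two variables above can be read before calling into the module. The following is a minimal sketch, not the package's own code: read_springer_env is a hypothetical helper, and skipping the metric in open-access mode is an assumption based on the "omit when using" note.

```python
import os
from typing import Optional


def read_springer_env(use_open_access: bool = False) -> tuple[str, Optional[str]]:
    """Return (api_key, api_metric) read from the environment.

    Hypothetical helper for illustration; not part of the package.
    """
    api_key = os.environ.get("SPRINGER_API_KEY")
    if not api_key:
        raise RuntimeError("SPRINGER_API_KEY is not set")

    # Assumption: the metric only matters for the full-text TDM endpoint,
    # so it is not read when open-access mode is requested.
    api_metric = None if use_open_access else os.environ.get("SPRINGER_API_METRIC")
    return api_key, api_metric
```

Failing early on a missing key keeps the error close to its cause instead of surfacing later as a failed request.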

class FetchResult(success: list[str] = <factory>, failed: list[tuple[str, str]] = <factory>, output_csv: str = '')
failed: list[tuple[str, str]]
output_csv: str = ''
success: list[str]
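The `<factory>` defaults in the signature above suggest a dataclass whose list fields use default_factory. The sketch below reconstructs it from the signature alone, so it may differ from the actual source. The DOIs are placeholders, and reading each failed entry as a (DOI, error message) pair is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FetchResult:
    # Reconstructed from the documented signature; "<factory>" is how
    # dataclasses display a default_factory such as list.
    success: list[str] = field(default_factory=list)
    failed: list[tuple[str, str]] = field(default_factory=list)
    output_csv: str = ""


result = FetchResult()
result.success.append("10.1007/placeholder-a")
# Assumption: each failed entry pairs a DOI with an error message.
result.failed.append(("10.1007/placeholder-b", "HTTP 404"))

for doi, reason in result.failed:
    print(f"{doi}: {reason}")
```

Using default_factory gives every instance its own lists, so results from separate runs never share state.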
fetch_all(df: DataFrame, api_key: str, api_metric: str | None = None, use_open_access: bool = False, output_dir: str | Path = 'alloys_springer', processed_file: str | Path = 'processed_dois_springer.txt', failed_file: str | Path = 'failed_dois_springer.txt', output_csv: str | Path = 'springer_with_xml_paths.csv', max_per_run: int = 1000, sleep_min: float = 2.0, sleep_max: float = 5.0, doi_col: str = 'DOI', verbose: bool = True) → FetchResult

Fetch Springer JATS XML for every DOI in df.

Parameters:
  • df – Input DataFrame with a DOI column (or override with doi_col).

  • api_key – Springer API key.

  • api_metric – Full-text TDM metric (ignored when use_open_access is True).

  • use_open_access – Use the Open Access JATS endpoint instead of the full-text TDM one.

  • output_dir – Directory to write XML files.

  • processed_file / failed_file – Resume-support files.

  • output_csv – Written on completion with an added XML_File column.

  • max_per_run – Stop after this many successes.

  • sleep_min / sleep_max – Random sleep range between requests.

  • doi_col – Name of the DOI column in df.

  • verbose – Print progress to stdout.

Return type:

FetchResult
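The parameters above imply a loop of roughly the following shape. This is a sketch under stated assumptions, not the actual implementation of fetch_all. fetch_loop_sketch and the injected fetch_one callable are hypothetical. The filename scheme and the choice to retry failed DOIs on the next run are also assumptions.

```python
import random
import time
from pathlib import Path
from typing import Callable, Iterable, Optional, Tuple

FetchFn = Callable[[str], Tuple[Optional[bytes], Optional[str]]]


def fetch_loop_sketch(
    dois: Iterable[str],
    fetch_one: FetchFn,
    output_dir: Path,
    processed_file: Path,
    failed_file: Path,
    max_per_run: int = 1000,
    sleep_min: float = 2.0,
    sleep_max: float = 5.0,
) -> tuple[list[str], list[tuple[str, str]]]:
    """Hypothetical fetch loop; returns (successes, failures)."""
    output_dir.mkdir(parents=True, exist_ok=True)
    # Resume support: DOIs already listed in processed_file are skipped.
    done = set()
    if processed_file.exists():
        done = {ln.strip() for ln in processed_file.read_text().splitlines() if ln.strip()}
    success: list[str] = []
    failed: list[tuple[str, str]] = []

    for doi in dois:
        if len(success) >= max_per_run:
            break  # stop after this many successes
        if doi in done:
            continue
        content, error = fetch_one(doi)
        if content is None:
            failed.append((doi, error or "unknown error"))
            with failed_file.open("a") as fh:
                fh.write(doi + "\n")
        else:
            # Assumed filename scheme; the shared doi_to_filename may differ.
            (output_dir / (doi.replace("/", "_") + ".xml")).write_bytes(content)
            success.append(doi)
            with processed_file.open("a") as fh:
                fh.write(doi + "\n")
        # Random pause between requests, drawn from [sleep_min, sleep_max].
        time.sleep(random.uniform(sleep_min, sleep_max))
    return success, failed
```

Passing the fetch function in as an argument lets the loop be exercised with a stub and zero sleep values, without touching the network.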

fetch_fulltext_xml(doi: str, api_key: str, endpoint: str = 'https://spdi.public.springernature.app/xmldata/jats', timeout: int = 60) → tuple[bytes | None, str | None]

Fetch JATS XML for doi.

Parameters:
  • doi – DOI string.

  • api_key – Full API key string (already combined with metric if needed).

  • endpoint – API endpoint URL.

  • timeout – Request timeout in seconds.

Returns:

  • (content, None) on success

  • (None, error_str) on failure
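Because the function returns a pair rather than raising, callers branch on which element is None. The sketch below shows one way to consume that pair. handle_fetch_result is a hypothetical helper, and the filename scheme is an assumption; the package's doi_to_filename may behave differently.

```python
from pathlib import Path
from typing import Optional


def handle_fetch_result(
    doi: str,
    content: Optional[bytes],
    error: Optional[str],
    output_dir: Path,
) -> Optional[Path]:
    """Write content to disk on success; report and return None on failure."""
    if content is None:
        # Failure case: (None, error_str)
        print(f"FAILED {doi}: {error}")
        return None
    # Success case: (content, None)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming; the shared doi_to_filename helper may differ.
    path = output_dir / (doi.replace("/", "_") + ".xml")
    path.write_bytes(content)
    return path
```

A caller would unpack the return value of fetch_fulltext_xml into content and error and pass both along, keeping the success and failure paths in one place.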