Springer Fetch
matmmextract.springer.fetcher
Fetch full-text JATS XML from the Springer Nature API for a list of DOIs.
Ported from springer.py — logic unchanged, already clean.
Minor additions:
- fetch_all() callable for programmatic use (not just CLI)
- Imports shared doi_to_filename, load_set, append_line
Environment variables
SPRINGER_API_KEY – Springer API key.
SPRINGER_API_METRIC – Full-text TDM API metric (omit when using --oa).
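A minimal sketch of reading these two variables before calling into the module. The helper name `read_springer_env` is hypothetical and not part of matmmextract; it assumes only the variable names listed above.

```python
import os


def read_springer_env(environ=None):
    """Return (api_key, api_metric) from the environment.

    Hypothetical helper for illustration; not part of matmmextract.
    """
    env = os.environ if environ is None else environ
    api_key = env.get("SPRINGER_API_KEY")
    if not api_key:
        raise RuntimeError("SPRINGER_API_KEY is not set")
    # The metric is optional: it is omitted for open-access runs.
    api_metric = env.get("SPRINGER_API_METRIC") or None
    return api_key, api_metric
```

Passing a plain dict as `environ` makes the helper easy to exercise without touching the real process environment.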
- class FetchResult(success: list[str] = <factory>, failed: list[tuple[str, str]] = <factory>, output_csv: str = '')
- failed: list[tuple[str, str]]
- output_csv: str = ''
- success: list[str]
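The `<factory>` defaults in the signature suggest a dataclass whose list fields use `default_factory`. The sketch below is reconstructed from the signature alone, not copied from the module source. Reading each `failed` tuple as (DOI, error message) is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FetchResult:
    # Each instance gets its own fresh list, so results are never shared.
    success: list[str] = field(default_factory=list)
    # Assumed layout of each entry: (doi, error_message).
    failed: list[tuple[str, str]] = field(default_factory=list)
    output_csv: str = ""
```

With this shape, `FetchResult()` starts empty and a caller can inspect `len(result.success)` and `result.failed` after a run.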
- fetch_all(df: DataFrame, api_key: str, api_metric: str | None = None, use_open_access: bool = False, output_dir: str | Path = 'alloys_springer', processed_file: str | Path = 'processed_dois_springer.txt', failed_file: str | Path = 'failed_dois_springer.txt', output_csv: str | Path = 'springer_with_xml_paths.csv', max_per_run: int = 1000, sleep_min: float = 2.0, sleep_max: float = 5.0, doi_col: str = 'DOI', verbose: bool = True) → FetchResult
Fetch Springer JATS XML for every DOI in df.
- Parameters:
df – Input DataFrame with a DOI column (or override with doi_col).
api_key – Springer API key.
api_metric – Full-text TDM metric (ignored when use_open_access is True).
use_open_access – Use the Open Access JATS endpoint instead of the full-text TDM one.
output_dir – Directory to write XML files.
processed_file / failed_file – Resume-support files.
output_csv – Written on completion with an added XML_File column.
max_per_run – Stop after this many successes.
sleep_min / sleep_max – Random sleep range between requests.
doi_col – Name of the DOI column in df.
verbose – Print progress to stdout.
- Return type:
FetchResult
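The parameters describe three behaviours: skipping DOIs handled in an earlier run, stopping after max_per_run successes, and pausing a random interval between requests. The loop below is an illustrative sketch of how they could interact, not the library's implementation. The name `fetch_loop_sketch` and the injected `fetch_one` and `sleep` callables are hypothetical.

```python
import random
import time


def fetch_loop_sketch(dois, fetch_one, already_processed=(), max_per_run=1000,
                      sleep_min=2.0, sleep_max=5.0, sleep=time.sleep):
    """Illustrative loop only; not the matmmextract implementation.

    fetch_one(doi) must return (content, None) or (None, error_str).
    """
    done = set(already_processed)
    success, failed = [], []
    for doi in dois:
        if doi in done:
            continue  # resume support: skip DOIs handled earlier
        if len(success) >= max_per_run:
            break  # stop after this many successes
        content, error = fetch_one(doi)
        if error is None:
            success.append(doi)
        else:
            failed.append((doi, error))
        done.add(doi)
        # Random pause between requests.
        sleep(random.uniform(sleep_min, sleep_max))
    return success, failed
```

Injecting `sleep` keeps the sketch testable without real delays; a real run would leave the default `time.sleep` in place.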
- fetch_fulltext_xml(doi: str, api_key: str, endpoint: str = 'https://spdi.public.springernature.app/xmldata/jats', timeout: int = 60) → tuple[bytes | None, str | None]
Fetch JATS XML for doi.
- Parameters:
doi – DOI string.
api_key – Full API key string (already combined with metric if needed).
endpoint – API endpoint URL.
timeout – Request timeout in seconds.
- Returns:
(content, None) on success
(None, error_str) on failure
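Because the function reports failure through the second tuple element, callers need to unpack and branch on the result. Below is a minimal sketch of consuming that tuple. `save_fetch_result` is a hypothetical helper, and its file-naming rule is an assumption; the module's own doi_to_filename may behave differently.

```python
from pathlib import Path


def save_fetch_result(doi, result, output_dir):
    """Write XML bytes on success; return (path, error).

    Hypothetical helper for illustration; not part of matmmextract.
    """
    content, error = result
    if content is None:
        return None, error
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: replace "/" so the DOI is a valid file name.
    path = out / (doi.replace("/", "_") + ".xml")
    path.write_bytes(content)
    return path, None
```

Going by the signature above, a call might look like `save_fetch_result(doi, fetch_fulltext_xml(doi, api_key), "alloys_springer")`.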