Elsevier Fetch

matmmextract.elsevier.fetcher

Fetch full-text XML from the Elsevier API for a list of DOIs.

Ported from elsevier.py:

- API credentials are read from environment variables (never hardcoded).
- Everything is wrapped in fetch_all() / main(), so the module is safe to import.
- Resume support via processed_dois_elsevier.txt / failed_dois_elsevier.txt.

Environment variables

ELSEVIER_API_KEY – Your Elsevier API key.
ELSEVIER_INST_TOKEN – Your institutional token.
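Reading both variables up front lets a script fail early with a clear message instead of partway through a fetch. The following is a minimal sketch, not part of the module: the helper name load_elsevier_credentials and its env parameter are hypothetical, and only the two variable names come from this page.

```python
import os


def load_elsevier_credentials(env=os.environ):
    """Return (api_key, inst_token), raising if either variable is unset.

    Hypothetical helper for illustration; not provided by matmmextract.
    """
    missing = [
        name
        for name in ("ELSEVIER_API_KEY", "ELSEVIER_INST_TOKEN")
        if not env.get(name)  # treats unset and empty values the same way
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["ELSEVIER_API_KEY"], env["ELSEVIER_INST_TOKEN"]
```

The returned pair can be passed directly as the api_key and inst_token arguments of the functions documented on this page.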

class FetchResult(success: list[str] = <factory>, failed: list[tuple[str, str]] = <factory>, output_csv: str = '')

failed: list[tuple[str, str]]

output_csv: str = ''

success: list[str]
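The signature above suggests a dataclass whose list fields use a default factory. The sketch below defines a stand-in with the same three fields so it can run on its own, and shows one way a caller might report on a result. It assumes each entry in failed is a (doi, reason) pair, which matches the failed_file description but is not confirmed here; the summarize helper is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FetchResult:
    # Stand-in mirroring the documented fields; not the library's source code.
    success: list[str] = field(default_factory=list)
    failed: list[tuple[str, str]] = field(default_factory=list)
    output_csv: str = ""


def summarize(result: FetchResult) -> str:
    """Build a short report (hypothetical helper)."""
    lines = [f"{len(result.success)} fetched, {len(result.failed)} failed"]
    # Assumption: each failed entry is a (doi, reason) pair.
    for doi, reason in result.failed:
        lines.append(f"  {doi}: {reason}")
    if result.output_csv:
        lines.append(f"CSV written to {result.output_csv}")
    return "\n".join(lines)
```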

fetch_all(df: DataFrame, api_key: str, inst_token: str, output_dir: str | Path = 'alloys_elsevier', processed_file: str | Path = 'processed_dois_elsevier.txt', failed_file: str | Path = 'failed_dois_elsevier.txt', output_csv: str | Path = 'elsevier_with_xml_paths.csv', max_per_run: int = 1000, sleep_min: float = 2.0, sleep_max: float = 5.0, doi_col: str = 'DOI', verbose: bool = True) → FetchResult

Fetch Elsevier full-text XML for every DOI in df.

Parameters:

df – Input DataFrame with a DOI column (or override with doi_col).
api_key – Elsevier API key.
inst_token – Elsevier institutional token.
output_dir – Directory to write XML files into.
processed_file – Newline-delimited file of already-processed DOIs (resume support).
failed_file – Newline-delimited file of failed DOIs and their failure reasons.
output_csv – CSV written on completion with an added XML_File column.
max_per_run – Stop after this many successful fetches (API rate-limit safety).
sleep_min / sleep_max – Random sleep range (seconds) between requests.
doi_col – Name of the DOI column in df.
verbose – Print progress to stdout.

Return type:

FetchResult
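A call to fetch_all might look like the sketch below, which uses only the parameters documented above. The wrapper name run_elsevier_fetch and the CSV path are hypothetical, the input CSV is assumed to contain a DOI column, and the imports sit inside the function only so the sketch can be loaded without pandas or matmmextract installed.

```python
import os


def run_elsevier_fetch(csv_path, output_dir="alloys_elsevier", max_per_run=50):
    """Load DOIs from csv_path and fetch their XML; returns the FetchResult.

    Hypothetical wrapper for illustration; not provided by matmmextract.
    """
    # Imported lazily so the sketch loads without these packages installed.
    import pandas as pd
    from matmmextract.elsevier.fetcher import fetch_all

    df = pd.read_csv(csv_path)  # assumed to contain a "DOI" column
    return fetch_all(
        df,
        api_key=os.environ["ELSEVIER_API_KEY"],
        inst_token=os.environ["ELSEVIER_INST_TOKEN"],
        output_dir=output_dir,
        max_per_run=max_per_run,  # smaller than the default 1000 for a trial run
        sleep_min=2.0,
        sleep_max=5.0,
        doi_col="DOI",
    )
```

Because processed_file and failed_file are left at their defaults, rerunning the same call should pick up where the previous run stopped, per the resume support described above.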

fetch_fulltext_xml(pii: str, api_key: str, inst_token: str) → tuple[bytes | None, str | None]

Return (xml_bytes, None) on success or (None, error_reason) on failure.

get_pii_from_doi(doi: str, api_key: str, inst_token: str) → str | None

Return the PII for doi, or None on failure.
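The two helpers can be chained for a single DOI: resolve the PII first, then request the XML. The sketch below follows only the return contracts documented above. The function name fetch_xml_for_doi and the "PII lookup failed" reason string are hypothetical, and the two lookups are passed in as parameters so the sketch runs without network access.

```python
from typing import Callable, Optional, Tuple

PiiLookup = Callable[[str, str, str], Optional[str]]
XmlFetch = Callable[[str, str, str], Tuple[Optional[bytes], Optional[str]]]


def fetch_xml_for_doi(
    doi: str,
    api_key: str,
    inst_token: str,
    get_pii: PiiLookup,
    fetch_xml: XmlFetch,
) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (xml_bytes, None) on success or (None, reason) on failure."""
    pii = get_pii(doi, api_key, inst_token)
    if pii is None:
        # Hypothetical reason string; the library may word this differently.
        return None, "PII lookup failed"
    # Pass through the (xml_bytes, error_reason) pair unchanged.
    return fetch_xml(pii, api_key, inst_token)
```

In real use, get_pii_from_doi and fetch_fulltext_xml from matmmextract.elsevier.fetcher would be passed as get_pii and fetch_xml.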