Elsevier Fetch

matmmextract.elsevier.fetcher

Fetch full-text XML from the Elsevier API for a list of DOIs.

Ported from elsevier.py:

  • API credentials are read from environment variables (never hardcoded).

  • Everything is wrapped in fetch_all() / main(), so the module is safe to import.

  • Resume support via processed_dois_elsevier.txt / failed_dois_elsevier.txt.

Environment variables

  • ELSEVIER_API_KEY – Your Elsevier API key.

  • ELSEVIER_INST_TOKEN – Your institutional token.
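
As an illustration of reading these two variables before calling into the module, here is a minimal sketch. The helper name load_credentials and its env parameter are hypothetical and not part of this package; only the two variable names come from the documentation above.

```python
import os


def load_credentials(env=None):
    """Return (api_key, inst_token), or raise if either variable is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    names = ("ELSEVIER_API_KEY", "ELSEVIER_INST_TOKEN")
    # Treat unset and empty values the same way.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["ELSEVIER_API_KEY"], env["ELSEVIER_INST_TOKEN"]
```

The returned pair can then be passed as the api_key and inst_token arguments of the functions documented on this page.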

class FetchResult(success: 'list[str]' = <factory>, failed: 'list[tuple[str, str]]' = <factory>, output_csv: 'str' = '')
failed: list[tuple[str, str]]
output_csv: str = ''
success: list[str]
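
The <factory> defaults in the signature suggest that FetchResult is a dataclass whose list fields use default_factory. The sketch below reproduces that shape under that assumption; FetchResultSketch is a stand-in name, and the real class may differ in detail.

```python
from dataclasses import dataclass, field


@dataclass
class FetchResultSketch:
    # Assumed meaning: DOIs whose XML was fetched.
    success: list[str] = field(default_factory=list)
    # (doi, reason) pairs, matching the list[tuple[str, str]] annotation.
    failed: list[tuple[str, str]] = field(default_factory=list)
    # Path of the CSV written on completion; empty until set.
    output_csv: str = ""
```

Using default_factory gives every instance its own lists, so appending to one result does not leak into another.
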
fetch_all(df: DataFrame, api_key: str, inst_token: str, output_dir: str | Path = 'alloys_elsevier', processed_file: str | Path = 'processed_dois_elsevier.txt', failed_file: str | Path = 'failed_dois_elsevier.txt', output_csv: str | Path = 'elsevier_with_xml_paths.csv', max_per_run: int = 1000, sleep_min: float = 2.0, sleep_max: float = 5.0, doi_col: str = 'DOI', verbose: bool = True) -> FetchResult

Fetch Elsevier full-text XML for every DOI in df.

Parameters:
  • df – Input DataFrame with a DOI column (or override with doi_col).

  • api_key – Elsevier API key.

  • inst_token – Elsevier institutional token.

  • output_dir – Directory to write XML files into.

  • processed_file – Newline-delimited file of already-processed DOIs (resume support).

  • failed_file – Newline-delimited file of failed DOIs + reasons.

  • output_csv – CSV written on completion with an added XML_File column.

  • max_per_run – Stop after this many successful fetches (API rate-limit safety).

  • sleep_min, sleep_max – Lower and upper bounds (in seconds) of the random sleep between requests.

  • doi_col – Name of the DOI column in df.

  • verbose – Print progress to stdout.

Return type:

FetchResult
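
The resume, max_per_run, and random-sleep parameters described above can be pictured with a small loop. This is a sketch of one plausible way such a loop could work, not the module's actual implementation; load_done, run_batch, and the fetch_one callable are hypothetical names, and failure logging is left out for brevity.

```python
import random
import time
from pathlib import Path


def load_done(path):
    """Read a newline-delimited DOI file into a set (empty if the file is absent)."""
    p = Path(path)
    if not p.exists():
        return set()
    lines = p.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def run_batch(dois, fetch_one, processed_file,
              max_per_run=1000, sleep_min=2.0, sleep_max=5.0):
    """Fetch DOIs not yet recorded, stopping after max_per_run successes."""
    done = load_done(processed_file)
    fetched = []
    for doi in dois:
        if len(fetched) >= max_per_run:
            break  # per-run cap reached; a later run picks up the rest
        if doi in done:
            continue  # already handled in an earlier run
        if fetch_one(doi):
            # Record the DOI immediately so an interrupted run can resume.
            with open(processed_file, "a", encoding="utf-8") as fh:
                fh.write(doi + "\n")
            done.add(doi)
            fetched.append(doi)
        # Pause for a random interval between requests.
        time.sleep(random.uniform(sleep_min, sleep_max))
    return fetched
```

Calling run_batch twice with the same processed_file skips everything the first call recorded, which is the behaviour the processed_file parameter is meant to provide.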

fetch_fulltext_xml(pii: str, api_key: str, inst_token: str) -> tuple[bytes | None, str | None]

Return (xml_bytes, None) on success or (None, error_reason) on failure.

get_pii_from_doi(doi: str, api_key: str, inst_token: str) -> str | None

Return the PII for doi, or None on failure.
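
The two helpers compose naturally: resolve a DOI to a PII, then fetch the XML, keeping the (value, error) tuple convention throughout. The sketch below shows that composition with stand-in callables so it runs without network access; fetch_by_doi, fake_lookup, fake_fetch, and the DOI and PII values are all made up for illustration.

```python
def fetch_by_doi(doi, lookup_pii, fetch_xml):
    """Return (xml_bytes, None) on success or (None, reason) on failure."""
    pii = lookup_pii(doi)
    if pii is None:
        # Mirror the documented convention: no payload, a reason string instead.
        return None, f"no PII found for {doi}"
    return fetch_xml(pii)


# Stand-ins for get_pii_from_doi and fetch_fulltext_xml (illustrative only).
def fake_lookup(doi):
    return {"10.1016/example": "S0000000000000000"}.get(doi)


def fake_fetch(pii):
    return b"<full-text/>", None


xml_bytes, error = fetch_by_doi("10.1016/example", fake_lookup, fake_fetch)
if error is None:
    print(f"fetched {len(xml_bytes)} bytes")
else:
    print(f"failed: {error}")
```

In real use, the stand-ins would presumably be replaced by thin wrappers that bind api_key and inst_token, for example lambda doi: get_pii_from_doi(doi, api_key, inst_token).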