Elsevier Fetch
matmmextract.elsevier.fetcher
Fetch full-text XML from the Elsevier API for a list of DOIs.
Ported from elsevier.py:
- API credentials read from environment variables (never hardcoded)
- Everything wrapped in fetch_all() / main() — safe to import
- Resume support via processed_dois_elsevier.txt / failed_dois_elsevier.txt
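The resume behaviour noted above can be pictured with a small sketch. The fetcher's actual implementation is not shown on this page, so the helper names below (load_processed, pending_dois, mark_processed) are hypothetical, and the file layout assumed is the newline-delimited format described for processed_file.

```python
from pathlib import Path


def load_processed(path):
    """Return the set of DOIs listed one per line in `path` (empty set if the file is missing)."""
    p = Path(path)
    if not p.exists():
        return set()
    lines = p.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending_dois(all_dois, processed_path):
    """Keep the input order, dropping DOIs that an earlier run already recorded."""
    done = load_processed(processed_path)
    return [doi for doi in all_dois if doi not in done]


def mark_processed(doi, processed_path):
    """Append one DOI so that an interrupted run can resume after it."""
    with open(processed_path, "a", encoding="utf-8") as fh:
        fh.write(doi + "\n")
```

Appending after each successful fetch, rather than writing the file once at the end, is what would let a crashed or rate-limited run continue without refetching.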
Environment variables
ELSEVIER_API_KEY – Your Elsevier API key.
ELSEVIER_INST_TOKEN – Your institutional token.
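One way a caller might read these two variables before invoking the fetcher is sketched below. The helper name read_credentials is hypothetical and not part of the module. It assumes both variables are required and fails early when either is unset.

```python
import os


def read_credentials(env=None):
    """Return (api_key, inst_token) from the environment, or raise if either is unset."""
    # `env` defaults to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    names = ("ELSEVIER_API_KEY", "ELSEVIER_INST_TOKEN")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return env["ELSEVIER_API_KEY"], env["ELSEVIER_INST_TOKEN"]
```

The returned pair corresponds to the api_key and inst_token arguments of fetch_all().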
- class FetchResult(success: list[str] = <factory>, failed: list[tuple[str, str]] = <factory>, output_csv: str = '')
- failed: list[tuple[str, str]]
- output_csv: str = ''
- success: list[str]
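The `<factory>` defaults in the signature suggest a dataclass whose list fields use default_factory. The sketch below is a reconstruction from that signature, not the module's source. Reading each failed entry as a (DOI, reason) pair is an assumption based on the failed_file description, and the DOIs shown are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class FetchResult:
    # Built-in generics such as list[str] need Python 3.9 or later.
    success: list[str] = field(default_factory=list)               # DOIs fetched successfully
    failed: list[tuple[str, str]] = field(default_factory=list)    # assumed (DOI, reason) pairs
    output_csv: str = ""                                           # path of the CSV written on completion


# Each instance gets its own lists, so results from separate runs do not mix.
result = FetchResult()
result.success.append("10.1016/j.placeholder.0001")
result.failed.append(("10.1016/j.placeholder.0002", "HTTP 404"))
```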
- fetch_all(df: DataFrame, api_key: str, inst_token: str, output_dir: str | Path = 'alloys_elsevier', processed_file: str | Path = 'processed_dois_elsevier.txt', failed_file: str | Path = 'failed_dois_elsevier.txt', output_csv: str | Path = 'elsevier_with_xml_paths.csv', max_per_run: int = 1000, sleep_min: float = 2.0, sleep_max: float = 5.0, doi_col: str = 'DOI', verbose: bool = True) -> FetchResult
Fetch Elsevier full-text XML for every DOI in df.
- Parameters:
df – Input DataFrame with a DOI column (or override with doi_col).
api_key – Elsevier API key.
inst_token – Elsevier institutional token.
output_dir – Directory to write XML files into.
processed_file – Newline-delimited file of already-processed DOIs (resume support).
failed_file – Newline-delimited file of failed DOIs + reasons.
output_csv – CSV written on completion with an added XML_File column.
max_per_run – Stop after this many successful fetches (API rate-limit safety).
sleep_min / sleep_max – Random sleep range (seconds) between requests.
doi_col – Name of the DOI column in df.
verbose – Print progress to stdout.
- Return type: