Downloader

matmmextract.shared.downloader

Threaded image download engine shared by the Elsevier and Springer download scripts.

The two original scripts (elsevier_download_image.py and download_springer_img.py) had largely parallel structures. This module provides:

  • next_name_allocator – resume-safe sequential image naming (ported from the Springer script; replaces the hardcoded img_counter = 13101 in Elsevier)

  • load_download_log – read a previous log to skip already-done URLs

  • append_log – append rows to the CSV log incrementally

  • DownloadResult – typed result from a single download attempt

  • download_one – download one URL with retries

  • run_downloads – orchestrate a full threaded batch

Each publisher supplies its own build_candidate_urls callable and request_headers dict; everything else is shared.
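As an illustration of the publisher-specific pieces, the following is a minimal sketch of a build_candidate_urls callable and a request_headers dict. The base URL, the header values, and the rule of resolving relative paths under a directory named after the XML file are all hypothetical placeholders, not the behaviour of either original script.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical placeholder; a real publisher script would supply its own base.
HYPOTHETICAL_BASE = "https://images.example.org/"

# Hypothetical headers; real scripts would add their own auth keys.
request_headers = {"User-Agent": "example-downloader/0.1", "Accept": "image/*"}


def build_candidate_urls(image_url: str, xml_file: str) -> list[str]:
    """Return an ordered list of URLs to try for one figure."""
    # Absolute URLs are used as-is.
    if urlparse(image_url).scheme in ("http", "https"):
        return [image_url]
    # Assumption: relative paths live under a directory named after the XML stem.
    stem = xml_file.rsplit(".", 1)[0]
    return [urljoin(HYPOTHETICAL_BASE, f"{stem}/{image_url}")]
```

The callable matches the Callable[[str, str], list[str]] type that run_downloads expects, so it could be passed directly as the build_candidate_urls argument.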

class DownloadResult(row_index: Any, image_url: str, status: str, downloaded_image_name: str, local_path: str, attempted_url: str, error: str = '')
attempted_url: str
downloaded_image_name: str
error: str = ''
image_url: str
local_path: str
row_index: Any
status: str

append_log(log_file: str | Path, rows: list[dict]) -> None

Append rows to the CSV log, writing a header if the file is new.

Parameters:
  • log_file – Destination CSV path.

  • rows – List of dicts; all must share the same keys.
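The header-if-new behaviour can be sketched with the standard csv module. This is an illustrative stand-in under the assumption that the field names come from the first row; append_log_sketch is a hypothetical name and the module's actual implementation may differ.

```python
import csv
from pathlib import Path


def append_log_sketch(log_file, rows):
    """Append dict rows to a CSV file, writing the header only for a new file."""
    if not rows:
        return  # nothing to write
    path = Path(log_file)
    # A missing or empty file needs a header row first.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        # Field order is taken from the first row; all rows must share its keys.
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Opening in append mode means a crash between batches loses at most the rows of the batch in flight, which is what makes incremental logging useful for resuming.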

download_one(index: Any, image_url: str, candidate_urls: list[str], local_path: Path, local_name: str, headers: dict, retries: int = 3, timeout: int = 90) -> DownloadResult

Attempt to download one image, trying each candidate URL in order.

Parameters:
  • index – DataFrame row index (passed through to the result for later join).

  • image_url – Original URL from the figure CSV (used as the lookup key in logs).

  • candidate_urls – Ordered list of URLs to try (built by the caller’s build_candidate_urls function).

  • local_path – Full path where the file should be written on success.

  • local_name – Bare filename (used in the result for the CSV column).

  • headers – HTTP request headers (auth tokens, User-Agent, Accept, etc.).

  • retries – Attempts per candidate URL before moving to the next.

  • timeout – Per-request timeout in seconds.

Return type:

DownloadResult
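The "each candidate in order, several attempts per candidate" control flow can be sketched independently of any HTTP library. In the sketch below, try_candidates and its fetch argument are hypothetical: fetch stands in for whatever performs the request and raises on failure, and the returned tuple is only a rough analogue of the attempted_url and error fields of DownloadResult.

```python
import time


def try_candidates(candidate_urls, fetch, retries=3, pause=0.0):
    """Return (attempted_url, payload, error) for the first successful fetch.

    On total failure the payload is None and error describes the last exception.
    """
    last_url, last_error = "", ""
    for url in candidate_urls:
        for _attempt in range(retries):
            last_url = url
            try:
                return url, fetch(url), ""
            except Exception as exc:  # sketch only; real code would catch narrower errors
                last_error = f"{type(exc).__name__}: {exc}"
                if pause:
                    time.sleep(pause)  # optional delay before the next attempt
    return last_url, None, last_error
```

Injecting fetch keeps the retry logic testable without network access; a caller could wrap a real HTTP client call that passes headers and timeout.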

extension_from_url(url: str) -> str

Infer a file extension from a URL, defaulting to .jpg.
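One plausible way to do this is to parse the URL and take the suffix of its path, ignoring any query string. The sketch below is an assumption about the approach, not the module's code; extension_from_url_sketch is a hypothetical name, and the real function may additionally restrict the result to known image extensions.

```python
import os
from urllib.parse import urlparse


def extension_from_url_sketch(url, default=".jpg"):
    """Return the lowercase extension of the URL path, or default if none."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    return ext if ext else default
```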

load_download_log(log_file: str | Path) -> dict[str, dict]

Load a previous download log CSV into a {image_url: result} dict.

Only "success" rows whose local file still exists are returned.

Parameters:
  • log_file – Path to the CSV log written by previous runs.

Returns:

Keys are image_url strings; values are dicts with keys download_status, downloaded_image_name, downloaded_image_path, attempted_url.

Return type:

dict
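The filtering rule (keep only "success" rows whose file still exists) can be sketched with csv.DictReader. The sketch assumes the log's column names match the result keys listed above plus an image_url column; load_log_sketch is a hypothetical name and the actual log layout may differ.

```python
import csv
from pathlib import Path

RESULT_KEYS = (
    "download_status",
    "downloaded_image_name",
    "downloaded_image_path",
    "attempted_url",
)


def load_log_sketch(log_file):
    """Map image_url -> result dict for successful rows whose file still exists."""
    path = Path(log_file)
    if not path.exists():
        return {}  # first run: nothing to resume from
    done = {}
    with path.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            if row.get("download_status") != "success":
                continue
            # Skip rows whose image was deleted since the previous run.
            if not Path(row.get("downloaded_image_path") or "").is_file():
                continue
            done[row["image_url"]] = {k: row.get(k, "") for k in RESULT_KEYS}
    return done
```

Checking the file on disk as well as the logged status means a manually deleted image is downloaded again on the next run instead of being silently skipped.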

next_name_allocator(output_dir: str | Path, df: DataFrame, name_prefix: str) -> Callable[[str], str]

Return a closure that allocates the next unique image filename.

Scans output_dir for existing files matching <name_prefix><N>.* and the downloaded_image_name column of df so that resumed runs never overwrite previous downloads.

Parameters:
  • output_dir – Directory where images will be saved.

  • df – DataFrame that may already have a downloaded_image_name column from a previous run.

  • name_prefix – Prefix string, e.g. "alloy_img" or "img".

Returns:

allocate(ext) → unique filename string (e.g. "alloy_img42.jpg").

Return type:

Callable[[str], str]
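The closure pattern described here can be sketched as follows. To stay free of pandas, this hypothetical make_allocator takes a plain iterable of already-used names in place of df; it is an illustration of the idea, not the module's implementation, and it is not thread-safe on its own.

```python
import re
from pathlib import Path


def make_allocator(output_dir, known_names, name_prefix):
    """Return allocate(ext), which yields names numbered above every existing one."""
    pattern = re.compile(rf"^{re.escape(name_prefix)}(\d+)\.")
    names = [str(n) for n in known_names]
    out = Path(output_dir)
    if out.is_dir():
        names += [p.name for p in out.iterdir()]
    # Start counting from the highest number seen on disk or in known_names.
    counter = 0
    for name in names:
        match = pattern.match(name)
        if match:
            counter = max(counter, int(match.group(1)))

    def allocate(ext):
        nonlocal counter
        counter += 1
        return f"{name_prefix}{counter}{ext}"

    return allocate
```

Anchoring the pattern at the start of the name keeps prefixes such as "img" from matching files that belong to "alloy_img".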

run_downloads(df: DataFrame, output_dir: str | Path, output_csv: str | Path, log_file: str | Path, build_candidate_urls: Callable[[str, str], list[str]], request_headers: dict, name_prefix: str = 'img', max_workers: int = 4, retries: int = 3, timeout: int = 90, cool_every: int = 100, cool_seconds: float = 60.0, long_cool_every: int = 1500, long_cool_seconds: float = 300.0, xml_file_col: str = 'xml_file', verbose: bool = True) -> DataFrame

Download all images referenced in df, with resume support.

Parameters:
  • df – Figure DataFrame with at least an image_url column. Modified in place: the download_status, downloaded_image_name, and downloaded_image_path columns are added or updated.

  • output_dir – Directory where image files are saved.

  • output_csv – CSV written after each batch (safe to resume from).

  • log_file – Per-URL download log CSV.

  • build_candidate_urls – Publisher-specific URL builder, called as build_candidate_urls(image_url, xml_file) and returning an ordered list of URLs to try.

  • request_headers – HTTP headers for every request (auth keys, User-Agent, etc.).

  • name_prefix – Image filename prefix (e.g. "alloy_img" or "img").

  • max_workers – Thread pool size.

  • retries – Attempts per candidate URL inside download_one().

  • timeout – Per-request timeout in seconds.

  • cool_every / cool_seconds – Pause for cool_seconds after every cool_every real downloads.

  • long_cool_every / long_cool_seconds – Longer pause of long_cool_seconds after every long_cool_every real downloads.

  • xml_file_col – Column in df holding the source XML filename (used by build_candidate_urls for relative-URL resolution).

  • verbose – Print progress lines.

Returns:

Updated df (also written to output_csv).

Return type:

pd.DataFrame
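To make the two cool-down settings concrete, the sketch below computes the pause that would follow the n-th real download under the default values. How run_downloads combines the two thresholds is not stated above; the rule that the long pause takes precedence when both apply is an assumption, and cooldown_after is a hypothetical helper.

```python
def cooldown_after(done_count, cool_every=100, cool_seconds=60.0,
                   long_cool_every=1500, long_cool_seconds=300.0):
    """Return the pause in seconds after done_count real downloads (0.0 if none)."""
    if done_count <= 0:
        return 0.0
    # Assumption: the long pause wins when both thresholds are hit at once.
    if long_cool_every and done_count % long_cool_every == 0:
        return long_cool_seconds
    if cool_every and done_count % cool_every == 0:
        return cool_seconds
    return 0.0
```

Under these assumed rules, a batch of 1500 real downloads with the defaults would pause 14 times for 60 seconds and once for 300 seconds.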