Downloader

matmmextract.shared.downloader

Threaded image download engine shared by the Elsevier and Springer download scripts.

The two original scripts (elsevier_download_image.py and download_springer_img.py) had largely parallel structures. This module provides:

next_name_allocator – resume-safe sequential image naming (ported from the Springer script; replaces the hardcoded img_counter = 13101 in the Elsevier script)
load_download_log – read a previous log to skip already-downloaded URLs
append_log – append rows to the CSV log incrementally
DownloadResult – typed result from a single download attempt
download_one – download one URL with retries
run_downloads – orchestrate a full threaded batch

Each publisher supplies its own build_candidate_urls callable and request_headers dict; everything else is shared.
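As an illustration of the two publisher-specific pieces, the sketch below shows one possible build_candidate_urls callable and request_headers dict. The base URL, the per-article directory layout, and the header values are all hypothetical and are not taken from either publisher script.

```python
from urllib.parse import urljoin

# Hypothetical base URL, used only for illustration.
EXAMPLE_BASE = "https://example.org/articles/"


def build_candidate_urls(image_url: str, xml_file: str) -> list[str]:
    """Return an ordered list of URLs to try for one figure."""
    if image_url.startswith(("http://", "https://")):
        # Absolute URLs are tried as-is.
        return [image_url]
    # Assumed layout: relative paths live under a directory named after
    # the source XML file (without its extension).
    article_id = xml_file.rsplit(".", 1)[0]
    return [urljoin(f"{EXAMPLE_BASE}{article_id}/", image_url)]


# Hypothetical headers; a real publisher API would typically need an auth key.
request_headers = {"User-Agent": "example-downloader/0.1", "Accept": "image/*"}

absolute = build_candidate_urls("https://a.example/c.png", "x.xml")
relative = build_candidate_urls("fig1.jpg", "paper42.xml")
```

A real builder would probably return several candidates per figure (for example, different resolutions), ordered from most to least preferred.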

class DownloadResult(row_index: Any, image_url: str, status: str, downloaded_image_name: str, local_path: str, attempted_url: str, error: str = '')

attempted_url: str

downloaded_image_name: str

error: str = ''

image_url: str

local_path: str

row_index: Any

status: str
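The fields above map naturally onto a dataclass. The sketch below mirrors the documented signature under the assumption that the real class behaves like a plain dataclass; the name DownloadResultSketch and the sample values are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class DownloadResultSketch:
    """Stand-in with the same fields, in the order given by the signature."""
    row_index: Any
    image_url: str
    status: str
    downloaded_image_name: str
    local_path: str
    attempted_url: str
    error: str = ""  # empty unless the attempt failed


result = DownloadResultSketch(
    row_index=0,
    image_url="https://example.org/fig1.jpg",
    status="success",
    downloaded_image_name="img1.jpg",
    local_path="images/img1.jpg",
    attempted_url="https://example.org/fig1.jpg",
)

# A dict view like this is one way to turn a result into a CSV log row.
log_row = asdict(result)
```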

append_log(log_file: str | Path, rows: list[dict]) → None

Append rows to the CSV log, writing a header if the file is new.

Parameters:

log_file – Destination CSV path.
rows – List of dicts; all must share the same keys.
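The append-with-header behaviour described above could be implemented roughly as follows. This is a minimal sketch using the standard csv module, not the module's actual code; append_log_sketch and the sample column names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Union


def append_log_sketch(log_file: Union[str, Path], rows: list[dict]) -> None:
    """Append rows to a CSV, writing the header only if the file is new or empty."""
    if not rows:
        return
    path = Path(log_file)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        # All rows are expected to share the keys of the first row.
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Usage: two separate appends yield one header line and two data lines.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "download_log.csv"
    append_log_sketch(log_path, [{"image_url": "https://example.org/a.jpg", "download_status": "success"}])
    append_log_sketch(log_path, [{"image_url": "https://example.org/b.jpg", "download_status": "failed"}])
    lines = log_path.read_text(encoding="utf-8").splitlines()
```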

download_one(index: Any, image_url: str, candidate_urls: list[str], local_path: Path, local_name: str, headers: dict, retries: int = 3, timeout: int = 90) → DownloadResult

Attempt to download one image, trying each candidate URL in order.

Parameters:

index – DataFrame row index (passed through to the result for later join).
image_url – Original URL from the figure CSV (used as the lookup key in logs).
candidate_urls – Ordered list of URLs to try (built by the caller’s build_candidate_urls function).
local_path – Full path where the file should be written on success.
local_name – Bare filename (used in the result for the CSV column).
headers – HTTP request headers (auth tokens, User-Agent, Accept, etc.).
retries – Attempts per candidate URL before moving to the next.
timeout – Per-request timeout in seconds.

Return type:

DownloadResult
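The candidate-and-retry order described above (exhaust the retries for one candidate URL, then move to the next) can be sketched without any network access by injecting the fetch step. The fetch callable, the backoff_seconds parameter, and the returned tuple are assumptions made for this illustration; the real function performs the HTTP request itself and returns a DownloadResult.

```python
import time
from typing import Callable, Optional


def try_candidates(
    candidate_urls: list[str],
    fetch: Callable[[str], bytes],  # hypothetical: returns bytes or raises
    retries: int = 3,
    backoff_seconds: float = 0.0,  # assumption: the real wait policy is unknown
) -> tuple[Optional[bytes], str, str]:
    """Return (content, attempted_url, error) for the first URL that succeeds."""
    last_url, last_error = "", "no candidate URLs"
    for url in candidate_urls:
        for attempt in range(1, retries + 1):
            last_url = url
            try:
                return fetch(url), url, ""
            except Exception as exc:
                last_error = f"{type(exc).__name__}: {exc}"
                if attempt < retries and backoff_seconds:
                    time.sleep(backoff_seconds * attempt)
    return None, last_url, last_error


calls = []


def fake_fetch(url: str) -> bytes:
    """Test double: any URL containing 'bad' fails."""
    calls.append(url)
    if "bad" in url:
        raise ConnectionError("unreachable")
    return b"image-bytes"


content, used_url, error = try_candidates(
    ["https://bad.example/a.jpg", "https://ok.example/a.jpg"], fake_fetch, retries=2
)
failed = try_candidates(["https://bad.example/x.jpg"], fake_fetch, retries=1)
```

In this run the first candidate is attempted twice before the second one succeeds, so used_url records which candidate actually produced the file.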

extension_from_url(url: str) → str

Infer a file extension from a URL, defaulting to .jpg.
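One plausible way to infer the extension is to look only at the path component of the URL, so that query strings do not interfere. The sketch below does that; lower-casing the result is an assumption, and extension_from_url_sketch is a hypothetical name.

```python
import os
from urllib.parse import urlparse


def extension_from_url_sketch(url: str, default: str = ".jpg") -> str:
    """Return the extension found in the URL path, or the default if there is none."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    return ext if ext else default


with_ext = extension_from_url_sketch("https://example.org/fig/a.PNG?size=large")
without_ext = extension_from_url_sketch("https://example.org/image/12345")
```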

load_download_log(log_file: str | Path) → dict[str, dict]

Load a previous download log CSV into a {image_url: result} dict.

Only "success" rows whose local file still exists are returned.

Parameters:

log_file – Path to the CSV log written by previous runs.

Returns:

Keys are image_url strings; values are dicts with keys download_status, downloaded_image_name, downloaded_image_path, attempted_url.

Return type:

dict
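The filtering rule above (keep only "success" rows whose file still exists) might look like the following. This is a sketch that assumes the log columns carry the same names as the returned keys plus an image_url column; load_log_sketch and RESULT_KEYS are hypothetical names.

```python
import csv
import tempfile
from pathlib import Path

RESULT_KEYS = ("download_status", "downloaded_image_name", "downloaded_image_path", "attempted_url")


def load_log_sketch(log_file) -> dict:
    """Map image_url -> result dict for successful rows whose file still exists."""
    path = Path(log_file)
    done = {}
    if not path.exists():
        return done
    with path.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            if row.get("download_status") != "success":
                continue
            if not Path(row.get("downloaded_image_path") or "").is_file():
                continue  # file was deleted since the last run: download again
            done[row["image_url"]] = {key: row.get(key, "") for key in RESULT_KEYS}
    return done


with tempfile.TemporaryDirectory() as tmp:
    kept = Path(tmp) / "img1.jpg"
    kept.write_bytes(b"fake image")
    log_path = Path(tmp) / "download_log.csv"
    with log_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["image_url", *RESULT_KEYS])
        writer.writerow(["u1", "success", "img1.jpg", str(kept), "u1"])
        writer.writerow(["u2", "success", "img2.jpg", str(Path(tmp) / "img2.jpg"), "u2"])
        writer.writerow(["u3", "failed", "", "", "u3"])
    done = load_log_sketch(log_path)
```

Only u1 survives: u2 is marked successful but its file is missing, and u3 failed.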

next_name_allocator(output_dir: str | Path, df: DataFrame, name_prefix: str) → Callable[[str], str]

Return a closure that allocates the next unique image filename.

Scans output_dir for existing files matching <name_prefix><N>.* and the downloaded_image_name column of df so that resumed runs never overwrite previous downloads.

Parameters:

output_dir – Directory where images will be saved.
df – DataFrame that may already have a downloaded_image_name column from a previous run.
name_prefix – Prefix string, e.g. "alloy_img" or "img".

Returns:

allocate(ext) → unique filename string (e.g. "alloy_img42.jpg").

Return type:

Callable[[str], str]
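The resume-safe naming idea can be sketched as a closure over a counter seeded from the highest number already in use. To stay dependency-free, this sketch takes a plain iterable of known names in place of the DataFrame column; make_allocator, known_names, and the lock are assumptions for illustration.

```python
import re
import tempfile
import threading
from pathlib import Path
from typing import Callable, Iterable


def make_allocator(output_dir, known_names: Iterable[str], name_prefix: str) -> Callable[[str], str]:
    """Return allocate(ext), which yields names numbered above everything seen so far."""
    pattern = re.compile(rf"^{re.escape(name_prefix)}(\d+)\.")
    directory = Path(output_dir)
    on_disk = [p.name for p in directory.iterdir()] if directory.is_dir() else []
    counter = 0
    for name in [*on_disk, *known_names]:
        match = pattern.match(str(name))
        if match:
            counter = max(counter, int(match.group(1)))
    lock = threading.Lock()  # assumption: guards the counter if called from threads

    def allocate(ext: str) -> str:
        nonlocal counter
        with lock:
            counter += 1
            return f"{name_prefix}{counter}{ext}"

    return allocate


with tempfile.TemporaryDirectory() as tmp:
    for existing in ("img3.jpg", "img10.png", "other7.jpg"):
        (Path(tmp) / existing).touch()
    allocate = make_allocator(tmp, ["img12.jpg", "notes.txt"], "img")
    first = allocate(".jpg")
    second = allocate(".png")
```

Here the highest existing number is 12 (from the known names), so allocation continues at 13 regardless of the gaps below it.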

run_downloads(df: DataFrame, output_dir: str | Path, output_csv: str | Path, log_file: str | Path, build_candidate_urls: Callable[[str, str], list[str]], request_headers: dict, name_prefix: str = 'img', max_workers: int = 4, retries: int = 3, timeout: int = 90, cool_every: int = 100, cool_seconds: float = 60.0, long_cool_every: int = 1500, long_cool_seconds: float = 300.0, xml_file_col: str = 'xml_file', verbose: bool = True) → DataFrame

Download all images referenced in df, with resume support.

Parameters:

df – Figure DataFrame with at least an image_url column. Modified in place: the download_status, downloaded_image_name, and downloaded_image_path columns are added or updated.
output_dir – Directory where image files are saved.
output_csv – CSV written after each batch (safe to resume from).
log_file – Per-URL download log CSV.
build_candidate_urls – Publisher-specific URL builder with the signature (image_url, xml_file) -> [url, ...].
request_headers – HTTP headers for every request (auth keys, User-Agent, etc.).
name_prefix – Image filename prefix (e.g. "alloy_img" or "img").
max_workers – Thread pool size.
retries – Attempts per candidate URL inside download_one().
timeout – Per-request timeout in seconds.
cool_every / cool_seconds – Pause for cool_seconds seconds after every cool_every real downloads.
long_cool_every / long_cool_seconds – Pause for the longer long_cool_seconds seconds after every long_cool_every real downloads.
xml_file_col – Column in df holding the source XML filename (used by build_candidate_urls for relative-URL resolution).
verbose – Print progress lines.

Returns:

Updated df (also written to output_csv).

Return type:

pd.DataFrame
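The two cooldown pairs can be read as a simple schedule keyed on the number of completed real downloads. The sketch below uses the documented defaults and assumes that the long pause takes precedence when both intervals coincide; the actual precedence inside run_downloads is not documented here, and cooldown_after is a hypothetical helper.

```python
def cooldown_after(
    completed: int,
    cool_every: int = 100,
    cool_seconds: float = 60.0,
    long_cool_every: int = 1500,
    long_cool_seconds: float = 300.0,
) -> float:
    """Return how long to pause after the given number of real downloads."""
    if completed <= 0:
        return 0.0
    # Assumption: the long pause wins when both intervals line up.
    if long_cool_every > 0 and completed % long_cool_every == 0:
        return long_cool_seconds
    if cool_every > 0 and completed % cool_every == 0:
        return cool_seconds
    return 0.0


pauses = {n: cooldown_after(n) for n in (99, 100, 200, 1500, 3000)}
```

With the defaults, a batch of 1500 real downloads would therefore include fourteen 60-second pauses and one 300-second pause under this reading.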