Downloader
matmmextract.shared.downloader
Threaded image download engine shared by the Elsevier and Springer download scripts.
The two original scripts (elsevier_download_image.py and
download_springer_img.py) had largely parallel structures.
This module provides:
- next_name_allocator – resume-safe sequential image naming (ported from the Springer script; replaces the hardcoded img_counter = 13101 in Elsevier)
- load_download_log – read a previous log to skip already-done URLs
- append_log – append rows to the CSV log incrementally
- DownloadResult – typed result from a single download attempt
- download_one – download one URL with retries
- run_downloads – orchestrate a full threaded batch
Each publisher supplies its own build_candidate_urls callable and
request_headers dict; everything else is shared.
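As an illustration of the two publisher-specific pieces, here is a minimal sketch of what a caller might supply. The base URL, the rule for resolving relative paths, and the header values are all hypothetical; the real Elsevier and Springer scripts are not shown in this reference and may build their candidates differently.

```python
from urllib.parse import urljoin

# Hypothetical base URL, used only for this illustration.
EXAMPLE_BASE = "https://example.org/articles/"


def build_candidate_urls(image_url: str, xml_file: str) -> list[str]:
    """Return an ordered list of URLs to try for one figure."""
    # Absolute URLs are used as they are.
    if image_url.startswith(("http://", "https://")):
        return [image_url]
    # Assumption: relative URLs live under a directory named after the XML file stem.
    stem = xml_file.rsplit(".", 1)[0]
    return [urljoin(f"{EXAMPLE_BASE}{stem}/", image_url)]


# Hypothetical headers; a real publisher API would usually also need an auth key.
request_headers = {
    "User-Agent": "example-downloader/0.1",
    "Accept": "image/*",
}
```

The callable matches the `(image_url, xml_file) -> [url, ...]` shape that run_downloads expects.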
- class DownloadResult(row_index: Any, image_url: str, status: str, downloaded_image_name: str, local_path: str, attempted_url: str, error: str = '')
- attempted_url: str
- downloaded_image_name: str
- error: str = ''
- image_url: str
- local_path: str
- row_index: Any
- status: str
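The signature above reads like a dataclass. The sketch below mirrors the documented fields under that assumption; the class name DownloadResultSketch and the example values are hypothetical, and the real class may be defined differently.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class DownloadResultSketch:
    """Stand-in that mirrors the documented DownloadResult fields."""

    row_index: Any
    image_url: str
    status: str
    downloaded_image_name: str
    local_path: str
    attempted_url: str
    error: str = ""  # stays empty unless an attempt failed


# Example values are made up for illustration.
result = DownloadResultSketch(
    row_index=7,
    image_url="https://example.org/fig1.jpg",
    status="success",
    downloaded_image_name="img1.jpg",
    local_path="images/img1.jpg",
    attempted_url="https://example.org/fig1.jpg",
)

# A plain dict is convenient for writing one CSV log row.
row = asdict(result)
```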
- append_log(log_file: str | Path, rows: list[dict]) → None
Append rows to the CSV log, writing a header if the file is new.
- Parameters:
log_file – Destination CSV path.
rows – List of dicts; all must share the same keys.
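The append-with-header behaviour can be sketched with the standard csv module. This is an approximation of the described behaviour, not the module's source: the name append_log_sketch is hypothetical, and treating an empty file as new is an assumption.

```python
import csv
from pathlib import Path


def append_log_sketch(log_file, rows):
    """Append dict rows to a CSV file, writing the header only once."""
    if not rows:
        return
    path = Path(log_file)
    # Assumption: a missing or empty file counts as "new" and gets a header.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        # All rows are expected to share the keys of the first row.
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Because each call opens the file in append mode, rows written before an interruption are already on disk when the run is resumed.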
- download_one(index: Any, image_url: str, candidate_urls: list[str], local_path: Path, local_name: str, headers: dict, retries: int = 3, timeout: int = 90) → DownloadResult
Attempt to download one image, trying each candidate URL in order.
- Parameters:
index – DataFrame row index (passed through to the result for later join).
image_url – Original URL from the figure CSV (used as the lookup key in logs).
candidate_urls – Ordered list of URLs to try (built by the caller’s build_candidate_urls function).
local_path – Full path where the file should be written on success.
local_name – Bare filename (used in the result for the CSV column).
headers – HTTP request headers (auth tokens, User-Agent, Accept, etc.).
retries – Attempts per candidate URL before moving to the next.
timeout – Per-request timeout in seconds.
- Return type:
DownloadResult
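The control flow described above (each candidate URL gets up to retries attempts before the next one is tried) can be sketched as follows. To keep the example runnable without network access, the HTTP call is replaced by a hypothetical fetch callable that returns bytes or raises; the function name, the tuple return value, and the "failed" status string are assumptions, not the module's API.

```python
from pathlib import Path


def try_candidates(candidate_urls, local_path, fetch, retries=3):
    """Try each URL in order; return (status, attempted_url, error)."""
    attempted = ""
    last_error = ""
    for url in candidate_urls:
        for _ in range(retries):
            attempted = url
            try:
                data = fetch(url)  # hypothetical: returns bytes or raises
            except Exception as exc:
                # Remember the failure and retry the same URL.
                last_error = f"{type(exc).__name__}: {exc}"
                continue
            # First successful response wins; write it and stop.
            Path(local_path).write_bytes(data)
            return "success", url, ""
    # Every candidate was exhausted without a successful response.
    return "failed", attempted, last_error
```

A real implementation would also pass the headers and timeout to the HTTP client on every attempt.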
- load_download_log(log_file: str | Path) → dict[str, dict]
Load a previous download log CSV into a {image_url: result} dict.
Only "success" rows whose local file still exists are returned.
- Parameters:
log_file – Path to the CSV log written by previous runs.
- Returns:
Keys are image_url strings; values are dicts with keys download_status, downloaded_image_name, downloaded_image_path, attempted_url.
- Return type:
dict
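The filtering rule (keep only "success" rows whose file is still on disk) might look like the sketch below. It assumes the log has an image_url column plus the four value keys listed above; the function name load_download_log_sketch is hypothetical.

```python
import csv
from pathlib import Path

RESULT_KEYS = (
    "download_status",
    "downloaded_image_name",
    "downloaded_image_path",
    "attempted_url",
)


def load_download_log_sketch(log_file):
    """Return {image_url: result} for rows that can safely be skipped."""
    path = Path(log_file)
    if not path.exists():
        return {}
    done = {}
    with path.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            if row.get("download_status") != "success":
                continue
            # A deleted or moved file means the URL must be fetched again.
            if not Path(row.get("downloaded_image_path") or "").is_file():
                continue
            done[row["image_url"]] = {k: row.get(k, "") for k in RESULT_KEYS}
    return done
```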
- next_name_allocator(output_dir: str | Path, df: DataFrame, name_prefix: str) → Callable[[str], str]
Return a closure that allocates the next unique image filename.
Scans output_dir for existing files matching <name_prefix><N>.* and the downloaded_image_name column of df so that resumed runs never overwrite previous downloads.
- Parameters:
output_dir – Directory where images will be saved.
df – DataFrame that may already have a downloaded_image_name column from a previous run.
name_prefix – Prefix string, e.g. "alloy_img" or "img".
- Returns:
allocate(ext) → unique filename string (e.g. "alloy_img42.jpg").
- Return type:
Callable[[str], str]
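A minimal sketch of the resume-safe allocator idea follows. To stay free of a pandas dependency it takes a plain iterable of previously recorded names (known_names) in place of the DataFrame; that parameter, the function name, and the choice to continue from the highest number seen are assumptions for illustration.

```python
import re
from pathlib import Path


def next_name_allocator_sketch(output_dir, known_names, name_prefix):
    """Return allocate(ext), which yields names above every number seen so far."""
    pattern = re.compile(rf"^{re.escape(name_prefix)}(\d+)\.")
    out = Path(output_dir)
    on_disk = [p.name for p in out.iterdir()] if out.is_dir() else []

    highest = 0
    for name in [*on_disk, *known_names]:
        match = pattern.match(str(name))  # str() tolerates None or NaN cells
        if match:
            highest = max(highest, int(match.group(1)))

    counter = highest

    def allocate(ext):
        nonlocal counter
        counter += 1
        # Accept both "jpg" and ".jpg".
        return f"{name_prefix}{counter}.{ext.lstrip('.')}"

    return allocate
```

Starting above the highest existing number, from both the directory and the recorded names, is what prevents a resumed run from reusing a filename.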
- run_downloads(df: DataFrame, output_dir: str | Path, output_csv: str | Path, log_file: str | Path, build_candidate_urls: Callable[[str, str], list[str]], request_headers: dict, name_prefix: str = 'img', max_workers: int = 4, retries: int = 3, timeout: int = 90, cool_every: int = 100, cool_seconds: float = 60.0, long_cool_every: int = 1500, long_cool_seconds: float = 300.0, xml_file_col: str = 'xml_file', verbose: bool = True) → DataFrame
Download all images referenced in df, with resume support.
- Parameters:
df – Figure DataFrame with at least an image_url column. Modified in place with the download_status, downloaded_image_name, and downloaded_image_path columns.
output_dir – Directory where image files are saved.
output_csv – CSV written after each batch (safe to resume from).
log_file – Per-URL download log CSV.
build_candidate_urls – (image_url, xml_file) -> [url, ...] Publisher-specific URL builder.
request_headers – HTTP headers for every request (auth keys, User-Agent, etc.).
name_prefix – Image filename prefix (e.g. "alloy_img" or "img").
max_workers – Thread pool size.
retries – Attempts per candidate URL inside download_one().
timeout – Per-request timeout in seconds.
cool_every / cool_seconds – Pause for cool_seconds seconds after every cool_every real downloads.
long_cool_every / long_cool_seconds – Take a longer pause of long_cool_seconds seconds after every long_cool_every real downloads.
xml_file_col – Column in df holding the source XML filename (used by build_candidate_urls for relative-URL resolution).
verbose – Print progress lines.
- Returns:
Updated df (also written to output_csv).
- Return type:
pd.DataFrame
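The two cool-down parameter pairs can be read as a simple schedule over the count of real downloads. The helper below is a sketch of one plausible interpretation using the documented defaults; the function name is hypothetical, and letting the long pause take precedence when both thresholds are hit at once is an assumption, not documented behaviour.

```python
def cooldown_seconds(
    real_downloads,
    cool_every=100,
    cool_seconds=60.0,
    long_cool_every=1500,
    long_cool_seconds=300.0,
):
    """Return how long to pause after the given number of real downloads."""
    if real_downloads <= 0:
        return 0.0
    # Assumption: the long pause wins when both thresholds coincide.
    if long_cool_every > 0 and real_downloads % long_cool_every == 0:
        return long_cool_seconds
    if cool_every > 0 and real_downloads % cool_every == 0:
        return cool_seconds
    return 0.0
```

Under this reading, a run with the defaults pauses for 60 seconds after every 100th real download and for 300 seconds after every 1500th.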