Crop CSV Builder

matmmextract.inference.crop_csv_builder

Bridge between the cropper and the captioner.

The figures CSV identifies each figure by xml_file and figure_id but has no downloaded_image_name column, while the downloaded images are named sequentially: img1.jpg, img2.jpg, …

This module joins them by row order: the N-th row in the figures CSV (counting from 1) corresponds to imgN.jpg, which is how the Elsevier downloader assigned names.
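The row-order rule can be sketched in a few lines. This is only an illustration of the mapping described above, not the module's actual implementation; the helper name names_by_row_order and the sample rows are hypothetical, and plain dicts stand in for the DataFrame rows.

```python
def names_by_row_order(figure_rows):
    """Attach an image stem to each figure row based on its position.

    Hypothetical helper: the first row gets "img1", the second "img2", etc.
    """
    named = []
    for position, row in enumerate(figure_rows, start=1):
        # 1-based position N maps to the stem "imgN" (row 0 -> img1).
        named.append({**row, "downloaded_image_name": f"img{position}"})
    return named


# Made-up sample rows, for illustration only.
figures = [
    {"xml_file": "a.xml", "figure_id": "fig1", "caption": "SEM image"},
    {"xml_file": "a.xml", "figure_id": "fig2", "caption": "XRD pattern"},
]
named = names_by_row_order(figures)
print([row["downloaded_image_name"] for row in named])
```

Because the mapping depends purely on position, it only holds if the figures CSV has not been filtered or re-sorted since the images were downloaded.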

If your downloader CSV DOES have a downloaded_image_name column, pass that CSV as figures_csv and set use_row_order=False.

build_crop_csv(crops_dir: str | Path, figures_csv: str | Path, output_csv: str | Path = 'crops_for_captioning.csv', use_row_order: bool = True, image_name_col: str = 'downloaded_image_name', caption_col: str = 'caption', reference_col: str = 'reference_sentences', verbose: bool = True) → DataFrame

Build a CSV mapping each crop to its figure caption and references.

Parameters:
  • crops_dir – Directory of cropped panel images (output of cropper.run). Files named like img1_A.jpg, img1_single.jpg, img2_B.jpg …

  • figures_csv – Figure-level CSV from the Elsevier/Springer extractor. Must have caption and reference_sentences columns.

  • output_csv – Where to write the resulting CSV.

  • use_row_order – If True (default), join by row position: img1 = row 0, img2 = row 1, etc. Use this when the figures CSV has no downloaded_image_name column (standard extractor output). If False, join on the value of the downloaded_image_name column.

  • image_name_col – Only used when use_row_order=False. Column holding image stems.

  • caption_col / reference_col – Column names for the caption and references in figures_csv.

  • verbose – Print summary.
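To link a crop back to its parent figure, the crop file name has to be reduced to the figure's image stem. The sketch below assumes the panel label is whatever follows the last underscore (img1_A.jpg, img1_single.jpg), which matches the example names listed for crops_dir; figure_stem is a hypothetical helper, and the real module may parse names differently.

```python
from pathlib import Path


def figure_stem(crop_path):
    """Return the parent figure stem for a crop file, e.g. img1_A.jpg -> img1.

    Hypothetical helper; assumes the panel label follows the last underscore.
    """
    stem = Path(crop_path).stem  # drops directory and extension: "img1_A"
    figure, sep, _panel = stem.rpartition("_")
    # No underscore found: treat the whole stem as the figure stem.
    return figure if sep else stem


for name in ["img1_A.jpg", "img1_single.jpg", "img2_B.jpg"]:
    print(name, "->", figure_stem(name))
```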

Returns:

One row per cropped image with columns: downloaded_image_name, caption, reference_sentences

Return type:

pd.DataFrame
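As a rough picture of the result, the sketch below builds one output row per crop by looking up each crop's figure in a name-keyed table, which corresponds to the use_row_order=False case. It is an illustration only: join_crops_to_figures is a hypothetical stand-in, the sample data is made up, and two behaviours are assumptions rather than documented facts (the output downloaded_image_name holds the crop's file stem, and crops without a matching figure are skipped).

```python
import csv
import io


def join_crops_to_figures(crop_names, figure_rows,
                          image_name_col="downloaded_image_name",
                          caption_col="caption",
                          reference_col="reference_sentences"):
    """Return one dict per crop with the caption and references of its figure."""
    by_stem = {row[image_name_col]: row for row in figure_rows}
    out = []
    for crop in sorted(crop_names):
        crop_stem = crop.rsplit(".", 1)[0]         # "img1_A.jpg" -> "img1_A"
        figure = by_stem.get(crop_stem.rsplit("_", 1)[0])  # look up "img1"
        if figure is None:
            continue  # assumption: crops with no matching figure are skipped
        out.append({
            "downloaded_image_name": crop_stem,    # assumption: crop stem
            "caption": figure[caption_col],
            "reference_sentences": figure[reference_col],
        })
    return out


# Made-up sample data, for illustration only.
figure_rows = [
    {"downloaded_image_name": "img1", "caption": "SEM image",
     "reference_sentences": "Fig. 1 shows the morphology."},
]
rows = join_crops_to_figures(["img1_B.jpg", "img1_A.jpg", "img9_A.jpg"], figure_rows)

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["downloaded_image_name", "caption", "reference_sentences"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Every panel of a figure receives the same caption and reference sentences, so a multi-panel figure contributes several rows that differ only in downloaded_image_name.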