Dataset Builder

matmmextract.inference.dataset_builder

Link cropped panel images with their JSON sub-captions to produce the final linked_dataset.csv.

JSON lookup key

Each JSON file is named after the full crop stem:

img11_A.json ← written by captioner for crop img11_A.jpg
img10_single.json ← written by captioner for crop img10_single.jpg

The linker looks up the JSON by full crop stem, then finds the matching panel inside the JSON by the panel letter/key.
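The two-step lookup described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the helper name load_panel is hypothetical, and the JSON layout (a flat mapping from panel key to panel record) is an assumption, since the schema is not specified here.

```python
import json
from pathlib import Path
from typing import Optional


def load_panel(json_dir: Path, crop_path: Path, panel_key: str) -> Optional[dict]:
    """Return the panel record for a crop, or None if the file or panel is missing."""
    # Step 1: the JSON file shares the crop's full stem (img11_A.jpg -> img11_A.json).
    json_path = json_dir / f"{crop_path.stem}.json"
    if not json_path.is_file():
        return None
    data = json.loads(json_path.read_text(encoding="utf-8"))

    # ASSUMPTION: the JSON is a flat mapping from panel key to panel record.
    if not isinstance(data, dict):
        return None

    # Step 2: find the panel by key, ignoring case ("A" and "a" are treated alike).
    wanted = panel_key.lower()
    for key, record in data.items():
        if str(key).lower() == wanted:
            return record
    return None
```

A missing file and a missing panel both return None in this sketch; the real linker appears to count them separately (n_no_json and n_no_panel).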

Filename patterns matched

imgXXXX_single.jpg → looks for imgXXXX_single.json, panel "main"
imgXXXX_single_2.jpg → looks for imgXXXX_single_2.json, panel "main"
imgXXXX_A.jpg → looks for imgXXXX_A.json, panel "a"
imgXXXX_A_2.jpg → looks for imgXXXX_A_2.json, panel "a"
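The mapping from crop filename to JSON name and panel key can be expressed with one regular expression. The sketch below is an assumption about how such matching could work; the regex and the function name parse_crop_stem are illustrative and not taken from the module. It assumes the ID after "img" is alphanumeric and that a panel is a single letter.

```python
import re
from typing import Optional, Tuple

# Illustrative pattern: "img" + ID, then "single" or one panel letter,
# then an optional numeric suffix such as "_2".
_CROP_RE = re.compile(
    r"^img(?P<id>[0-9A-Za-z]+)_(?P<panel>single|[A-Za-z])(?:_(?P<n>\d+))?$"
)


def parse_crop_stem(stem: str) -> Optional[Tuple[str, str]]:
    """Map a crop stem to (json_filename, panel_key), or None if it does not match."""
    match = _CROP_RE.match(stem)
    if match is None:
        return None
    panel = match.group("panel")
    # "single" crops use the "main" panel; lettered crops use the lowercase letter.
    panel_key = "main" if panel == "single" else panel.lower()
    # The JSON file is always named after the full crop stem, suffix included.
    return f"{stem}.json", panel_key
```

Stems that return None here would correspond to the crops counted under n_skipped_pattern, assuming that counter tracks unmatched filenames.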

Output columns

image_filename, visualization_category, visualization_subtype, subcaption, summary
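A downstream consumer may want to confirm that a linked CSV carries these columns before using it. The following is a small sketch using only the standard library; the function name read_linked_rows is hypothetical and the check is not part of the module.

```python
import csv
from typing import Dict, List, TextIO

EXPECTED_COLUMNS = [
    "image_filename",
    "visualization_category",
    "visualization_subtype",
    "subcaption",
    "summary",
]


def read_linked_rows(stream: TextIO) -> List[Dict[str, str]]:
    """Read linked-dataset rows, raising ValueError if a documented column is absent."""
    reader = csv.DictReader(stream)
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)
```

To read a file on disk, open it with open(path, newline="", encoding="utf-8") and pass the handle to the function.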

class BuildResult(n_images: int = 0, n_matched: int = 0, n_no_json: int = 0, n_no_panel: int = 0, n_skipped_pattern: int = 0, n_has_caption: int = 0, output_csv: str = '', log_path: str = '')
log_path: str = ''
n_has_caption: int = 0
n_images: int = 0
n_matched: int = 0
n_no_json: int = 0
n_no_panel: int = 0
n_skipped_pattern: int = 0
output_csv: str = ''
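To show how these counters might be consumed, the sketch below defines a local stand-in with the same fields and defaults, plus a helper that derives a match rate. BuildResultSketch and match_rate are hypothetical names for illustration only. The meaning of each counter is inferred from its name, so treat the ratio as an assumption rather than a documented metric.

```python
from dataclasses import dataclass


@dataclass
class BuildResultSketch:
    """Local stand-in mirroring the documented BuildResult fields and defaults."""
    n_images: int = 0
    n_matched: int = 0
    n_no_json: int = 0
    n_no_panel: int = 0
    n_skipped_pattern: int = 0
    n_has_caption: int = 0
    output_csv: str = ""
    log_path: str = ""


def match_rate(result: BuildResultSketch) -> float:
    """Fraction of images that were matched; 0.0 when no images were seen."""
    if result.n_images == 0:
        return 0.0
    return result.n_matched / result.n_images
```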
build(images_dir: str | Path, json_dir: str | Path, output_csv: str | Path = 'linked_dataset.csv', log_path: str | Path = 'build_dataset.log', verbose: bool = True) → BuildResult

Link cropped images with sub-caption JSON files.

Parameters:
  • images_dir – Directory of cropped panel images (output of cropper.run).

  • json_dir – Directory of per-crop sub-caption JSON files (output of captioner.run or captioner_azure.run). Each JSON file is named after the full crop stem, for example img11_A.json.

  • output_csv – Path for the final linked CSV.

  • log_path – Path for the human-readable build log.

  • verbose – If True, print build statistics to stdout.
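A call using the parameters above might look like the sketch below. It assumes build is a module-level function importable from matmmextract.inference.dataset_builder, as the signature suggests. The wrapper name link_crops and the choice to place both outputs in one directory are illustrative, not part of the package.

```python
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def link_crops(images_dir: PathLike, json_dir: PathLike, out_dir: PathLike):
    """Run the dataset builder and return its BuildResult (hypothetical wrapper)."""
    # ASSUMPTION: build is exposed at module level under this import path.
    # Imported lazily so the sketch can be loaded without the package installed.
    from matmmextract.inference.dataset_builder import build

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    return build(
        images_dir=images_dir,
        json_dir=json_dir,
        output_csv=out / "linked_dataset.csv",
        log_path=out / "build_dataset.log",
        verbose=False,  # rely on the returned counters instead of stdout
    )
```

The returned object exposes the counters listed under BuildResult, so a caller can inspect fields such as n_matched and n_no_json after the run.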