Dataset Builder

matmmextract.inference.dataset_builder

Link cropped panel images with their JSON sub-captions to produce the final linked_dataset.csv.

JSON lookup key

Each JSON file is named after the full crop stem:

    img11_A.json ← written by captioner for crop img11_A.jpg
    img10_single.json ← written by captioner for crop img10_single.jpg

The linker looks up the JSON by full crop stem, then finds the matching panel inside the JSON by the panel letter/key.
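The two-step lookup described here can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name lookup_panel and the status strings are hypothetical, and the JSON layout (a top-level object that maps panel keys to panel records) is an assumption, since the schema is not documented on this page. The "no_json" and "no_panel" statuses are named after the BuildResult counters n_no_json and n_no_panel only for readability.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional, Tuple


def lookup_panel(
    json_dir: Path, crop_stem: str, panel_key: str
) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Return (status, panel_record) for one crop.

    status is "matched", "no_json", or "no_panel" (hypothetical labels).
    """
    # Step 1: find the JSON file by the full crop stem, e.g. img11_A.json.
    json_path = Path(json_dir) / f"{crop_stem}.json"
    if not json_path.is_file():
        return "no_json", None

    with json_path.open(encoding="utf-8") as fh:
        data = json.load(fh)

    # Step 2: find the panel inside the JSON by its key.
    # ASSUMPTION: the top-level object maps panel keys to panel records.
    if not isinstance(data, dict):
        return "no_panel", None
    for key, record in data.items():
        if key.lower() == panel_key.lower() and isinstance(record, dict):
            return "matched", record
    return "no_panel", None
```

Under these assumptions, a crop whose JSON file is missing and a crop whose JSON exists but lacks the requested panel are reported separately, which is the distinction the two counters appear to draw.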

Filename pattern matched

    imgXXXX_single.jpg → looks for imgXXXX_single.json, panel “main”
    imgXXXX_single_2.jpg → looks for imgXXXX_single_2.json, panel “main”
    imgXXXX_A.jpg → looks for imgXXXX_A.json, panel “a”
    imgXXXX_A_2.jpg → looks for imgXXXX_A_2.json, panel “a”
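One way to express this mapping is a regular expression over the crop stem. The sketch below is illustrative only: CROP_STEM_RE and parse_crop_filename are hypothetical names, and it assumes that XXXX stands for digits, that a panel label is a single letter, and that the optional trailing suffix is numeric. The real module may accept a wider set of names.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# ASSUMPTION: "img" + digits, then "single" or one panel letter,
# then an optional numeric suffix such as "_2".
CROP_STEM_RE = re.compile(r"^img\d+_(single|[A-Za-z])(?:_\d+)?$")


def parse_crop_filename(filename: str) -> Optional[Tuple[str, str]]:
    """Return (json_filename, panel_key), or None if the name does not match."""
    stem = Path(filename).stem
    match = CROP_STEM_RE.match(stem)
    if match is None:
        return None
    token = match.group(1)
    # "single" crops map to the panel key "main"; lettered crops map to
    # the lowercase letter.
    panel_key = "main" if token == "single" else token.lower()
    # The JSON file is named after the full crop stem, suffix included.
    return f"{stem}.json", panel_key
```

A filename that returns None here would correspond to a crop skipped for not matching the pattern.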

Output columns

image_filename, visualization_category, visualization_subtype, subcaption, summary
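As a rough illustration of this column layout, the sketch below writes rows with exactly these five headers using the standard csv module. The helper name write_linked_rows is hypothetical, and the handling of missing or extra fields (blank cells and silently dropped keys) is an assumption, not documented behaviour of the builder.

```python
import csv
from typing import Any, Dict, Iterable, TextIO

# Column order as listed in the documentation.
OUTPUT_COLUMNS = [
    "image_filename",
    "visualization_category",
    "visualization_subtype",
    "subcaption",
    "summary",
]


def write_linked_rows(rows: Iterable[Dict[str, Any]], fh: TextIO) -> int:
    """Write a header plus one CSV line per row; return the number of rows."""
    writer = csv.DictWriter(
        fh,
        fieldnames=OUTPUT_COLUMNS,
        restval="",             # ASSUMPTION: missing fields become empty cells
        extrasaction="ignore",  # ASSUMPTION: unknown keys are dropped
    )
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count
```

When writing to a real file rather than an in-memory buffer, open it with newline="" as the csv module documentation recommends.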

class BuildResult(n_images: int = 0, n_matched: int = 0, n_no_json: int = 0, n_no_panel: int = 0, n_skipped_pattern: int = 0, n_has_caption: int = 0, output_csv: str = '', log_path: str = '')

log_path: str = ''

n_has_caption: int = 0

n_images: int = 0

n_matched: int = 0

n_no_json: int = 0

n_no_panel: int = 0

n_skipped_pattern: int = 0

output_csv: str = ''

build(images_dir: str | Path, json_dir: str | Path, output_csv: str | Path = 'linked_dataset.csv', log_path: str | Path = 'build_dataset.log', verbose: bool = True) → BuildResult

Link cropped images with sub-caption JSON files.

Parameters:

images_dir – Directory of cropped panel images (output of cropper.run).
json_dir – Directory of per-crop sub-caption JSON files (output of captioner.run or captioner_azure.run). Each JSON file is named after the crop stem, e.g. img11_A.json.
output_csv – Path for the final linked CSV.
log_path – Path for the human-readable build log.
verbose – Print stats to stdout.
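A call might look like the sketch below. It assumes build is importable as a module-level function from matmmextract.inference.dataset_builder, as the signature above suggests. The wrapper run_build and its out_dir argument are hypothetical, and the printed summary reads only the BuildResult fields listed above.

```python
from pathlib import Path


def run_build(images_dir: Path, json_dir: Path, out_dir: Path):
    """Hypothetical wrapper: link crops to captions and print a short summary."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from matmmextract.inference.dataset_builder import build

    result = build(
        images_dir=images_dir,
        json_dir=json_dir,
        output_csv=out_dir / "linked_dataset.csv",
        log_path=out_dir / "build_dataset.log",
        verbose=False,  # suppress the built-in stats; we print our own below
    )
    print(f"images seen:       {result.n_images}")
    print(f"matched:           {result.n_matched}")
    print(f"missing JSON:      {result.n_no_json}")
    print(f"missing panel:     {result.n_no_panel}")
    print(f"pattern skipped:   {result.n_skipped_pattern}")
    print(f"CSV written to:    {result.output_csv}")
    return result
```

The directories passed in would be the outputs of the cropping and captioning steps named in the parameter descriptions.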