Dataset Builder
matmmextract.inference.dataset_builder
Link cropped panel images with their JSON sub-captions to produce
the final linked_dataset.csv.
JSON lookup key
Each JSON file is named after the full crop stem:
img11_A.json ← written by captioner for crop img11_A.jpg
img10_single.json ← written by captioner for crop img10_single.jpg
The linker looks up the JSON by full crop stem, then finds the matching panel inside the JSON by the panel letter/key.
Filename pattern matched
imgXXXX_single.jpg → looks for imgXXXX_single.json, panel “main”
imgXXXX_single_2.jpg → looks for imgXXXX_single_2.json, panel “main”
imgXXXX_A.jpg → looks for imgXXXX_A.json, panel “a”
imgXXXX_A_2.jpg → looks for imgXXXX_A_2.json, panel “a”
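The mapping above can be sketched with a regular expression. This is a minimal illustration, not the module's actual implementation: the helper name parse_crop_name and the regex are hypothetical, and it assumes that XXXX is a run of digits, that panel keys are single letters, and that the optional trailing suffix is numeric.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Assumed shape: "img" + digits, then "_single" or one panel letter,
# then an optional numeric suffix such as "_2".
_CROP_RE = re.compile(r"^img\d+_(single|[A-Za-z])(?:_\d+)?$")


def parse_crop_name(filename: str) -> Optional[Tuple[str, str]]:
    """Return (json_filename, panel_key), or None if the name does not match.

    Hypothetical helper for illustration only.
    """
    stem = Path(filename).stem
    match = _CROP_RE.match(stem)
    if match is None:
        return None
    token = match.group(1)
    # "single" crops map to the panel key "main"; lettered crops map to
    # the lowercase letter.
    panel = "main" if token == "single" else token.lower()
    # The JSON file is named after the full crop stem, suffix included.
    return f"{stem}.json", panel
```

Under these assumptions, a name that does not fit the pattern returns None, which would correspond to the n_skipped_pattern counter in BuildResult.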
Output columns
image_filename, visualization_category, visualization_subtype, subcaption, summary
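As a rough sketch of what a file with these columns looks like when written with the standard library, the snippet below uses csv.DictWriter. The helper write_rows is hypothetical, and filling missing values with empty strings is an assumption rather than documented behaviour of the builder.

```python
import csv
from typing import Dict, Iterable, TextIO

# Column order as listed in the documentation above.
COLUMNS = [
    "image_filename",
    "visualization_category",
    "visualization_subtype",
    "subcaption",
    "summary",
]


def write_rows(rows: Iterable[Dict[str, str]], fh: TextIO) -> None:
    """Write a header plus one line per row (hypothetical helper)."""
    writer = csv.DictWriter(fh, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        # Assumption: a missing field becomes an empty cell.
        writer.writerow({col: row.get(col, "") for col in COLUMNS})
```

When writing to a real file, open it with newline="" as the csv module documentation recommends, so that line endings are not translated twice.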
- class BuildResult(n_images: int = 0, n_matched: int = 0, n_no_json: int = 0, n_no_panel: int = 0, n_skipped_pattern: int = 0, n_has_caption: int = 0, output_csv: str = '', log_path: str = '')
- log_path: str = ''
- n_has_caption: int = 0
- n_images: int = 0
- n_matched: int = 0
- n_no_json: int = 0
- n_no_panel: int = 0
- n_skipped_pattern: int = 0
- output_csv: str = ''
- build(images_dir: str | Path, json_dir: str | Path, output_csv: str | Path = 'linked_dataset.csv', log_path: str | Path = 'build_dataset.log', verbose: bool = True) → BuildResult
Link cropped images with sub-caption JSON files.
- Parameters:
images_dir – Directory of cropped panel images (output of cropper.run).
json_dir – Directory of per-crop sub-caption JSON files (output of captioner.run or captioner_azure.run). Each JSON is named after the crop stem: img11_A.json.
output_csv – Path for the final linked CSV.
log_path – Path for the human-readable build log.
verbose – Print stats to stdout.
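A possible way to call build with the parameters listed above is sketched below. It assumes build is importable as a module-level function from matmmextract.inference.dataset_builder; the wrapper link_crops and the directory layout are hypothetical. The import sits inside the function so the snippet can be loaded without the package installed.

```python
from pathlib import Path


def link_crops(images_dir: Path, json_dir: Path, out_dir: Path):
    """Run the linker and return its BuildResult (hypothetical wrapper)."""
    # Assumption: build is a module-level function of dataset_builder.
    from matmmextract.inference.dataset_builder import build

    result = build(
        images_dir=images_dir,
        json_dir=json_dir,
        output_csv=out_dir / "linked_dataset.csv",
        log_path=out_dir / "build_dataset.log",
        verbose=False,
    )
    # BuildResult exposes counters such as n_images and n_matched.
    print(f"{result.n_matched} of {result.n_images} crops matched")
    return result
```

The counters n_no_json, n_no_panel and n_skipped_pattern on the returned object are the natural place to look when fewer crops match than expected.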