Cleaner
matmmextract.inference.cleaner
Delete intermediate files and directories after the full pipeline has completed successfully, keeping only what matters:
KEPT
  crops/               cropped panel images
  images/              original downloaded images
  linked_dataset.csv   final output
DELETED
  inference_results/   per-image detection JSONs + _summary.json
  subcaptions/         per-crop captioning JSONs
  output/              all intermediate CSVs
  alloys_elsevier/     fetched Elsevier XMLs
  alloys_springer/     fetched Springer XMLs
  *.txt                processed_dois / failed_dois resume files
- class CleanResult(deleted_dirs: list[str] = <factory>, deleted_files: list[str] = <factory>, kept: list[str] = <factory>, skipped_missing: list[str] = <factory>)
- deleted_dirs: list[str]
- deleted_files: list[str]
- kept: list[str]
- skipped_missing: list[str]
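The `<factory>` defaults in the signature above suggest a dataclass whose list fields are built with `default_factory`. The following is a minimal sketch under that assumption; `CleanResultSketch` is a hypothetical stand-in, not the library's actual class, and it requires Python 3.9+ for the `list[str]` annotations.

```python
from dataclasses import dataclass, field


@dataclass
class CleanResultSketch:
    """Hypothetical stand-in mirroring the documented CleanResult fields."""

    deleted_dirs: list[str] = field(default_factory=list)
    deleted_files: list[str] = field(default_factory=list)
    kept: list[str] = field(default_factory=list)
    skipped_missing: list[str] = field(default_factory=list)


# Each instance receives its own fresh lists, so two results never share state.
first = CleanResultSketch()
second = CleanResultSketch()
first.deleted_dirs.append("output")
```

Using `default_factory` rather than a literal `[]` default avoids the shared-mutable-default pitfall: appending to `first.deleted_dirs` leaves `second.deleted_dirs` empty.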
- clean(base_dir: str | Path = '.', delete_dirs: list[str] | None = None, delete_files: list[str] | None = None, delete_glob_patterns: list[str] | None = None, dry_run: bool = False, verbose: bool = True) → CleanResult
Delete intermediate pipeline artifacts.
- Parameters:
base_dir – Root directory to resolve paths from (default: current directory).
delete_dirs – List of directory names/paths to delete recursively. Defaults to the standard pipeline intermediates.
delete_files – List of specific file paths to delete.
delete_glob_patterns – Glob patterns relative to base_dir, e.g. ["*.txt", "output/*.csv"].
dry_run – Print what would be deleted without actually deleting.
verbose – Print each deleted item.
- Return type:
CleanResult
Examples
>>> from matmmextract.inference.cleaner import clean
>>> clean()  # delete all standard intermediates
>>> clean(dry_run=True)  # preview without deleting
>>> clean(delete_dirs=["output"])  # delete only output/
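To illustrate how `base_dir`, `delete_dirs`, `delete_glob_patterns`, and `dry_run` might interact, here is a self-contained sketch built only on the standard library. It is not the library's implementation: `clean_sketch` and its dictionary return value are hypothetical, and it assumes that a dry run still reports the items it would have deleted.

```python
import shutil
import tempfile
from pathlib import Path


def clean_sketch(base_dir=".", delete_dirs=None, delete_glob_patterns=None, dry_run=False):
    """Hypothetical re-creation of the documented behaviour, not the library code."""
    base = Path(base_dir)
    result = {"deleted_dirs": [], "deleted_files": [], "skipped_missing": []}

    # Directories are resolved relative to base_dir and removed recursively.
    for name in delete_dirs or []:
        target = base / name
        if not target.is_dir():
            result["skipped_missing"].append(str(target))
            continue
        if not dry_run:
            shutil.rmtree(target)
        result["deleted_dirs"].append(str(target))

    # Glob patterns are also expanded relative to base_dir; only files are removed.
    for pattern in delete_glob_patterns or []:
        for match in sorted(base.glob(pattern)):
            if match.is_file():
                if not dry_run:
                    match.unlink()
                result["deleted_files"].append(str(match))
    return result


# Demonstration against a throwaway directory with placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "output").mkdir()
    (root / "output" / "stage1.csv").write_text("a,b\n")
    (root / "processed_dois.txt").write_text("placeholder\n")
    (root / "linked_dataset.csv").write_text("final\n")

    # The preview reports targets but leaves everything on disk.
    preview = clean_sketch(root, ["output", "subcaptions"], ["*.txt"], dry_run=True)
    still_there = (root / "output").exists()

    # The real run removes the same targets.
    actual = clean_sketch(root, ["output", "subcaptions"], ["*.txt"])
    remaining = sorted(p.name for p in root.iterdir())
```

In this sketch the missing `subcaptions/` directory is recorded under `skipped_missing` instead of raising an error, and `linked_dataset.csv` survives because no directory name or pattern matches it. Running with `dry_run=True` first is a cheap way to confirm the target list before anything is removed.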