Cleaner

matmmextract.inference.cleaner

Delete intermediate files and directories after the full pipeline has completed successfully, keeping only the images and the final linked dataset:

KEPT

  • crops/ cropped panel images

  • images/ original downloaded images

  • linked_dataset.csv final output

DELETED

  • inference_results/ per-image detection JSONs + _summary.json

  • subcaptions/ per-crop captioning JSONs

  • output/ all intermediate CSVs

  • alloys_elsevier/ fetched Elsevier XMLs

  • alloys_springer/ fetched Springer XMLs

  • *.txt processed_dois / failed_dois resume files
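Before running the cleaner, it can help to check which of the artifacts listed above actually exist in a run directory. The following is a minimal sketch using only the standard library. The constants `KEPT`, `DELETED_DIRS`, `DELETED_GLOBS` and the function `classify_artifacts` are hypothetical names introduced for illustration; they are not part of matmmextract.

```python
from pathlib import Path

# Names copied from the KEPT / DELETED lists above. The constants themselves
# are illustrative and are not part of the matmmextract API.
KEPT = ["crops", "images", "linked_dataset.csv"]
DELETED_DIRS = [
    "inference_results",
    "subcaptions",
    "output",
    "alloys_elsevier",
    "alloys_springer",
]
DELETED_GLOBS = ["*.txt"]


def classify_artifacts(base_dir="."):
    """Return the paths that exist under base_dir, grouped by expected fate."""
    base = Path(base_dir)
    keep = [base / name for name in KEPT if (base / name).exists()]
    delete = [base / name for name in DELETED_DIRS if (base / name).is_dir()]
    for pattern in DELETED_GLOBS:
        # A pattern such as "*.txt" only matches directly inside base_dir.
        delete.extend(sorted(p for p in base.glob(pattern) if p.is_file()))
    return {"keep": keep, "delete": delete}
```

Calling `classify_artifacts(".")` from a pipeline run directory returns a dictionary with a `"keep"` list and a `"delete"` list, which can be inspected before anything is removed.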

class CleanResult(deleted_dirs: 'list[str]' = <factory>, deleted_files: 'list[str]' = <factory>, kept: 'list[str]' = <factory>, skipped_missing: 'list[str]' = <factory>)
deleted_dirs: list[str]
deleted_files: list[str]
kept: list[str]
skipped_missing: list[str]
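The `<factory>` defaults in the signature are how Python renders dataclass fields declared with `default_factory`, which suggests that `CleanResult` is a dataclass whose four lists start out empty. The sketch below is a stand-in written under that assumption, not the library's source; `CleanResultSketch` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class CleanResultSketch:
    """Stand-in with the same four fields as the documented CleanResult."""

    deleted_dirs: list[str] = field(default_factory=list)
    deleted_files: list[str] = field(default_factory=list)
    kept: list[str] = field(default_factory=list)
    skipped_missing: list[str] = field(default_factory=list)


def summarize(result):
    """Build a one-line summary from the four documented fields."""
    return (
        f"{len(result.deleted_dirs)} dirs and "
        f"{len(result.deleted_files)} files deleted, "
        f"{len(result.kept)} kept, "
        f"{len(result.skipped_missing)} missing"
    )


first = CleanResultSketch()
second = CleanResultSketch()
first.deleted_dirs.append("output")

# default_factory gives every instance its own lists, so `second` is untouched.
print(summarize(first))
print(summarize(second))
```

The same `summarize` helper should work on a real `CleanResult`, since it only reads the four documented attributes.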
clean(base_dir: str | Path = '.', delete_dirs: list[str] | None = None, delete_files: list[str] | None = None, delete_glob_patterns: list[str] | None = None, dry_run: bool = False, verbose: bool = True) → CleanResult

Delete intermediate pipeline artifacts.

Parameters:
  • base_dir – Root directory to resolve paths from (default: current directory).

  • delete_dirs – List of directory names/paths to delete recursively. Defaults to the standard pipeline intermediates.

  • delete_files – List of specific file paths to delete.

  • delete_glob_patterns – Glob patterns relative to base_dir, e.g. ["*.txt", "output/*.csv"].

  • dry_run – If True, print what would be deleted without deleting anything.

  • verbose – Print each deleted item.

Return type:

CleanResult

Examples

>>> from matmmextract.inference.cleaner import clean
>>> clean()                          # delete all standard intermediates
>>> clean(dry_run=True)              # preview without deleting
>>> clean(delete_dirs=["output"])    # delete only output/
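To make the `dry_run` and `delete_glob_patterns` semantics concrete, here is a self-contained sketch of how such a cleaner could behave. It is written against the parameter descriptions above and is not the actual matmmextract implementation; `sketch_clean` and its dictionary return value are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def sketch_clean(base_dir=".", delete_dirs=None, delete_glob_patterns=None,
                 dry_run=False, verbose=True):
    """Illustrative stand-in for clean(); returns a dict of what was touched.

    Unlike the documented clean(), passing None for delete_dirs here means
    "delete no directories" rather than "use the standard intermediates".
    """
    base = Path(base_dir)
    report = {"deleted_dirs": [], "deleted_files": [], "skipped_missing": []}
    prefix = "[dry-run] would delete" if dry_run else "deleted"

    for name in delete_dirs or []:
        target = base / name
        if not target.is_dir():
            # Missing directories are recorded rather than treated as errors.
            report["skipped_missing"].append(str(target))
            continue
        if not dry_run:
            shutil.rmtree(target)
        report["deleted_dirs"].append(str(target))
        if verbose:
            print(f"{prefix} {target}/")

    for pattern in delete_glob_patterns or []:
        # Patterns are resolved relative to base_dir, e.g. "output/*.csv".
        for path in sorted(base.glob(pattern)):
            if not path.is_file():
                continue
            if not dry_run:
                path.unlink()
            report["deleted_files"].append(str(path))
            if verbose:
                print(f"{prefix} {path}")

    return report
```

A cautious workflow is to call the function once with `dry_run=True`, review the printed list, and then repeat the call with the same arguments and `dry_run=False`.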