Pipeline

matmmextract.preprocess.pipeline

Load, deduplicate, and filter Scopus export CSVs down to the set of papers that will be fetched from publisher APIs.

This is the notebook’s first half — everything before the XML download steps — expressed as a callable API.

Notebook cells covered

Load multiple CSVs and concatenate
Inspect publisher counts (bar chart)
Identify and drop duplicate DOIs
Filter by publisher regex (default: Elsevier)
Filter by Open Access type
Save filtered CSVs
Move qualifying XMLs into a working directory (green open access)
Copy CC BY XMLs into a separate directory (post-license-check)
Flatten downloaded image sub-folders into a single flat folder
Prune images that have no matching .txt caption file

Cells NOT included (publisher-specific, live in the elsevier/ subpackage):

The figure extractor cell (→ elsevier/extractor.py)
The CC license diagnostics cell (→ preprocess/cc_license.py)

copy_xmls_by_filename(filenames: Iterable[str], source_dir: str | Path, target_dir: str | Path, verbose: bool = True) → tuple[int, list[str]]

Copy specific XML files (by filename) to target_dir.

Mirrors:

    for fname in cc_by_files:
        shutil.copy2(source_dir / fname, cc_by_xml_dir / fname)

Returns:

copied (int)
missing (list[str] — filenames not found in source_dir)
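
The copy-and-report behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the package's code: the name copy_by_filename_sketch is hypothetical, and creating target_dir on demand is an assumption.

```python
import shutil
from pathlib import Path
from typing import Iterable


def copy_by_filename_sketch(
    filenames: Iterable[str], source_dir: Path, target_dir: Path
) -> tuple[int, list[str]]:
    """Copy the named files and return (copied count, missing filenames)."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # assumption: created if absent
    copied, missing = 0, []
    for fname in filenames:
        src = source_dir / fname
        if src.is_file():
            # copy2 copies the content and also tries to preserve metadata.
            shutil.copy2(src, target_dir / fname)
            copied += 1
        else:
            missing.append(fname)
    return copied, missing
```

Because the files are copied rather than moved, the originals stay in source_dir, which suits a step that only carves out a licensed subset.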

drop_duplicate_dois(df: DataFrame, doi_col: str = 'DOI', keep: str = 'first', verbose: bool = True) → DataFrame

Remove duplicated DOIs, keeping one row per DOI.

Mirrors:

    merged_df.drop_duplicates(subset=["DOI"], keep="first").reset_index(drop=True)

filter_by_publisher(df: DataFrame, pattern: str = '\\bElsevier\\b', publisher_col: str = 'Publisher', case: bool = False, verbose: bool = True) → DataFrame

Keep rows whose publisher matches pattern.

Mirrors:

    merged_df[merged_df["Publisher"].fillna("").str.contains(elsevier_pattern, ...)]
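
To make the effect of the default pattern concrete, here is a small pandas sketch of the mirrored expression. The sample rows are invented, and passing case=False through to str.contains is an assumption about how the case parameter is used.

```python
import pandas as pd

# Invented sample data; only the "Publisher" column matters here.
df = pd.DataFrame(
    {"Publisher": ["Elsevier Ltd", "elsevier B.V.", "Springer", None, "ElsevierX"]}
)

# fillna("") makes missing publishers plain non-matches instead of NaN in the mask.
mask = df["Publisher"].fillna("").str.contains(r"\bElsevier\b", case=False, regex=True)

# Resetting the index is a choice made for this sketch.
elsevier_df = df[mask].reset_index(drop=True)
print(elsevier_df["Publisher"].tolist())
```

The word boundaries (\b) keep "ElsevierX" out, while the case-insensitive match lets "elsevier B.V." through.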

filter_open_access(df: DataFrame, allowed: frozenset[str] | set[str] | list[str] = frozenset({'All Open Access', 'All Open Access; Bronze Open Access', 'All Open Access; Bronze Open Access; Green Open Access', 'All Open Access; Green Open Access'}), oa_col: str = 'Open Access', verbose: bool = True) → DataFrame

Keep rows whose Open Access type is in allowed.

Mirrors:

    green_df = df[df["Open Access"].isin(allowed_open_access)]
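
Since the mirrored expression uses isin, the comparison is on the whole tag string, not on individual tags inside it. A short sketch with invented rows and a shortened allowed set illustrates this:

```python
import pandas as pd

# Shortened allowed set for illustration; the documented default has four entries.
allowed = frozenset({"All Open Access", "All Open Access; Green Open Access"})

df = pd.DataFrame(
    {
        "Open Access": [
            "All Open Access; Green Open Access",
            "All Open Access; Gold Open Access",  # not in the allowed set
            None,                                 # no Open Access tag
        ]
    }
)

# isin() tests exact membership, so a missing value never matches.
kept = df[df["Open Access"].isin(allowed)]
print(kept["Open Access"].tolist())
```

A tag combination that is not listed verbatim in allowed is dropped even if it contains an allowed tag as a substring.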

find_duplicate_dois(df: DataFrame, doi_col: str = 'DOI') → DataFrame

Return all rows whose DOI appears more than once.

Mirrors:

    merged_df[merged_df["DOI"].notna() & merged_df["DOI"].duplicated(keep=False)]
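
The two mirrored pandas expressions for finding and dropping duplicate DOIs behave differently for rows without a DOI. The sketch below uses invented data and plain pandas calls, so it shows the behaviour of those expressions rather than of the package functions themselves.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DOI": ["10.1/a", "10.1/b", "10.1/a", None, None],
        "Title": ["A1", "B", "A2", "X", "Y"],
    }
)

# keep=False flags every member of a duplicate group;
# notna() keeps rows without a DOI out of the report.
dupes = df[df["DOI"].notna() & df["DOI"].duplicated(keep=False)]

# drop_duplicates treats missing values as equal to each other,
# so only the first row without a DOI survives this call.
deduped = df.drop_duplicates(subset=["DOI"], keep="first").reset_index(drop=True)

print(dupes["Title"].tolist())
print(deduped["Title"].tolist())
```

If rows without a DOI need to be preserved, they would have to be set aside before deduplication; whether the package does so is not stated here.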

flatten_images(source_root: str | Path, target_dir: str | Path, limit: int | None = None, image_exts: frozenset[str] = frozenset({'.bmp', '.gif', '.jpeg', '.jpg', '.png', '.tif', '.tiff', '.webp'}), verbose: bool = True) → dict[str, int]

Copy images from nested per-paper folders into one flat directory.

Mirrors the notebook cell that scanned alloy_elsevier_contents sub-folders and copied everything into alloy_images_flat.

Parameters:

source_root – Root whose immediate children are per-paper folders.
target_dir – Flat output directory.
limit – If set, process only the first limit sub-folders (sorted).

Return type:

dict with keys "copied" and "skipped".
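
A rough sketch of the flattening step follows. The function name, the shortened extension set, and in particular the meaning of "skipped" (non-image files and filename collisions) are assumptions made for illustration; the source does not say how collisions are handled.

```python
import shutil
from pathlib import Path
from typing import Optional

# Shortened extension set for brevity (assumption).
IMAGE_EXTS = frozenset({".png", ".jpg", ".jpeg", ".tif", ".tiff"})


def flatten_images_sketch(
    source_root: Path, target_dir: Path, limit: Optional[int] = None
) -> dict[str, int]:
    source_root, target_dir = Path(source_root), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Immediate children of source_root are the per-paper folders.
    folders = sorted(p for p in source_root.iterdir() if p.is_dir())
    if limit is not None:
        folders = folders[:limit]
    stats = {"copied": 0, "skipped": 0}
    for folder in folders:
        for f in sorted(folder.rglob("*")):
            if not f.is_file():
                continue
            dest = target_dir / f.name
            # Assumption: non-images and name collisions both count as skipped.
            if f.suffix.lower() not in IMAGE_EXTS or dest.exists():
                stats["skipped"] += 1
                continue
            shutil.copy2(f, dest)
            stats["copied"] += 1
    return stats
```

In this sketch target_dir should sit outside source_root; otherwise the flat folder would itself be scanned as a per-paper folder.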

load_csvs(filenames: str | Path | Iterable[str | Path], **read_csv_kwargs) → DataFrame

Load one or more Scopus export CSVs and concatenate them.

Parameters:

filenames – A single path or any iterable of paths.
**read_csv_kwargs – Passed to pandas.read_csv().

Returns:

Concatenated data, index reset to 0 … N-1.

Return type:

pd.DataFrame

Examples

>>> df = load_csvs(["Al_alloy_10k.csv", "Ni_alloy_10k.csv"])
>>> df = load_csvs(Path("data").glob("*.csv"))
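
For readers who want to see the load-and-concatenate behaviour without the package installed, a minimal pandas sketch is shown here. The name load_csvs_sketch is hypothetical, and the internals are an assumption based on the description above.

```python
from pathlib import Path
from typing import Iterable, Union

import pandas as pd

PathLike = Union[str, Path]


def load_csvs_sketch(
    filenames: Union[PathLike, Iterable[PathLike]], **read_csv_kwargs
) -> pd.DataFrame:
    # Accept a single path as well as any iterable of paths.
    if isinstance(filenames, (str, Path)):
        filenames = [filenames]
    frames = [pd.read_csv(f, **read_csv_kwargs) for f in filenames]
    # ignore_index=True renumbers the rows 0 .. N-1 across all files.
    return pd.concat(frames, ignore_index=True)
```

Two points worth knowing when passing a glob: the generator is consumed once, and pandas.concat raises a ValueError when the list of frames is empty, so an unmatched pattern fails loudly in this sketch.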

move_xmls_by_doi(dois: Series | list[str], source_dir: str | Path, target_dir: str | Path, verbose: bool = True) → tuple[list[str], list[str]]

Move XML files from source_dir to target_dir for the given DOIs.

Mirrors the notebook cell:

    for doi in dois:
        shutil.move(source_dir / (doi.replace("/", "_") + ".xml"), ...)

Parameters:

dois – Iterable of DOI strings (from green_df["DOI"]).
source_dir – Directory currently holding all XML files.
target_dir – Destination directory (created if it doesn’t exist).

Returns:

moved (list[str] — filenames successfully moved)
missing (list[str] — DOIs whose XML was not found)
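
The DOI-to-filename convention and the moved/missing bookkeeping can be sketched as below. Both function names are hypothetical, and the sketch assumes every DOI is a plain string.

```python
import shutil
from pathlib import Path
from typing import Iterable


def doi_to_xml_name(doi: str) -> str:
    """Map a DOI to a flat filename; '/' cannot appear inside a filename."""
    return doi.replace("/", "_") + ".xml"


def move_by_doi_sketch(
    dois: Iterable[str], source_dir: Path, target_dir: Path
) -> tuple[list[str], list[str]]:
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    moved, missing = [], []
    for doi in dois:
        name = doi_to_xml_name(doi)
        src = source_dir / name
        if src.is_file():
            shutil.move(str(src), str(target_dir / name))
            moved.append(name)   # filenames that were moved
        else:
            missing.append(doi)  # DOIs with no XML on disk
    return moved, missing
```

Note the asymmetry in the return value, which follows the description above: moved holds filenames, while missing holds the original DOIs.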

prune_images_without_captions(image_dir: str | Path, caption_ext: str = '.txt', image_exts: frozenset[str] = frozenset({'.bmp', '.gif', '.jpeg', '.jpg', '.png', '.tif', '.tiff', '.webp'}), dry_run: bool = False, verbose: bool = True) → dict[str, int | list[str]]

Delete image files that have no matching caption file.

Mirrors:

    for img in images:
        if not img.with_suffix(".txt").exists():
            img.unlink()

Parameters:

dry_run – Report what would be deleted without actually deleting.
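
Because this step deletes files, a dry run is a sensible first pass. The sketch below shows one way such a switch could work; the function name, the shortened extension set, and the result keys "count" and "files" are all hypothetical, since the source does not list the keys of the returned dict.

```python
from pathlib import Path

# Shortened extension set for brevity (assumption).
IMAGE_EXTS = frozenset({".png", ".jpg", ".jpeg"})


def prune_sketch(image_dir: Path, caption_ext: str = ".txt", dry_run: bool = False) -> dict:
    # An image is an orphan when no sibling file with the caption extension exists.
    orphans = [
        p
        for p in sorted(Path(image_dir).iterdir())
        if p.is_file()
        and p.suffix.lower() in IMAGE_EXTS
        and not p.with_suffix(caption_ext).exists()
    ]
    if not dry_run:
        for p in orphans:
            p.unlink()
    # Hypothetical result keys; with dry_run=True nothing has been deleted.
    return {"count": len(orphans), "files": [p.name for p in orphans]}
```

Calling it once with dry_run=True and inspecting "files" before the real run guards against pruning a folder whose captions use a different extension.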

publisher_counts(df: DataFrame, top_n: int = 10) → Series

Return a value-counts Series for the top-N publishers.

Mirrors the merged_df["Publisher"].value_counts().head(10) cell.

run(csv_files: str | Path | Iterable[str | Path], publisher_pattern: str = '\\bElsevier\\b', open_access_allowed: frozenset[str] = frozenset({'All Open Access', 'All Open Access; Bronze Open Access', 'All Open Access; Bronze Open Access; Green Open Access', 'All Open Access; Green Open Access'}), output_dir: str | Path = PosixPath('output'), verbose: bool = True) → dict[str, DataFrame]

Run the full CSV preprocessing pipeline.

Load → dedup → filter publisher → filter open-access → save.

Parameters:

csv_files – Input CSV path(s).
publisher_pattern – Regex for publisher filter.
open_access_allowed – Allowed Open Access tag values.
output_dir – Directory for output CSVs.
verbose – Print progress.

Returns:

Dict with keys:

"raw" – concatenated raw data
"deduped" – after DOI dedup
"publisher" – after publisher filter (saved as publisher_filtered.csv)
"open_access" – after OA filter (saved as open_access_filtered.csv)

Return type:

dict[str, DataFrame]

Examples

>>> result = run(["Al_alloy_10k.csv", "Ni_alloy_10k.csv"])
>>> green_df = result["open_access"]
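
The sequence dedup → filter publisher → filter open-access → save can also be written out as one function over plain pandas calls. This is a sketch under assumptions, not the package's run: run_sketch is a hypothetical name, it takes an already loaded DataFrame instead of CSV paths, and the single-entry allowed set is shortened for illustration.

```python
from pathlib import Path

import pandas as pd


def run_sketch(
    raw: pd.DataFrame,
    output_dir: Path,
    pattern: str = r"\bElsevier\b",
    allowed=frozenset({"All Open Access"}),  # shortened set (assumption)
) -> dict:
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: one row per DOI.
    deduped = raw.drop_duplicates(subset=["DOI"], keep="first").reset_index(drop=True)

    # Step 2: publisher regex filter (missing publishers never match).
    is_match = deduped["Publisher"].fillna("").str.contains(pattern, case=False, regex=True)
    publisher = deduped[is_match].reset_index(drop=True)

    # Step 3: exact-match Open Access filter.
    open_access = publisher[publisher["Open Access"].isin(allowed)].reset_index(drop=True)

    # Step 4: save both filtered tables without the index column.
    publisher.to_csv(output_dir / "publisher_filtered.csv", index=False)
    open_access.to_csv(output_dir / "open_access_filtered.csv", index=False)

    return {"raw": raw, "deduped": deduped, "publisher": publisher, "open_access": open_access}
```

Returning every intermediate table, as the documented return value does, makes it easy to check how many rows each stage removed.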

save_csv(df: DataFrame, path: str | Path, verbose: bool = True) → Path

Save df to CSV (no index). Creates parent directories if needed.