Pipeline

matmmextract.preprocess.pipeline

Load, deduplicate, and filter Scopus export CSVs down to the set of papers that will be fetched from publisher APIs.

This module exposes the first half of the notebook (everything before the XML download steps) as a callable API.

Notebook cells covered

  1. Load multiple CSVs and concatenate

  2. Inspect publisher counts (bar chart)

  3. Identify and drop duplicate DOIs

  4. Filter by publisher regex (default: Elsevier)

  5. Filter by Open Access type

  6. Save filtered CSVs

  7. Move qualifying XMLs into a working directory (green open access)

  8. Copy CC BY XMLs into a separate directory (post-license-check)

  9. Flatten downloaded image sub-folders into a single flat folder

  10. Prune images that have no matching .txt caption file

Cells NOT included (publisher-specific, live in elsevier/ subpackage):

  • The figure extractor cell (→ elsevier/extractor.py)

  • The CC license diagnostics cell (→ preprocess/cc_license.py)

copy_xmls_by_filename(filenames: Iterable[str], source_dir: str | Path, target_dir: str | Path, verbose: bool = True) → tuple[int, list[str]]

Copy specific XML files (by filename) to target_dir.

Mirrors:

for fname in cc_by_files: shutil.copy2(source_dir/fname, cc_by_xml_dir/fname)

Returns:

  • copied (int)

  • missing (list[str] — filenames not found in source_dir)
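
The copy-and-report behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the helper name `copy_named_files`, the directory names, and the choice to create the target directory are all assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def copy_named_files(filenames, source_dir, target_dir):
    """Hypothetical stand-in: copy the listed files, report those not found."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # assumption: target is created
    copied, missing = 0, []
    for fname in filenames:
        src = source_dir / fname
        if src.is_file():
            shutil.copy2(src, target_dir / fname)  # copy2 also copies file metadata
            copied += 1
        else:
            missing.append(fname)
    return copied, missing


# Throwaway directories so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "xml_all"
    src.mkdir()
    (src / "a.xml").write_text("<doc/>")
    copied, missing = copy_named_files(["a.xml", "b.xml"], src, Path(tmp) / "cc_by")
```

Returning the missing names rather than raising lets a caller log gaps and continue with the files that were found.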

drop_duplicate_dois(df: DataFrame, doi_col: str = 'DOI', keep: str = 'first', verbose: bool = True) → DataFrame

Remove duplicated DOIs, keeping one row per DOI.

Mirrors:

merged_df.drop_duplicates(subset=["DOI"], keep="first").reset_index(drop=True)

filter_by_publisher(df: DataFrame, pattern: str = '\\bElsevier\\b', publisher_col: str = 'Publisher', case: bool = False, verbose: bool = True) → DataFrame

Keep rows whose publisher matches pattern.

Mirrors:

merged_df[merged_df["Publisher"].fillna("").str.contains(elsevier_pattern, ...)]
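
To show what the word-boundary pattern does and does not match, here is a small sketch of the mirrored expression on toy data. The publisher strings are made up, and passing `case=False, regex=True` to `str.contains` is an assumption based on the documented `case` default; the elided arguments in the notebook expression are not known.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame(
    {
        "Publisher": [
            "Elsevier Ltd",
            "ELSEVIER B.V.",
            "Springer",
            None,
            "Elsevierish Press",
        ]
    }
)

pattern = r"\bElsevier\b"  # the documented default pattern

# fillna("") turns missing publishers into empty strings so the mask stays boolean.
mask = df["Publisher"].fillna("").str.contains(pattern, case=False, regex=True)
kept = df[mask]
kept_names = kept["Publisher"].tolist()
```

The `\b` anchors are what exclude a name such as "Elsevierish Press", where "Elsevier" is only a prefix of a longer word.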

filter_open_access(df: DataFrame, allowed: frozenset[str] | set[str] | list[str] = frozenset({'All Open Access', 'All Open Access; Bronze Open Access', 'All Open Access; Bronze Open Access; Green Open Access', 'All Open Access; Green Open Access'}), oa_col: str = 'Open Access', verbose: bool = True) → DataFrame

Keep rows whose Open Access type is in allowed.

Mirrors:

green_df = df[df["Open Access"].isin(allowed_open_access)]
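
Because `isin` compares whole strings, a tag combination that is not listed in `allowed` is dropped even if it contains an allowed substring. The sketch below illustrates this on toy rows; the "Gold Open Access" value is only an example of an unlisted combination.

```python
import pandas as pd

# The documented default set of allowed tag combinations.
allowed = frozenset(
    {
        "All Open Access",
        "All Open Access; Bronze Open Access",
        "All Open Access; Bronze Open Access; Green Open Access",
        "All Open Access; Green Open Access",
    }
)

df = pd.DataFrame(
    {
        "Open Access": [
            "All Open Access; Green Open Access",  # listed: kept
            "All Open Access; Gold Open Access",   # not listed: dropped
            None,                                  # missing: dropped
            "All Open Access",                     # listed: kept
        ]
    }
)

green_df = df[df["Open Access"].isin(allowed)]
kept_index = green_df.index.tolist()
```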

find_duplicate_dois(df: DataFrame, doi_col: str = 'DOI') → DataFrame

Return all rows whose DOI appears more than once.

Mirrors:

merged_df[merged_df["DOI"].notna() & merged_df["DOI"].duplicated(keep=False)]
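
The two DOI expressions mirrored in this module (the inspection mask here and the `drop_duplicates` call under drop_duplicate_dois) behave as follows on toy data. The DOIs and titles are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DOI": ["10.1/a", "10.1/b", "10.1/a", None, None],
        "Title": ["A1", "B", "A2", "no-doi-1", "no-doi-2"],
    }
)

# Inspection: every row whose non-missing DOI occurs more than once.
dupes = df[df["DOI"].notna() & df["DOI"].duplicated(keep=False)]
dupe_titles = dupes["Title"].tolist()

# Removal: one row per DOI, keeping the first occurrence.
deduped = df.drop_duplicates(subset=["DOI"], keep="first").reset_index(drop=True)
deduped_titles = deduped["Title"].tolist()
```

Note that pandas treats missing values as equal when looking for duplicates, so the plain `drop_duplicates` expression also collapses the two rows without a DOI into one. Whether drop_duplicate_dois guards against that is not stated in this page.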

flatten_images(source_root: str | Path, target_dir: str | Path, limit: int | None = None, image_exts: frozenset[str] = frozenset({'.bmp', '.gif', '.jpeg', '.jpg', '.png', '.tif', '.tiff', '.webp'}), verbose: bool = True) → dict[str, int]

Copy images from nested per-paper folders into one flat directory.

Mirrors the notebook cell that scanned alloy_elsevier_contents sub-folders and copied everything into alloy_images_flat.

Parameters:
  • source_root – Root whose immediate children are per-paper folders.

  • target_dir – Flat output directory.

  • limit – If set, process only the first limit sub-folders (sorted).

Return type:

dict with keys "copied" and "skipped".
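
A rough sketch of the flattening step, using only the standard library. The helper name `flatten`, the folder names, and in particular the meaning of "skipped" (here: non-image files and file names already present in the target) are assumptions; the page does not say how the real function handles name clashes.

```python
import shutil
import tempfile
from pathlib import Path

IMAGE_EXTS = frozenset(
    {".bmp", ".gif", ".jpeg", ".jpg", ".png", ".tif", ".tiff", ".webp"}
)


def flatten(source_root, target_dir, limit=None):
    """Hypothetical sketch: copy images from per-paper folders into one folder."""
    source_root, target_dir = Path(source_root), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    folders = sorted(p for p in source_root.iterdir() if p.is_dir())
    if limit is not None:
        folders = folders[:limit]  # first `limit` sub-folders, sorted by name
    stats = {"copied": 0, "skipped": 0}
    for folder in folders:
        for f in sorted(folder.iterdir()):
            dest = target_dir / f.name
            # Assumption: non-images and name clashes both count as "skipped".
            if not f.is_file() or f.suffix.lower() not in IMAGE_EXTS or dest.exists():
                stats["skipped"] += 1
                continue
            shutil.copy2(f, dest)
            stats["copied"] += 1
    return stats


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "contents"
    layout = {"paper_a": ["fig1.png", "notes.md"], "paper_b": ["fig1.png", "fig2.JPG"]}
    for paper, names in layout.items():
        (root / paper).mkdir(parents=True)
        for name in names:
            (root / paper / name).write_bytes(b"x")
    stats = flatten(root, Path(tmp) / "flat")
```

In this toy layout both papers contain a `fig1.png`, which is why one of them ends up skipped. A real pipeline would need file names that are unique across papers, or a renaming scheme, to avoid losing figures this way.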

load_csvs(filenames: str | Path | Iterable[str | Path], **read_csv_kwargs) → DataFrame

Load one or more Scopus export CSVs and concatenate them.

Parameters:
  • filenames – A single path or any iterable of paths.

  • **read_csv_kwargs – Passed to pandas.read_csv().

Returns:

Concatenated data, index reset to 0 … N-1.

Return type:

pd.DataFrame

Examples

>>> df = load_csvs(["Al_alloy_10k.csv", "Ni_alloy_10k.csv"])
>>> df = load_csvs(Path("data").glob("*.csv"))

move_xmls_by_doi(dois: Series | list[str], source_dir: str | Path, target_dir: str | Path, verbose: bool = True) → tuple[list[str], list[str]]

Move XML files from source_dir to target_dir for the given DOIs.

Mirrors the notebook cell:

for doi in dois: shutil.move(source_dir / (doi.replace("/", "_") + ".xml"), ...)

Parameters:
  • dois – Iterable of DOI strings (from green_df["DOI"]).

  • source_dir – Directory currently holding all XML files.

  • target_dir – Destination directory (created if it doesn’t exist).

Returns:

  • moved (list[str] — filenames successfully moved)

  • missing (list[str] — DOIs whose XML was not found)
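
The DOI-to-filename mapping and the moved/missing bookkeeping can be sketched as below. The helper name `move_by_doi`, the directory names, and the DOIs are hypothetical; only the `"/"` to `"_"` replacement and the `.xml` suffix come from the notebook expression above.

```python
import shutil
import tempfile
from pathlib import Path


def move_by_doi(dois, source_dir, target_dir):
    """Hypothetical sketch of the DOI-based move described above."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # created if it doesn't exist
    moved, missing = [], []
    for doi in dois:
        # "10.1016/x" becomes "10.1016_x.xml"; "/" is not valid in a file name.
        fname = doi.replace("/", "_") + ".xml"
        src = source_dir / fname
        if src.is_file():
            shutil.move(str(src), str(target_dir / fname))
            moved.append(fname)
        else:
            missing.append(doi)
    return moved, missing


with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "all_xml"
    src_dir.mkdir()
    (src_dir / "10.1016_j.example.2020.01.xml").write_text("<doc/>")
    moved, missing = move_by_doi(
        ["10.1016/j.example.2020.01", "10.1016/j.absent.2021.02"],  # made-up DOIs
        src_dir,
        Path(tmp) / "green_xml",
    )
```

In this sketch a missing DOI value (NaN) in a pandas Series would fail at `doi.replace`, so the Series would need `.dropna()` first.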

prune_images_without_captions(image_dir: str | Path, caption_ext: str = '.txt', image_exts: frozenset[str] = frozenset({'.bmp', '.gif', '.jpeg', '.jpg', '.png', '.tif', '.tiff', '.webp'}), dry_run: bool = False, verbose: bool = True) → dict[str, int | list[str]]

Delete image files that have no matching caption file.

Mirrors:

for img in images:
    if not img.with_suffix(".txt").exists():
        img.unlink()

Parameters:
  • dry_run – If True, report what would be deleted without actually deleting.
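
Since this step deletes files, running it with dry_run first is a cheap safeguard. The sketch below shows one way such a flag might work; the helper name `prune`, its return value (a list of file names), and the shortened extension set are assumptions, not the package's actual behaviour.

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = frozenset({".png", ".jpg", ".jpeg"})  # shortened for the sketch


def prune(image_dir, caption_ext=".txt", dry_run=False):
    """Hypothetical sketch: remove images lacking a same-stem caption file."""
    orphans = [
        img
        for img in sorted(Path(image_dir).iterdir())
        if img.suffix.lower() in IMAGE_EXTS
        and not img.with_suffix(caption_ext).exists()
    ]
    if not dry_run:
        for img in orphans:
            img.unlink()  # irreversible, hence the dry_run option
    return [img.name for img in orphans]


with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    for name in ["fig1.png", "fig1.txt", "fig2.png"]:
        (d / name).write_bytes(b"x")
    would_delete = prune(d, dry_run=True)       # report only
    still_there = (d / "fig2.png").exists()     # nothing removed yet
    deleted = prune(d)                          # actually delete
    remaining = sorted(p.name for p in d.iterdir())
```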

publisher_counts(df: DataFrame, top_n: int = 10) → Series

Return a value-counts Series for the top-N publishers.

Mirrors the merged_df["Publisher"].value_counts().head(10) cell.

run(csv_files: str | Path | Iterable[str | Path], publisher_pattern: str = '\\bElsevier\\b', open_access_allowed: frozenset[str] = frozenset({'All Open Access', 'All Open Access; Bronze Open Access', 'All Open Access; Bronze Open Access; Green Open Access', 'All Open Access; Green Open Access'}), output_dir: str | Path = PosixPath('output'), verbose: bool = True) → dict[str, DataFrame]

Run the full CSV preprocessing pipeline.

Load → dedup → filter publisher → filter open-access → save.

Parameters:
  • csv_files – Input CSV path(s).

  • publisher_pattern – Regex for publisher filter.

  • open_access_allowed – Allowed Open Access tag values.

  • output_dir – Directory for output CSVs.

  • verbose – Print progress.

Returns:

A dict with keys:

  • "raw" – concatenated raw data

  • "deduped" – after DOI dedup

  • "publisher" – after publisher filter (saved as publisher_filtered.csv)

  • "open_access" – after OA filter (saved as open_access_filtered.csv)

Return type:

dict[str, DataFrame]

Examples

>>> result = run(["Al_alloy_10k.csv", "Ni_alloy_10k.csv"])
>>> green_df = result["open_access"]

save_csv(df: DataFrame, path: str | Path, verbose: bool = True) → Path

Save df to CSV (no index). Creates parent directories if needed.
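
A minimal sketch of the "create parents, write without index" behaviour, assuming pandas' `to_csv` is used underneath. The helper name `save` and the output path are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save(df, path):
    """Hypothetical sketch: write df to CSV, creating parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)  # index=False omits the row-number column
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = save(pd.DataFrame({"DOI": ["10.1/a"]}), Path(tmp) / "nested" / "dir" / "out.csv")
    header = out.read_text().splitlines()[0]
```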