Pipeline
matmmextract.preprocess.pipeline
Load, deduplicate, and filter Scopus export CSVs down to the set of papers that will be fetched from publisher APIs.
This is the notebook’s first half — everything before the XML download steps — expressed as a callable API.
Notebook cells covered
Load multiple CSVs and concatenate
Inspect publisher counts (bar chart)
Identify and drop duplicate DOIs
Filter by publisher regex (default: Elsevier)
Filter by Open Access type
Save filtered CSVs
Move qualifying XMLs into a working directory (green open access)
Copy CC BY XMLs into a separate directory (post-license-check)
Flatten downloaded image sub-folders into a single flat folder
Prune images that have no matching .txt caption file
Cells NOT included (publisher-specific, live in elsevier/ subpackage):
- The figure extractor cell (→ elsevier/extractor.py)
- The CC license diagnostics cell (→ preprocess/cc_license.py)
- copy_xmls_by_filename(filenames: Iterable[str], source_dir: str | Path, target_dir: str | Path, verbose: bool = True) → tuple[int, list[str]]
Copy specific XML files (by filename) to target_dir.
- Mirrors:
for fname in cc_by_files: shutil.copy2(source_dir/fname, cc_by_xml_dir/fname)
- Returns:
copied (int)
missing (list[str] — filenames not found in source_dir)
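The copy-and-report behaviour described above can be sketched with the standard library alone. This is an illustration, not the library's implementation: copy_by_filename_sketch is a hypothetical name, and creating target_dir on demand is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def copy_by_filename_sketch(filenames, source_dir, target_dir):
    """Copy the named files; return (copied_count, missing_names)."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # assumption: created if absent
    copied, missing = 0, []
    for fname in filenames:
        src = source_dir / fname
        if src.is_file():
            shutil.copy2(src, target_dir / fname)  # copy2 also copies file metadata
            copied += 1
        else:
            missing.append(fname)  # report instead of raising
    return copied, missing


# Throwaway demo: one file exists, one does not.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "xml_all"
    src_dir.mkdir()
    (src_dir / "a.xml").write_text("<doc/>")
    copied, missing = copy_by_filename_sketch(
        ["a.xml", "b.xml"], src_dir, Path(tmp) / "xml_cc_by"
    )
    copy_exists = (Path(tmp) / "xml_cc_by" / "a.xml").exists()
    source_kept = (src_dir / "a.xml").exists()  # copying leaves the source in place
```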
- drop_duplicate_dois(df: DataFrame, doi_col: str = 'DOI', keep: str = 'first', verbose: bool = True) → DataFrame
Remove duplicated DOIs, keeping one row per DOI.
- Mirrors:
merged_df.drop_duplicates(subset=["DOI"], keep="first").reset_index(drop=True)
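A small illustration of the mirrored pandas call on made-up data. Note that plain drop_duplicates also treats missing DOIs as equal to each other, so several rows without a DOI collapse to one; whether drop_duplicate_dois handles that case differently is not stated here.

```python
import pandas as pd

# Made-up rows: one repeated DOI and two rows with no DOI at all.
merged_df = pd.DataFrame(
    {
        "DOI": ["10.1/a", "10.1/b", "10.1/a", None, None],
        "Title": ["A (first)", "B", "A (repeat)", "No DOI 1", "No DOI 2"],
    }
)

# Keep the first row for each DOI and renumber the index from 0.
deduped = merged_df.drop_duplicates(subset=["DOI"], keep="first").reset_index(drop=True)

kept_titles = list(deduped["Title"])
```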
- filter_by_publisher(df: DataFrame, pattern: str = '\\bElsevier\\b', publisher_col: str = 'Publisher', case: bool = False, verbose: bool = True) → DataFrame
Keep rows whose publisher matches pattern.
- Mirrors:
merged_df[merged_df["Publisher"].fillna("").str.contains(elsevier_pattern, ...)]
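To show what the word-boundary pattern does and does not match, here is a sketch of the mirrored expression on made-up publisher names. The arguments elided as "..." above are assumed to be case=False and regex=True, in line with the function's case default.

```python
import pandas as pd

merged_df = pd.DataFrame(
    {"Publisher": ["Elsevier Ltd", "ELSEVIER B.V.", "Springer", None, "Elsevierscience"]}
)

elsevier_pattern = r"\bElsevier\b"  # raw-string form of the default pattern

# fillna("") turns missing publishers into empty strings so the mask is all booleans.
mask = merged_df["Publisher"].fillna("").str.contains(
    elsevier_pattern, case=False, regex=True
)
elsevier_df = merged_df[mask]

matched = list(elsevier_df["Publisher"])
```

The \b anchors require a word boundary on both sides, so "Elsevierscience" is rejected while the differently cased "ELSEVIER B.V." is kept.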
- filter_open_access(df: DataFrame, allowed: frozenset[str] | set[str] | list[str] = frozenset({'All Open Access', 'All Open Access; Bronze Open Access', 'All Open Access; Bronze Open Access; Green Open Access', 'All Open Access; Green Open Access'}), oa_col: str = 'Open Access', verbose: bool = True) → DataFrame
Keep rows whose Open Access type is in allowed.
- Mirrors:
green_df = df[df["Open Access"].isin(allowed_open_access)]
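Because isin compares whole cell values, a row is kept only when its full Open Access string is one of the allowed values; a tag combination outside the set is dropped even if it contains "All Open Access". A sketch on made-up rows, using the default allowed values from the signature:

```python
import pandas as pd

allowed_open_access = {
    "All Open Access",
    "All Open Access; Bronze Open Access",
    "All Open Access; Bronze Open Access; Green Open Access",
    "All Open Access; Green Open Access",
}

df = pd.DataFrame(
    {
        "DOI": ["10.1/a", "10.1/b", "10.1/c"],
        "Open Access": [
            "All Open Access; Green Open Access",  # exact match: kept
            "All Open Access; Gold Open Access",   # not in the set: dropped
            None,                                  # missing value: dropped
        ],
    }
)

green_df = df[df["Open Access"].isin(allowed_open_access)]
kept_dois = list(green_df["DOI"])
```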
- find_duplicate_dois(df: DataFrame, doi_col: str = 'DOI') → DataFrame
Return all rows whose DOI appears more than once.
- Mirrors:
merged_df[merged_df["DOI"].notna() & merged_df["DOI"].duplicated(keep=False)]
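The mirrored expression combines two masks: duplicated(keep=False) flags every member of a repeated group (not only the later copies), and notna() excludes rows with no DOI, which would otherwise count as duplicates of each other. A sketch on made-up data:

```python
import pandas as pd

merged_df = pd.DataFrame({"DOI": ["10.1/a", "10.1/b", "10.1/a", None, None]})

# keep=False marks all occurrences of a repeated value, including the first.
is_repeated = merged_df["DOI"].duplicated(keep=False)
has_doi = merged_df["DOI"].notna()

dupes = merged_df[has_doi & is_repeated]
dupe_rows = list(dupes.index)
```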
- flatten_images(source_root: str | Path, target_dir: str | Path, limit: int | None = None, image_exts: frozenset[str] = frozenset({'.bmp', '.gif', '.jpeg', '.jpg', '.png', '.tif', '.tiff', '.webp'}), verbose: bool = True) → dict[str, int]
Copy images from nested per-paper folders into one flat directory.
Mirrors the notebook cell that scanned alloy_elsevier_contents sub-folders and copied everything into alloy_images_flat.
- Parameters:
source_root – Root whose immediate children are per-paper folders.
target_dir – Flat output directory.
limit – If set, process only the first limit sub-folders (sorted).
- Return type:
dict with keys "copied" and "skipped".
- load_csvs(filenames: str | Path | Iterable[str | Path], **read_csv_kwargs) → DataFrame
Load one or more Scopus export CSVs and concatenate them.
- Parameters:
filenames – A single path or any iterable of paths.
**read_csv_kwargs – Passed to pandas.read_csv().
- Returns:
Concatenated data, index reset to 0 … N-1.
- Return type:
pd.DataFrame
Examples
>>> df = load_csvs(["Al_alloy_10k.csv", "Ni_alloy_10k.csv"])
>>> df = load_csvs(Path("data").glob("*.csv"))
- move_xmls_by_doi(dois: Series | list[str], source_dir: str | Path, target_dir: str | Path, verbose: bool = True) → tuple[list[str], list[str]]
Move XML files from source_dir to target_dir for the given DOIs.
- Mirrors the notebook cell:
for doi in dois: shutil.move(source_dir / (doi.replace("/", "_") + ".xml"), ...)
- Parameters:
dois – Iterable of DOI strings (from green_df["DOI"]).
source_dir – Directory currently holding all XML files.
target_dir – Destination directory (created if it doesn’t exist).
- Returns:
moved (list[str] — filenames successfully moved)
missing (list[str] — DOIs whose XML was not found)
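The DOI-to-filename mapping and the moved/missing bookkeeping can be sketched as follows. This is a hedged illustration: move_by_doi_sketch is a hypothetical name, and the only mapping assumed is the one shown in the mirrored cell (replace "/" with "_" and append ".xml").

```python
import shutil
import tempfile
from pathlib import Path


def move_by_doi_sketch(dois, source_dir, target_dir):
    """Move <doi with '/' replaced by '_'>.xml files; return (moved, missing)."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # created if it doesn't exist
    moved, missing = [], []
    for doi in dois:
        # Build the whole filename first, then join it to the directory.
        fname = doi.replace("/", "_") + ".xml"
        src = source_dir / fname
        if src.is_file():
            shutil.move(str(src), str(target_dir / fname))
            moved.append(fname)   # filenames successfully moved
        else:
            missing.append(doi)   # DOIs whose XML was not found
    return moved, missing


# Throwaway demo with made-up DOIs.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "xml_all"
    src_dir.mkdir()
    (src_dir / "10.1016_j.x.1.xml").write_text("<doc/>")
    moved, missing = move_by_doi_sketch(
        ["10.1016/j.x.1", "10.1016/j.x.2"], src_dir, Path(tmp) / "xml_green"
    )
    gone_from_source = not (src_dir / "10.1016_j.x.1.xml").exists()
    arrived = (Path(tmp) / "xml_green" / "10.1016_j.x.1.xml").exists()
```

Unlike the copy helper, a move leaves nothing behind in source_dir, so running it twice on the same DOIs reports them as missing the second time.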
- prune_images_without_captions(image_dir: str | Path, caption_ext: str = '.txt', image_exts: frozenset[str] = frozenset({'.bmp', '.gif', '.jpeg', '.jpg', '.png', '.tif', '.tiff', '.webp'}), dry_run: bool = False, verbose: bool = True) → dict[str, int | list[str]]
Delete image files that have no matching caption file.
- Mirrors:
for img in images:
    if not img.with_suffix(".txt").exists():
        img.unlink()
- Parameters:
dry_run – Report what would be deleted without actually deleting.
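A sketch of the dry-run idea, since deletion is irreversible: list the orphaned images first, then delete only on a second call. prune_sketch is a hypothetical helper that returns a plain list of filenames; the real function returns a dict whose keys are not documented here, and the extension set below is shortened for the demo.

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = frozenset({".jpg", ".jpeg", ".png"})  # shortened set for the demo


def prune_sketch(image_dir, caption_ext=".txt", image_exts=IMAGE_EXTS, dry_run=False):
    """Return names of images lacking a caption; delete them unless dry_run."""
    image_dir = Path(image_dir)
    orphans = []
    # sorted() materialises the listing before any file is removed.
    for img in sorted(image_dir.iterdir()):
        if not (img.is_file() and img.suffix.lower() in image_exts):
            continue
        if not img.with_suffix(caption_ext).exists():
            orphans.append(img.name)
            if not dry_run:
                img.unlink()
    return orphans


with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "fig1.png").write_bytes(b"")
    (d / "fig1.txt").write_text("caption for fig1")
    (d / "fig2.png").write_bytes(b"")  # no caption file

    preview = prune_sketch(d, dry_run=True)   # reports only
    still_there = (d / "fig2.png").exists()
    deleted = prune_sketch(d)                 # actually deletes
    remaining = sorted(p.name for p in d.iterdir())
```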
- publisher_counts(df: DataFrame, top_n: int = 10) → Series
Return a value-counts Series for the top-N publishers.
Mirrors the merged_df["Publisher"].value_counts().head(10) cell.
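For illustration, the mirrored call on made-up data, with head(2) standing in for top_n=2. value_counts sorts by descending count, so head keeps the most frequent publishers.

```python
import pandas as pd

merged_df = pd.DataFrame(
    {"Publisher": ["Elsevier", "Springer", "Elsevier", "Wiley", "Elsevier", "Springer"]}
)

# Counts per publisher, most frequent first, truncated to the top 2.
top = merged_df["Publisher"].value_counts().head(2)
top_as_dict = top.to_dict()
```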
- run(csv_files: str | Path | Iterable[str | Path], publisher_pattern: str = '\\bElsevier\\b', open_access_allowed: frozenset[str] = frozenset({'All Open Access', 'All Open Access; Bronze Open Access', 'All Open Access; Bronze Open Access; Green Open Access', 'All Open Access; Green Open Access'}), output_dir: str | Path = PosixPath('output'), verbose: bool = True) → dict[str, DataFrame]
Run the full CSV preprocessing pipeline.
Load → dedup → filter publisher → filter open-access → save.
- Parameters:
csv_files – Input CSV path(s).
publisher_pattern – Regex for publisher filter.
open_access_allowed – Allowed Open Access tag values.
output_dir – Directory for output CSVs.
verbose – Print progress.
- Returns:
dict with keys:
"raw" – concatenated raw data
"deduped" – after DOI dedup
"publisher" – after publisher filter (saved as publisher_filtered.csv)
"open_access" – after OA filter (saved as open_access_filtered.csv)
- Return type:
dict[str, DataFrame]
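The stage order (load → dedup → filter publisher → filter open-access → save) can be sketched end to end in plain pandas. This is an assumption-laden outline rather than the actual run() body: run_sketch is a hypothetical name, the allowed set is a subset of the defaults, and the input CSV is made up. Only the two output filenames come from the description above.

```python
import tempfile
from pathlib import Path

import pandas as pd

ALLOWED = {"All Open Access", "All Open Access; Green Open Access"}  # subset of defaults


def run_sketch(csv_files, output_dir, pattern=r"\bElsevier\b"):
    """Chain the four stages and save the two filtered tables."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    raw = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)
    deduped = raw.drop_duplicates(subset=["DOI"], keep="first").reset_index(drop=True)
    publisher = deduped[deduped["Publisher"].fillna("").str.contains(pattern, case=False)]
    open_access = publisher[publisher["Open Access"].isin(ALLOWED)]

    publisher.to_csv(output_dir / "publisher_filtered.csv", index=False)
    open_access.to_csv(output_dir / "open_access_filtered.csv", index=False)
    return {"raw": raw, "deduped": deduped, "publisher": publisher, "open_access": open_access}


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "export.csv"
    csv_path.write_text(
        "DOI,Publisher,Open Access\n"
        "10.1/a,Elsevier Ltd,All Open Access; Green Open Access\n"
        "10.1/a,Elsevier Ltd,All Open Access; Green Open Access\n"
        "10.1/b,Springer,All Open Access\n"
        "10.1/c,Elsevier B.V.,\n"
    )
    result = run_sketch([csv_path], Path(tmp) / "output")
    sizes = {name: len(frame) for name, frame in result.items()}
    saved = sorted(p.name for p in (Path(tmp) / "output").iterdir())
```

Each stage only narrows the previous one, so the row counts in sizes are non-increasing from "raw" to "open_access", which makes a quick sanity check on real exports.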
Examples
>>> result = run(["Al_alloy_10k.csv", "Ni_alloy_10k.csv"])
>>> green_df = result["open_access"]