CC License
matmmextract.preprocess.cc_license
Scan Elsevier XML files to identify their CC license type, then filter figure rows down to CC BY only (with at least one reference sentence).
This module exposes the notebook's CC diagnostics cell as a callable API.
Notebook cells covered
- analyse_file(): detect CC license from a single XML
- The main scan loop → scan_directory()
- CC BY filter on figures → filter_figures_cc_by()
- Copy CC BY XMLs → lives in preprocess.pipeline.copy_xmls_by_filename
Detection strategies (in order):
1. <oa:userLicense> / <oa:openAccessInformation> (most reliable)
2. <prism:copyright> / <dc:rights>
3. <ce:copyright> and any tag whose name contains “copyright”
4. <license> / <ali:license_ref> (JATS style)
5. Free-text scan for creativecommons.org URLs
6. © Elsevier without any CC marker → Subscription/Copyright
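As an illustration of the last two fallbacks (strategies 5 and 6), the following is a minimal sketch, not the module's actual implementation. The regular expression, the helper name label_from_text, and the exact fallback order are assumptions made for this example. Only the license labels are taken from this page.

```python
import re

# Hypothetical pattern: captures the license slug in URLs such as
# https://creativecommons.org/licenses/by-nc-nd/4.0/
CC_URL_RE = re.compile(r"creativecommons\.org/licenses/([a-z\-]+)", re.IGNORECASE)


def label_from_text(xml_text: str) -> str:
    """Return a license label from free text alone (illustrative helper)."""
    match = CC_URL_RE.search(xml_text)
    if match:
        # "by-nc-nd" -> "CC BY-NC-ND"
        slug = match.group(1).lower().strip("-")
        return "CC " + slug.upper()
    lowered = xml_text.lower()
    if "creativecommons.org" in lowered:
        # A CC marker is present but the license type cannot be read from it.
        return "CC (version unknown)"
    if "©" in xml_text and "elsevier" in lowered:
        # Copyright statement without any CC marker.
        return "Subscription/Copyright"
    return "UNKNOWN"
```

In a real scan, the structured tags (strategies 1 to 4) would be checked first, and this text-based fallback would run only when none of them yields a license.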
- REDISTRIB_RIGHTS: dict[str, str] = {
    'CC (version unknown)': '⚠️ CC present but type unclear — check manually',
    'CC BY': '✅ Full reuse incl. commercial — safe for dataset',
    'CC BY-NC': '⚠️ Non-commercial only — dataset must be NC too',
    'CC BY-NC-ND': '❌ No derivatives + non-commercial',
    'CC BY-NC-SA': '⚠️ Non-commercial + share-alike',
    'CC BY-ND': '❌ No derivatives — cannot redistribute image dataset',
    'CC BY-SA': '✅ Reuse allowed (share-alike) — safe for dataset',
    'Elsevier Open Archive': '❌ Read-only — not suitable for redistribution',
    'Elsevier User License': '❌ Restricted reuse — not suitable for redistribution',
    'Subscription/Copyright': '❌ No open license — cannot reuse',
    'UNKNOWN': '❓ Could not detect license — check manually',
}
Human-readable redistribution rights for each license label.
- analyse_file(xml_path: str | Path) → dict
Detect the CC license of a single Elsevier XML file.
- Parameters:
xml_path – Path to the XML file.
- Returns:
dict with keys file, license, license_raw, detection_path, doi, title, notes
- Return type:
dict
- filter_figures_cc_by(figures_df: DataFrame, cc_summary_df: DataFrame, require_references: bool = True, output_csv: str | Path | None = None, verbose: bool = True) → DataFrame
Keep only figures from CC BY papers that have reference sentences.
- Mirrors the notebook cells:
cc_by_df = cc_df[cc_df["license"] == "CC BY"]
filtered = figures_df[figures_df["xml_file"].isin(cc_by_files) & has_ref]
- Parameters:
figures_df – Output of extract_all().
cc_summary_df – Output of scan_directory() (the df return value).
require_references – If True (default), also require num_references > 0 OR a non-empty reference_sentences value.
output_csv – If provided, save the filtered DataFrame here.
verbose – Print counts.
- Returns:
Filtered figure rows.
- Return type:
pd.DataFrame
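The filtering logic described above can be sketched with plain pandas. This is a hedged illustration on toy data, not the library's code. The function name filter_cc_by_sketch is hypothetical, and matching figures_df["xml_file"] against the file column of the summary is an assumption based on the column names listed on this page.

```python
import pandas as pd


def filter_cc_by_sketch(figures_df, cc_summary_df, require_references=True):
    """Illustrative re-implementation of the CC BY figure filter."""
    # Files whose detected license is exactly "CC BY".
    is_cc_by = cc_summary_df["license"] == "CC BY"
    cc_by_files = set(cc_summary_df.loc[is_cc_by, "file"])

    keep = figures_df["xml_file"].isin(cc_by_files)
    if require_references:
        # Either a positive reference count OR a non-empty sentence string.
        has_count = figures_df["num_references"].fillna(0) > 0
        sentences = figures_df["reference_sentences"].fillna("").astype(str)
        has_text = sentences.str.strip() != ""
        keep = keep & (has_count | has_text)
    return figures_df[keep].reset_index(drop=True)


# Toy inputs (made-up rows, for illustration only).
figures = pd.DataFrame({
    "xml_file": ["a.xml", "a.xml", "b.xml"],
    "num_references": [2, 0, 3],
    "reference_sentences": ["Fig. 1 shows the setup.", "", "See Fig. 2."],
})
cc_summary = pd.DataFrame({
    "file": ["a.xml", "b.xml"],
    "license": ["CC BY", "CC BY-NC"],
})

filtered = filter_cc_by_sketch(figures, cc_summary)
unfiltered_refs = filter_cc_by_sketch(figures, cc_summary, require_references=False)
```

Here only the first row of a.xml survives the default filter: the second row has no references, and b.xml is CC BY-NC rather than CC BY.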
- scan_directory(xml_dir: str | Path, output_txt: str | Path | None = None, output_csv: str | Path | None = None, verbose: bool = True) → tuple[DataFrame, dict[str, int]]
Scan every XML file in xml_dir and return license information.
- Parameters:
xml_dir – Directory containing Elsevier XML files.
output_txt – Optional path for the full human-readable report.
output_csv – Optional path for the machine-readable per-file CSV.
verbose – Print per-file progress.
- Returns:
df (pd.DataFrame) – One row per file with columns: file, license, license_raw, detection_path, doi, title, reuse_rights
tally (dict[str, int]) – {license_label: count}
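To make the shape of the two return values concrete, here is a small sketch of how a reuse_rights column and a tally could be derived from per-file results. It assumes reuse_rights is a lookup of the license label in REDISTRIB_RIGHTS, which this page does not state explicitly. The RIGHTS_SUBSET mapping uses abbreviated stand-in strings, and the rows are made up.

```python
from collections import Counter

import pandas as pd

# Abbreviated, hypothetical stand-in for REDISTRIB_RIGHTS.
RIGHTS_SUBSET = {
    "CC BY": "safe for dataset",
    "CC BY-NC-ND": "no derivatives, non-commercial",
}

# Made-up per-file results, one dict per scanned XML.
records = [
    {"file": "a.xml", "license": "CC BY"},
    {"file": "b.xml", "license": "CC BY-NC-ND"},
    {"file": "c.xml", "license": "CC BY"},
]

df = pd.DataFrame(records)
# Look up the rights text per license; unmapped labels get a generic note.
df["reuse_rights"] = df["license"].map(RIGHTS_SUBSET).fillna("check manually")

# {license_label: count}, the same shape as the documented tally.
tally = dict(Counter(df["license"]))
```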