CC License

matmmextract.preprocess.cc_license

Scan Elsevier XML files to identify their CC license type, then filter figure rows down to CC BY only (with at least one reference sentence).

This module exposes the notebook’s CC diagnostics cell as a callable API.

Notebook cells covered

  • analyse_file() — detect CC license from a single XML

  • The main scan loop → scan_directory()

  • CC BY filter on figures → filter_figures_cc_by()

  • Copy CC BY XMLs → lives in preprocess.pipeline.copy_xmls_by_filename

Detection strategies (in order):

  1. <oa:userLicense> / <oa:openAccessInformation> (most reliable)

  2. <prism:copyright> / <dc:rights>

  3. <ce:copyright> and any tag whose name contains “copyright”

  4. <license> / <ali:license_ref> (JATS style)

  5. Free-text scan for creativecommons.org URLs

  6. © Elsevier without any CC marker → Subscription/Copyright
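As a rough illustration of the last two strategies, the sketch below maps raw text to one of the license labels used on this page. It is a simplified assumption, not the module's actual logic: the regular expression and the helper name label_from_text are hypothetical, and the real detector checks the structured tags first.

```python
import re

# Hypothetical pattern for this sketch: captures the license code that
# follows "/licenses/" in a Creative Commons URL (e.g. "by" or "by-nc-nd").
_CC_URL = re.compile(
    r"creativecommons\.org/licenses/([a-z]+(?:-[a-z]+)*)", re.IGNORECASE
)


def label_from_text(text):
    """Return a license label from free text (illustrative fallback only)."""
    match = _CC_URL.search(text)
    if match:
        # "by-nc-nd" becomes "CC BY-NC-ND"
        return "CC " + match.group(1).upper()
    lowered = text.lower()
    if "creativecommons.org" in lowered:
        # A CC marker is present but the license code cannot be read.
        return "CC (version unknown)"
    if "elsevier" in lowered and ("©" in text or "copyright" in lowered):
        # Elsevier copyright statement with no CC marker at all.
        return "Subscription/Copyright"
    return "UNKNOWN"
```

Checking the URL before the copyright statement matters because open-access articles commonly carry both a copyright line and a CC URL.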

REDISTRIB_RIGHTS: dict[str, str] = {
    'CC (version unknown)': '⚠️  CC present but type unclear check manually',
    'CC BY': '✅  Full reuse incl. commercial safe for dataset',
    'CC BY-NC': '⚠️  Non-commercial only dataset must be NC too',
    'CC BY-NC-ND': '❌  No derivatives + non-commercial',
    'CC BY-NC-SA': '⚠️  Non-commercial + share-alike',
    'CC BY-ND': '❌  No derivatives cannot redistribute image dataset',
    'CC BY-SA': '✅  Reuse allowed (share-alike) safe for dataset',
    'Elsevier Open Archive': '❌  Read-only not suitable for redistribution',
    'Elsevier User License': '❌  Restricted reuse not suitable for redistribution',
    'Subscription/Copyright': '❌  No open license cannot reuse',
    'UNKNOWN': '❓  Could not detect license check manually',
}

Human-readable redistribution rights for each license label.
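One way to use such a mapping programmatically is to key off the leading status marker. The sketch below works on a three-entry excerpt copied from the listing above. The helper name is_dataset_safe and the convention of testing for a leading "✅" are assumptions made for illustration, not part of the documented API.

```python
# Excerpt of the mapping for illustration; the full dict lives in the module.
RIGHTS_EXCERPT = {
    "CC BY": "✅  Full reuse incl. commercial safe for dataset",
    "CC BY-ND": "❌  No derivatives cannot redistribute image dataset",
    "UNKNOWN": "❓  Could not detect license check manually",
}


def is_dataset_safe(label, rights=RIGHTS_EXCERPT):
    """Return True only when the rights text starts with the green check mark."""
    # Labels missing from the mapping are treated as not safe.
    return rights.get(label, "").startswith("✅")
```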

analyse_file(xml_path: str | Path) → dict

Detect the CC license of a single Elsevier XML file.

Parameters:

xml_path – Path to the XML file.

Returns:

A dict with the keys file, license, license_raw, detection_path, doi, title, notes.

Return type:

dict
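To make the shape of the returned record concrete, here is a minimal sketch that implements only the first detection strategy with the standard library. The function analyse_file_sketch, the way parse errors are reported in notes, and the matching of tags by local name are all assumptions for illustration. The real analyse_file may behave differently.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def analyse_file_sketch(xml_path):
    """Return a record with the documented keys (strategy 1 only)."""
    xml_path = Path(xml_path)
    record = {
        "file": xml_path.name,
        "license": "UNKNOWN",
        "license_raw": "",
        "detection_path": "",
        "doi": "",
        "title": "",
        "notes": "",
    }
    try:
        root = ET.parse(xml_path).getroot()
    except ET.ParseError as exc:
        # Assumption: unreadable files are reported rather than raised.
        record["notes"] = f"XML parse error: {exc}"
        return record
    for elem in root.iter():
        # ElementTree stores tags as "{namespace}local"; keep the local part.
        local = elem.tag.rsplit("}", 1)[-1]
        text = (elem.text or "").strip()
        if local == "userLicense" and text:
            record["license_raw"] = text
            record["detection_path"] = "oa:userLicense"
            if "creativecommons.org/licenses/by/" in text.lower():
                record["license"] = "CC BY"
            break
    return record
```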

filter_figures_cc_by(figures_df: DataFrame, cc_summary_df: DataFrame, require_references: bool = True, output_csv: str | Path | None = None, verbose: bool = True) → DataFrame

Keep only figures from CC BY papers that have reference sentences.

Mirrors the notebook cells:

    cc_by_df = cc_df[cc_df["license"] == "CC BY"]
    filtered = figures_df[figures_df["xml_file"].isin(cc_by_files) & has_ref]

Parameters:
  • figures_df – Output of extract_all().

  • cc_summary_df – Output of scan_directory() (the df return value).

  • require_references – If True (default), also require num_references > 0 OR a non-empty reference_sentences value.

  • output_csv – If provided, save the filtered DataFrame here.

  • verbose – Print counts.

Returns:

Filtered figure rows.

Return type:

pd.DataFrame
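The filter described above can be sketched in pandas as follows. This is an approximation under stated assumptions: the function name filter_cc_by_sketch is hypothetical, the file column of the summary is assumed to hold the same values as the xml_file column of the figures table, and the CSV output and verbose printing are omitted.

```python
import pandas as pd


def filter_cc_by_sketch(figures_df, cc_summary_df, require_references=True):
    """Keep figure rows whose source XML is CC BY (and, optionally, referenced)."""
    # Files whose detected license label is exactly "CC BY".
    is_cc_by = cc_summary_df["license"] == "CC BY"
    cc_by_files = cc_summary_df.loc[is_cc_by, "file"].tolist()

    mask = figures_df["xml_file"].isin(cc_by_files)
    if require_references:
        # Either a positive reference count or non-empty reference text.
        has_count = figures_df["num_references"].fillna(0) > 0
        ref_text = figures_df["reference_sentences"].fillna("").astype(str)
        has_text = ref_text.str.strip() != ""
        mask = mask & (has_count | has_text)
    return figures_df[mask].reset_index(drop=True)
```

Comparing the label for exact equality with "CC BY" excludes the NC, ND and SA variants, which matches the "CC BY only" intent stated at the top of this page.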

scan_directory(xml_dir: str | Path, output_txt: str | Path | None = None, output_csv: str | Path | None = None, verbose: bool = True) → tuple[DataFrame, dict[str, int]]

Scan every XML file in xml_dir and return license information.

Parameters:
  • xml_dir – Directory containing Elsevier XML files.

  • output_txt – Optional path for the full human-readable report.

  • output_csv – Optional path for the machine-readable per-file CSV.

  • verbose – Print per-file progress.

Returns:

  • df (pd.DataFrame) – One row per file with columns: file, license, license_raw, detection_path, doi, title, reuse_rights

  • tally (dict[str, int]) – {license_label: count}
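The scan-and-tally pattern behind these return values can be sketched with the standard library alone. Everything here is illustrative: scan_directory_sketch and its detect parameter are hypothetical names, the "*.xml" glob is an assumption about how files are selected, and report writing is left out. Passing the resulting rows to pd.DataFrame would give a table comparable to df.

```python
from collections import Counter
from pathlib import Path


def scan_directory_sketch(xml_dir, detect):
    """Apply detect(path) to each XML file; return (rows, tally)."""
    rows = []
    # Sort for a stable, reproducible row order.
    for path in sorted(Path(xml_dir).glob("*.xml")):
        rows.append({"file": path.name, "license": detect(path)})
    # {license_label: count}, mirroring the documented tally.
    tally = dict(Counter(row["license"] for row in rows))
    return rows, tally
```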