OpenAlex Fetch
matmmextract.openalex.fetcher
Fetch paper metadata from the OpenAlex API, filtered by publisher, license, and optional keyword/topic constraints.
Why OpenAlex instead of Scopus CSVs?
Scopus exports require manual download and have no license metadata.
OpenAlex is free, has an API, exposes CC license data directly, and
covers >250M works. The output CSV is intentionally shaped to match
what the rest of the pipeline expects (DOI, Publisher,
Open Access, Title columns).
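As a rough illustration of what "shaped to match" could mean, the sketch below flattens one OpenAlex-style work record into a row with the four columns named above. The column names come from the text; the mapping from OpenAlex fields (`doi`, `title`, `open_access.is_oa`, `primary_location.source.host_organization_name`) is an assumption, and `work_to_row` and `rows_to_csv` are hypothetical helpers, not functions of this module.

```python
import csv
import io

# Column names taken from the description above; the field mapping is assumed.
COLUMNS = ["DOI", "Publisher", "Open Access", "Title"]


def work_to_row(work: dict) -> dict:
    """Flatten one OpenAlex-style work record into a Scopus-like row."""
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    open_access = work.get("open_access") or {}
    doi = work.get("doi") or ""
    return {
        # OpenAlex reports DOIs as full URLs; this sketch keeps the bare DOI.
        "DOI": doi.removeprefix("https://doi.org/"),
        "Publisher": source.get("host_organization_name") or "",
        "Open Access": bool(open_access.get("is_oa")),
        "Title": work.get("title") or "",
    }


def rows_to_csv(works: list) -> str:
    """Serialize a list of work records to CSV text with the columns above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for work in works:
        writer.writerow(work_to_row(work))
    return buffer.getvalue()


# Hand-written placeholder record, not real API output.
sample = {
    "doi": "https://doi.org/10.1000/example",
    "title": "Example title",
    "open_access": {"is_oa": True},
    "primary_location": {"source": {"host_organization_name": "Example Publisher"}},
}
```

Missing or null fields fall back to empty strings, so a sparse record still produces a complete row.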
Key filters supported
- publisher: publisher display name substring (e.g. "Elsevier")
- license_: OA license type: "cc-by", "cc-by-nc", etc. Pass None to skip license filtering.
- is_oa: restrict to open-access works (default True)
- from_year: publication year lower bound
- to_year: publication year upper bound
- keywords: list of keyword strings (ANDed together as a concept search)
- topics: list of OpenAlex topic/concept IDs or display names
OpenAlex API reference: https://docs.openalex.org/api-entities/works
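To make the filter list more concrete, here is a minimal sketch of how some of these options could be combined into an OpenAlex `filter` query value, where commas mean AND and `|` means OR within one field. `build_filter` is a hypothetical helper; the field names (`open_access.is_oa`, `primary_location.license`, `from_publication_date`, `to_publication_date`) are OpenAlex works filters, but whether matmmextract uses these exact fields is an assumption. Publisher, keyword, and topic filters are left out because their mapping is not shown here.

```python
from __future__ import annotations


def build_filter(
    license_: str | list[str] | None = "cc-by",
    is_oa: bool = True,
    from_year: int | None = None,
    to_year: int | None = None,
) -> str:
    """Build a comma-separated (ANDed) OpenAlex filter string (sketch only)."""
    parts = []
    if is_oa:
        parts.append("open_access.is_oa:true")
    if license_ is not None:
        # A single string or a list; several values are ORed with "|".
        licenses = [license_] if isinstance(license_, str) else list(license_)
        parts.append("primary_location.license:" + "|".join(licenses))
    if from_year is not None:
        parts.append(f"from_publication_date:{from_year}-01-01")
    if to_year is not None:
        parts.append(f"to_publication_date:{to_year}-12-31")
    return ",".join(parts)
```

Passing `license_=None` simply omits the license clause, which mirrors the "skip license filtering" behaviour described above.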
- class FetchResult(df: 'pd.DataFrame' = <factory>, total_fetched: 'int' = 0, output_csv: 'str' = '')
  - df: DataFrame
  - output_csv: str = ''
  - total_fetched: int = 0
- fetch(publisher: str | None = None, license_: str | list[str] | None = 'cc-by', is_oa: bool = True, from_year: int | None = None, to_year: int | None = None, keywords: list[str] | None = None, topics: list[str] | None = None, max_results: int = 1000, output_csv: str | Path | None = None, email: str | None = None, api_key: str | None = None, per_page: int = 200, verbose: bool = True) FetchResult
Fetch works from OpenAlex matching the given filters.
- Parameters:
  - publisher – Publisher display name substring, e.g. "Elsevier", "Springer", "Wiley". Case-insensitive partial match.
  - license_ – OA license filter. Pass a single string ("cc-by") or a list to accept multiple licenses (["cc-by", "cc-by-nc"]). Pass None to skip license filtering.
  - is_oa – Restrict to open-access works (default True).
  - from_year / to_year – Publication year range (inclusive).
  - keywords – List of concept/keyword strings to filter by (ANDed), e.g. ["alloy", "microstructure"].
  - topics – List of OpenAlex concept display names or IDs to filter by, e.g. ["Materials Science", "C2780799074"]. Applied as an additional concepts.display_name.search filter.
  - max_results – Maximum number of works to return.
  - output_csv – If provided, save the DataFrame as CSV at this path.
  - email – Your email address for the OpenAlex polite pool (appended to User-Agent).
  - api_key – Free OpenAlex API key (get one at https://openalex.org/settings/api). Required for reliable access; passed as the ?api_key= query parameter.
  - per_page – Results per API page (max 200).
  - verbose – Print progress.
- Returns:
  A FetchResult whose .df has the same column structure as a Scopus CSV export, so it can be passed directly to filter_by_publisher().
- Return type:
  FetchResult
Examples
>>> from matmmextract.openalex.fetcher import fetch
>>> result = fetch(publisher="Elsevier", license_="cc-by",
...                keywords=["titanium alloy"], max_results=500)
>>> df = result.df
>>> print(df[["DOI", "Title", "License"]].head())

>>> # Springer, any OA license, materials science, 2020-2024
>>> result = fetch(
...     publisher="Springer",
...     license_=None,
...     keywords=["nickel alloy", "microstructure"],
...     from_year=2020, to_year=2024,
...     max_results=2000,
... )
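The per_page and max_results parameters imply a paged loop over the API. The sketch below shows one way such a loop could work, assuming OpenAlex cursor paging (start with cursor "*", then follow `meta.next_cursor` until it is empty). `paginate` and `make_fake_fetch_page` are hypothetical names; the fake page function stands in for a real HTTP call so the example runs offline, and none of this is taken from the module's actual implementation.

```python
from typing import Callable, Iterator


def paginate(
    fetch_page: Callable[[str, int], dict],
    max_results: int = 1000,
    per_page: int = 200,
) -> Iterator[dict]:
    """Yield up to max_results works, following next_cursor between pages."""
    per_page = min(per_page, 200)  # the API caps the page size at 200
    cursor = "*"                   # assumed starting cursor for cursor paging
    yielded = 0
    while cursor and yielded < max_results:
        page = fetch_page(cursor, per_page)
        results = page.get("results", [])
        if not results:
            break
        for work in results:
            yield work
            yielded += 1
            if yielded >= max_results:
                return
        cursor = page.get("meta", {}).get("next_cursor")


def make_fake_fetch_page(total: int) -> Callable[[str, int], dict]:
    """Return an offline stand-in for an HTTP page request (illustrative only)."""
    items = [{"id": i} for i in range(total)]

    def fetch_page(cursor: str, per_page: int) -> dict:
        start = 0 if cursor == "*" else int(cursor)
        end = start + per_page
        next_cursor = str(end) if end < total else None
        return {"results": items[start:end], "meta": {"next_cursor": next_cursor}}

    return fetch_page
```

In real use, `fetch_page` would issue the request with the filter string, page size, and cursor, and return the decoded JSON. Keeping it as a parameter lets the stopping logic be tested without network access.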
- fetch_elsevier(license_: str | list[str] | None = 'cc-by', keywords: list[str] | None = None, from_year: int | None = None, to_year: int | None = None, max_results: int = 1000, output_csv: str | Path | None = None, email: str | None = None, api_key: str | None = None, verbose: bool = True) FetchResult
Fetch Elsevier works. Shorthand for fetch(publisher="Elsevier", ...).
- fetch_springer(license_: str | list[str] | None = 'cc-by', keywords: list[str] | None = None, from_year: int | None = None, to_year: int | None = None, max_results: int = 1000, output_csv: str | Path | None = None, email: str | None = None, api_key: str | None = None, verbose: bool = True) FetchResult
Fetch Springer works. Shorthand for fetch(publisher="Springer", ...).
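The shorthand relationship can be pictured as a thin wrapper that fixes the publisher argument and forwards the rest. The sketch below uses a stand-in `fetch` that only echoes its arguments, so it runs without the package or network access; it is an illustration of the pattern, not the module's source. Note that the shorthand signatures above do not list is_oa, topics, or per_page, so those options are only reachable through fetch() itself.

```python
from __future__ import annotations


def fetch(
    publisher: str | None = None,
    license_: str | list[str] | None = "cc-by",
    keywords: list[str] | None = None,
    max_results: int = 1000,
) -> dict:
    """Stand-in for the real fetch(); it only echoes the arguments it received."""
    return {
        "publisher": publisher,
        "license_": license_,
        "keywords": keywords,
        "max_results": max_results,
    }


def fetch_elsevier(
    license_: str | list[str] | None = "cc-by",
    keywords: list[str] | None = None,
    max_results: int = 1000,
) -> dict:
    """Shorthand sketch: the same call with the publisher fixed to "Elsevier"."""
    return fetch(
        publisher="Elsevier",
        license_=license_,
        keywords=keywords,
        max_results=max_results,
    )
```

For a publisher without a dedicated shorthand, such as "Wiley", the equivalent is to call fetch(publisher="Wiley", ...) directly.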