OpenAlex Fetch

matmmextract.openalex.fetcher

Fetch paper metadata from the OpenAlex API, filtered by publisher, license, and optional keyword/topic constraints.

Why OpenAlex instead of Scopus CSVs?

Scopus exports require manual download and have no license metadata. OpenAlex is free, has an API, exposes CC license data directly, and covers >250M works. The output CSV is intentionally shaped to match what the rest of the pipeline expects (DOI, Publisher, Open Access, Title columns).

Key filters supported

  • publisher — publisher display name substring (e.g. "Elsevier")

  • license — OA license type: "cc-by", "cc-by-nc", etc.

    Pass None to skip license filtering.

  • is_oa — restrict to open-access works (default True)

  • from_year — publication year lower bound

  • to_year — publication year upper bound

  • keywords — list of keyword strings (ANDed together as concept search)

  • topics — list of OpenAlex topic/concept IDs or display names

OpenAlex API reference: https://docs.openalex.org/api-entities/works
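
As a rough illustration of how the filters listed above could translate into an OpenAlex `filter` query value, here is a minimal sketch. The helper name `build_filter` is hypothetical and is not part of this module. The filter keys used (`open_access.is_oa`, `best_oa_location.license`, `from_publication_date`, `to_publication_date`) are OpenAlex works filters, but whether the fetcher uses exactly these keys is an assumption. Publisher, keyword, and topic filters are left out because their mapping is not specified here.

```python
from __future__ import annotations


def build_filter(
    license_: str | list[str] | None = "cc-by",
    is_oa: bool = True,
    from_year: int | None = None,
    to_year: int | None = None,
) -> str:
    """Assemble a comma-separated OpenAlex filter value (illustrative only)."""
    parts: list[str] = []
    if is_oa:
        parts.append("open_access.is_oa:true")
    if license_ is not None:
        # Accept one string or a list; OpenAlex ORs values joined with "|".
        licenses = [license_] if isinstance(license_, str) else list(license_)
        parts.append("best_oa_location.license:" + "|".join(licenses))
    if from_year is not None:
        parts.append(f"from_publication_date:{from_year}-01-01")
    if to_year is not None:
        parts.append(f"to_publication_date:{to_year}-12-31")
    # OpenAlex ANDs the comma-separated conditions together.
    return ",".join(parts)


print(build_filter(license_=["cc-by", "cc-by-nc"], from_year=2020, to_year=2024))
```

Passing `license_=None` simply omits the license condition, which mirrors the "skip license filtering" behaviour described above.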

class FetchResult(df: 'pd.DataFrame' = <factory>, total_fetched: 'int' = 0, output_csv: 'str' = '')
df: DataFrame
output_csv: str = ''
total_fetched: int = 0
fetch(publisher: str | None = None, license_: str | list[str] | None = 'cc-by', is_oa: bool = True, from_year: int | None = None, to_year: int | None = None, keywords: list[str] | None = None, topics: list[str] | None = None, max_results: int = 1000, output_csv: str | Path | None = None, email: str | None = None, api_key: str | None = None, per_page: int = 200, verbose: bool = True) -> FetchResult

Fetch works from OpenAlex matching the given filters.

Parameters:
  • publisher – Publisher display name substring, e.g. "Elsevier", "Springer", "Wiley". Case-insensitive partial match.

  • license_ – OA license filter. Pass a single string ("cc-by") or a list to accept multiple licenses (["cc-by", "cc-by-nc"]). Pass None to skip license filtering.

  • is_oa – Restrict to open-access works (default True).

  • from_year / to_year – Publication year range (inclusive).

  • keywords – List of concept/keyword strings to filter by (ANDed), e.g. ["alloy", "microstructure"].

  • topics – List of OpenAlex concept display names or IDs to filter by, e.g. ["Materials Science", "C2780799074"]. Applied as an additional concepts.display_name.search filter.

  • max_results – Maximum number of works to return.

  • output_csv – If provided, save the DataFrame as CSV.

  • email – Your email address for the OpenAlex polite pool (appended to User-Agent).

  • api_key – Free OpenAlex API key (get one at https://openalex.org/settings/api). Required for reliable access; passed as the ?api_key= query parameter.

  • per_page – Results per API page (max 200).

  • verbose – Print progress.
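
To make the interaction between max_results, per_page, email, and api_key concrete, the following sketch plans the number of pages and builds the query parameters and headers for one request. Both helper names (`plan_requests`, `build_request`) are hypothetical, and the exact User-Agent format is an assumption. `per-page` and `cursor` are OpenAlex paging parameters; whether this module uses cursor paging is not stated above.

```python
from __future__ import annotations

import math


def plan_requests(max_results: int = 1000, per_page: int = 200) -> tuple[int, int]:
    """Return (effective page size, number of pages needed)."""
    page_size = max(1, min(per_page, 200))  # documented cap of 200 per page
    pages = math.ceil(max_results / page_size)
    return page_size, pages


def build_request(
    filter_value: str,
    per_page: int = 200,
    email: str | None = None,
    api_key: str | None = None,
) -> tuple[dict, dict]:
    """Build (query params, headers) for one page request (illustrative only)."""
    params = {"filter": filter_value, "per-page": min(per_page, 200), "cursor": "*"}
    if api_key:
        params["api_key"] = api_key  # sent as the ?api_key= query parameter
    headers = {}
    if email:
        # Assumed format; the text above only says the email is appended
        # to the User-Agent header.
        headers["User-Agent"] = f"matmmextract (mailto:{email})"
    return params, headers


print(plan_requests(max_results=1000, per_page=200))
print(build_request("open_access.is_oa:true", email="you@example.org"))
```

With the defaults (max_results=1000, per_page=200) this plans five page requests; asking for a per_page above 200 is clamped to the cap.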

Returns:

A FetchResult whose .df attribute has the same column structure as a Scopus CSV export, so it can be passed directly to filter_by_publisher().

Return type:

FetchResult
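
To illustrate what "Scopus-shaped" output might involve, here is a minimal sketch that flattens one OpenAlex work record into a row with DOI, Publisher, Open Access, Title, and License columns. The function name `work_to_row` is hypothetical, the sample record is made up, and the exact values the module writes (for example, how the Open Access column is encoded) are assumptions. The JSON field names follow the OpenAlex Work object.

```python
def work_to_row(work: dict) -> dict:
    """Flatten an OpenAlex work dict into a Scopus-style row (illustrative only)."""
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    best_oa = work.get("best_oa_location") or {}
    open_access = work.get("open_access") or {}

    # OpenAlex stores DOIs as full URLs; strip the resolver prefix.
    doi = work.get("doi") or ""
    prefix = "https://doi.org/"
    if doi.startswith(prefix):
        doi = doi[len(prefix):]

    return {
        "DOI": doi,
        "Publisher": source.get("host_organization_name") or "",
        "Open Access": bool(open_access.get("is_oa", False)),  # encoding assumed
        "Title": work.get("title") or "",
        "License": best_oa.get("license") or "",
    }


# Made-up record for demonstration purposes only.
sample = {
    "doi": "https://doi.org/10.1000/example",
    "title": "An example title",
    "open_access": {"is_oa": True},
    "primary_location": {"source": {"host_organization_name": "Example Publisher"}},
    "best_oa_location": {"license": "cc-by"},
}
print(work_to_row(sample))
```

The `or {}` guards matter because OpenAlex returns null for nested objects such as primary_location when the information is missing.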

Examples

>>> from matmmextract.openalex.fetcher import fetch
>>> result = fetch(publisher="Elsevier", license_="cc-by",
...                keywords=["titanium alloy"], max_results=500)
>>> df = result.df
>>> print(df[["DOI", "Title", "License"]].head())
>>> # Springer, any OA license, nickel alloy and microstructure keywords, 2020-2024
>>> result = fetch(
...     publisher="Springer",
...     license_=None,
...     keywords=["nickel alloy", "microstructure"],
...     from_year=2020, to_year=2024,
...     max_results=2000,
... )
fetch_elsevier(license_: str | list[str] | None = 'cc-by', keywords: list[str] | None = None, from_year: int | None = None, to_year: int | None = None, max_results: int = 1000, output_csv: str | Path | None = None, email: str | None = None, api_key: str | None = None, verbose: bool = True) -> FetchResult

Fetch Elsevier works. Shorthand for fetch(publisher="Elsevier", ...).

fetch_springer(license_: str | list[str] | None = 'cc-by', keywords: list[str] | None = None, from_year: int | None = None, to_year: int | None = None, max_results: int = 1000, output_csv: str | Path | None = None, email: str | None = None, api_key: str | None = None, verbose: bool = True) -> FetchResult

Fetch Springer works. Shorthand for fetch(publisher="Springer", ...).
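
The shorthand pattern can be pictured as a thin wrapper that pins the publisher argument and forwards everything else. The sketch below uses a stand-in `fetch` that only echoes its arguments, so it runs without network access. It is not the module's actual implementation, and only a subset of the documented parameters is shown.

```python
def fetch(publisher=None, license_="cc-by", keywords=None, max_results=1000):
    """Stand-in for the real fetch(); it only echoes its arguments."""
    return {
        "publisher": publisher,
        "license_": license_,
        "keywords": keywords,
        "max_results": max_results,
    }


def fetch_elsevier(license_="cc-by", keywords=None, max_results=1000):
    """Pin publisher to "Elsevier" and forward the remaining arguments."""
    return fetch(publisher="Elsevier", license_=license_,
                 keywords=keywords, max_results=max_results)


def fetch_springer(license_="cc-by", keywords=None, max_results=1000):
    """Pin publisher to "Springer" and forward the remaining arguments."""
    return fetch(publisher="Springer", license_=license_,
                 keywords=keywords, max_results=max_results)


print(fetch_springer(license_=None, keywords=["nickel alloy"], max_results=50))
```

Note that the documented shorthand signatures do not expose is_oa, topics, or per_page, so callers who need those would call fetch() directly.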