Zotero

Our initial GeoArchive use case explores using a Zotero group library as a repository for document-type materials. Zotero is a reference management software tool that includes an online shared storage offering where Group Libraries can be set up for citation metadata and attached file content.

This module uses the pyzotero package to create an API connection to a specified Group Library in Zotero and provides a set of functions for working with that connection. It requires a library_id and an api_key, which can be supplied through environment variables or passed as arguments when instantiating the connection. Certain functionality also requires an inventory_item, which is the identifier for a specific item stored in the library that contains a cache of metadata.

Prerequisites

Connecting to Zotero programmatically requires an API key provided by a Zotero user. An API key will provide read and potentially write access to one or more libraries. Instantiation of a Zot object using the geoarchive.zotero module requires both a library_id and an api_key. Library ID (aka "group") values are public information (unless the library is completely restricted) and viewable in the URL pattern in the Zotero web interface (e.g., https://www.zotero.org/groups/4530692). API keys are secret and should be carefully guarded.
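Since the library ID is visible in the group URL, it can be pulled out with a few lines of Python. The following is a minimal sketch; the helper name library_id_from_url is hypothetical and is not part of geoarchive.zotero, and it assumes the URL follows the /groups/<numeric id> pattern shown above.

```python
import re
from urllib.parse import urlparse


def library_id_from_url(group_url):
    """Return the numeric group ID from a Zotero group URL, or None if absent."""
    # Only the path is inspected, so query strings and fragments are ignored
    path = urlparse(group_url).path
    match = re.search(r"/groups/(\d+)", path)
    return match.group(1) if match else None


print(library_id_from_url("https://www.zotero.org/groups/4530692"))
```

The ID is returned as a string, which is also the form it takes when read from an environment variable.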

Usage Pattern

One potential usage pattern would be to periodically run a process to read everything from a Zotero Group Library and send metadata and files into some other store. This particular mockup is based on our use case for sending a set of documents through processing with the xDD Digital Library and Cyberinfrastructure.

First, you may want to set environment variables for the two key pieces of information - library ID and API key - as opposed to passing them in via the function. For instance, you might set these for a Docker container executing this process via the deploy process.

export ZOTERO_LIBRARY_ID=4530692
export ZOTERO_API_KEY=MyMadeUpAPI_Key
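On the Python side, it can help to confirm that both values are present before attempting a connection. The sketch below is an illustration only: the helper zotero_credentials is hypothetical, and the precedence it applies (explicit arguments first, then environment variables) is an assumption rather than documented behavior of the module.

```python
import os


def zotero_credentials(library_id=None, api_key=None):
    """Return (library_id, api_key), falling back to environment variables."""
    # Assumed precedence: explicit arguments win over the environment
    library_id = library_id or os.environ.get("ZOTERO_LIBRARY_ID")
    api_key = api_key or os.environ.get("ZOTERO_API_KEY")

    settings = (("ZOTERO_LIBRARY_ID", library_id), ("ZOTERO_API_KEY", api_key))
    missing = [name for name, value in settings if not value]
    if missing:
        # Fail early with a clear message instead of a failed API call later
        raise ValueError("Missing Zotero settings: " + ", ".join(missing))
    return library_id, api_key


# Possible use with the documented constructor signature:
# library_id, api_key = zotero_credentials()
# z_lib = zotero.Zot(library_id, api_key)
```

The error message names only the missing variables and never echoes the API key itself.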

Next, you'll need some Python code to handle processing as needed for your scenario. The following mockup provides a rough outline for how the geoarchive.zotero module would be used to interface with a file system or some other data store. Note that the Zot() instance established includes "z", which is the basic connection object from pyzotero. This means you can use any standard function available from pyzotero with that object, which has been authenticated to a Zotero Group Library. The dump() and file() options in the dummy file_handler() function below are from pyzotero.

from geoarchive import zotero
import json
import boto3

z_lib = zotero.Zot()

# This returns a list of dictionaries in a particular target schema
lib_xdd_format = z_lib.item_export(target_schema="xdd", output_format="dict")

def metadata_handler(output_file, lib_meta):
    # Dump the metadata to a file (opened for writing and closed when done)
    with open(output_file, "w") as f:
        json.dump(lib_meta, f)

    # Or you might want to load metadata to a database or whatever

def file_handler(zotero_lib, library_item, output_path=None, bucket_name=None):
    if output_path is not None:
        # If you have a mounted disc to send the files to
        zotero_lib.z.dump(
            library_item["file_key"],
            filename=library_item["filename"],
            path=output_path
        )
    elif bucket_name is not None:
        # If you need to get the file as bytes and send it somewhere else
        f = zotero_lib.z.file(library_item["file_key"])
        s3_client = boto3.client("s3")
        s3_client.put_object(Body=f, Bucket=bucket_name, Key=library_item["filename"])

# Send the metadata somewhere (the output file name here is only an example)
metadata_handler("library_metadata.json", lib_xdd_format)

# Send the files somewhere
for item in lib_xdd_format:
    file_handler(z_lib, item, output_path="z_lib/files")
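For a process that runs periodically over a whole library, it may be preferable for one failed download not to stop the rest of the run. The sketch below shows one way to wrap the per-item loop. The function process_items is hypothetical, and it assumes each item is a dict with a "file_key" entry, as in the mockup above.

```python
def process_items(items, handler):
    """Call handler on each item and return (succeeded_keys, failures)."""
    succeeded, failures = [], []
    for item in items:
        key = item.get("file_key")
        try:
            handler(item)
            succeeded.append(key)
        except Exception as exc:
            # Broad on purpose: record the problem and continue with the next item
            failures.append((key, str(exc)))
    return succeeded, failures
```

With the mockup above, this could be called as process_items(lib_xdd_format, lambda item: file_handler(z_lib, item, output_path="z_lib/files")), with the returned failures logged or retried on the next run.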

API

class geoarchive.zotero.Zot(library_id, api_key)
baseline_cache(self, inventory_item=None, output_path='data')

This function takes the files found in a specified output_path (from baseline or update) for items, collections, and tags and uploads these to the specified inventory item in the library. This function can only be successfully operated using an API key with write permissions.

Args:
    inventory_item (str): Alphanumeric identifier for the inventory item in a given library.
    output_path (str, optional): Relative or absolute path to output json file. Defaults to "data".

Returns: bool: False if the API key does not have permission to write to the library

baseline_collections(self, output_path='data')

This function gets all collections for a library and dumps them as a JSON file to a specified output path.

Args: output_path (str, optional): Relative or absolute path to output json file. Defaults to "data".

baseline_items(self, inventory_item=None, output_path='data')

This function gets all items (metadata and files) for a library and dumps them as a JSON file to a specified output path. It will strip out any items associated with the inventory item.

Args:
    inventory_item (str): Alphanumeric identifier for the inventory item in a given library.
    output_path (str, optional): Relative or absolute path to output json file. Defaults to "data".
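The step of stripping out records tied to the inventory item can be illustrated as a simple filter. This is a sketch under stated assumptions, not the module's actual implementation: it assumes records follow the Zotero API JSON layout, with a top-level "key" and a "data" dict that carries "parentItem" for child records such as attachments. The function name strip_inventory_records is hypothetical.

```python
def strip_inventory_records(items, inventory_item):
    """Drop the inventory item itself and any child records attached to it."""
    return [
        item
        for item in items
        if item.get("key") != inventory_item
        and item.get("data", {}).get("parentItem") != inventory_item
    ]
```

Filtering on both the key and the parent reference keeps the cached JSON attachments from showing up as if they were library documents.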

baseline_tags(self, output_path='data')

This function gets all tags for a library and dumps them as a JSON file to a specified output path.

Args: output_path (str, optional): Relative or absolute path to output json file. Defaults to "data".

build_identifier(self, record)

Builds identifier objects for a particular type of export format.

Args: record (series): Dataframe series (record)

Returns: list: List of identifier objects/dicts

build_link(self, record)

Builds link objects for a particular type of export format.

Args: record (series): Dataframe series (record)

Returns: list: List of link objects/dicts

collection_tags(self, geo_tags=True, custom_tags=None)

This is a fairly specialized function that interprets collections from the Zotero library to turn these into additional type-classified tags.

Args:
    geo_tags (bool, optional): Specifies whether or not to check collections for geo names. Defaults to True.
    custom_tags (list, optional): List of lists of specific types of additional tags that should be found within the collection structure of a library. Defaults to None.

Returns: list: List of additional tags as compound, type-classified strings containing a type and term.

get_inventory_files(self, inventory_item=None)

To facilitate efficient access to larger libraries, we established a method of building and maintaining a specific item in the library where inventory data is cached as JSON files. These files can be retrieved in lieu of what can be a time-consuming pull of the entire recordset via the API. If the inventory item is not explicitly provided, the inventory_item_key() function is used to try to find it.

Args: inventory_item (str): Alphanumeric identifier for the inventory item in a given library. Defaults to None.

Returns: list: List of dictionaries from the Zotero API containing the inventory item and its files

inventory_item_key(self)

If necessary, this function can be used to identify the item containing the inventory cache in the library using a specific convention on the item's title - "Inventory:".

Returns: str: Key (identifier) of the library item providing the cached inventory of items, collections, and tags.
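As an illustration of the title convention, the sketch below scans a list of item records for the inventory item. It assumes the convention means the title starts with "Inventory:" and that records follow the Zotero API JSON layout ("key" at the top level, "title" under "data"); both are assumptions, and find_inventory_key is a hypothetical name rather than the module's own function.

```python
def find_inventory_key(items, prefix="Inventory:"):
    """Return the key of the first item whose title starts with prefix, else None."""
    for item in items:
        # Some item types (e.g. notes) carry no title, so default to ""
        title = item.get("data", {}).get("title") or ""
        if title.startswith(prefix):
            return item.get("key")
    return None
```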

item_export(self, inventory_item=None, target_schema='xdd', output_format='dataframe')

This function packages the items for a given library using the cached inventory files and returns them for some particular use. The configuration details still need to be worked out so that the function can accept some form of configuration principles that determine the appropriate mappings or transformations.

Args:
    inventory_item (str): Alphanumeric identifier for the inventory item in a given library.
    target_schema (str, optional): The particular target output being transformed to. Defaults to "xdd".
    output_format (str, optional): Determines the output format for the transformation. Defaults to "dataframe".

Returns: dataframe/list: Either a dataframe (default) containing the transformation or a list of dicts.

load_df_inventory(self, raw_inventory, tag_delimiter=':')

This function takes the raw inventory dict containing three sets of records for items, collections, and tags and builds Pandas dataframes for further processing. We do some minimal processing on tags to split these out using a ":" delimiter by default.

Args:
    raw_inventory (dict): Dictionary containing three keys with lists for items, collections, and tags from load_raw_inventory().
    tag_delimiter (str, optional): Delimiter to use in splitting tags into classes. Defaults to ":".

Returns: dict: Dictionary containing three dataframes for items, collections, and tags
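The tag splitting described above can be pictured with a small pure-Python helper. This is only a sketch of the idea: the module itself works on Pandas dataframes, the function split_tag is hypothetical, and the sample tag "geo:Nevada" is made up for illustration. It assumes the text before the first delimiter is the tag class and everything after it is the term.

```python
def split_tag(tag, delimiter=":"):
    """Split 'type:term' into (type, term); tags without a delimiter get type None."""
    tag_type, found, term = tag.partition(delimiter)
    if not found:
        # No delimiter present, so the whole string is an unclassified term
        return None, tag.strip()
    return tag_type.strip(), term.strip()


print(split_tag("geo:Nevada"))
```

Splitting only on the first delimiter keeps any later occurrences of the delimiter inside the term.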

load_raw_inventory(self, inventory_item)

This function loads the raw inventory information for items, collections, and tags stored in the inventory item as JSON files.

Args: inventory_item (str): Alphanumeric identifier for the inventory item in a given library

Returns: dict: Dictionary containing lists for each set of cached records - items, collections, and tags

update_inventory(self, inventory_item=None, output_path='data')

This function handles a periodic refresh of the cached inventory. It reads the current inventory files, determines the last version of the library cached, retrieves new records (items, collections, tags), and updates the cached files. This function can only be successfully operated using an API key with write permissions.

Args:
    inventory_item (str): Alphanumeric identifier for the inventory item in a given library.
    output_path (str, optional): Relative or absolute path to output json file. Defaults to "data".

Returns: bool: False if the API key does not have permission to write to the library
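The "last version of the library cached" step can be sketched as taking the highest version number found in the cached records. This is an assumption about how such a value might be derived, not the module's confirmed method. It relies on Zotero API records carrying an integer "version" field, and the function name last_cached_version is hypothetical.

```python
def last_cached_version(raw_inventory):
    """Return the highest 'version' found across all cached record lists."""
    versions = [
        record["version"]
        for records in raw_inventory.values()
        for record in records
        # Some cached entries (e.g. plain tag strings) carry no version
        if isinstance(record, dict) and "version" in record
    ]
    # An empty cache yields 0, which would mean "fetch everything"
    return max(versions, default=0)
```

A value like this could then be supplied as the Zotero Web API's since parameter so that only records changed after the cached version are requested.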