Approach: ShEx-Guided Wikidata Item Creation
Overview
This document outlines an approach for creating new Wikidata items by combining:
1. ShEx Entity Schemas - Define the structure and constraints for valid items
2. Property Mappings - Map source data fields to Wikidata properties
3. Source Data - The raw data to be transformed into Wikidata items
4. Wikidata API - Submit items via authenticated API calls
Components
1. ShEx Schema (Validation & Structure Definition)
ShEx schemas define what a valid Wikidata entity should look like. They specify:
- Required and optional properties
- Value constraints (datatypes, allowed values)
- Qualifiers and references
- Cardinality (how many times a property can appear)
Example from tribe_E502.shex:
<FederallyRecognizedTribe> {
p:P31 @<InstanceOfFedTribe> ; # Must have P31 (instance of)
wdt:P30 [ wd:Q49 ] ; # Must be in North America
wdt:P17 [ wd:Q30 ] + ; # Must be in United States
p:P571 @<Inception> * ; # Optional inception date
p:P2124 @<MemberCount> * ; # Optional member count
...
}
Role: Validation blueprint - tells us what the final item must conform to.
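The cardinality markers in the excerpt above (`+`, `*`, or no marker) follow the standard ShEx convention. As a minimal sketch of how a tool could read them, the hypothetical helpers below (`cardinality_bounds` and `is_required` are illustrative names, not part of gkc) translate a marker into minimum and maximum occurrence counts:

```python
from typing import Optional, Tuple

# ShEx cardinality markers and the (min, max) bounds they stand for.
# A max of None means "unbounded".
CARDINALITY_BOUNDS = {
    "": (1, 1),      # no marker: exactly one
    "?": (0, 1),     # optional, at most one
    "*": (0, None),  # zero or more
    "+": (1, None),  # one or more
}


def cardinality_bounds(marker: str) -> Tuple[int, Optional[int]]:
    """Return (min, max) occurrences for a ShEx cardinality marker."""
    return CARDINALITY_BOUNDS[marker.strip()]


def is_required(marker: str) -> bool:
    """A property is required when it must appear at least once."""
    return cardinality_bounds(marker)[0] >= 1


# wdt:P17 [ wd:Q30 ] +   -> required, may repeat
print(is_required("+"))
# p:P571 @<Inception> *  -> optional, may repeat
print(is_required("*"))
```

Under this reading, a mapping generator could flag every required property that has no corresponding source field.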
2. Property Mapping Configuration
A mapping configuration connects source data fields to Wikidata properties and defines transformation rules.
Auto-Generation from ShEx
Rather than manually creating mapping configurations, you can auto-generate them from EntitySchemas using the ClaimsMapBuilder:
from gkc import ClaimsMapBuilder
# Generate mapping from EntitySchema E502
builder = ClaimsMapBuilder(eid="E502")
mapping = builder.build_complete_mapping(entity_type="Q7840353")
# This fetches live property data from Wikidata to ensure
# accurate datatypes, labels, and descriptions
See Claims Map Builder documentation for details.
Separator Support for Multi-Value Fields
When working with spreadsheet-style data, you often have multiple values in a single field separated by a delimiter (e.g., "alias1; alias2; alias3"). The mapping configuration supports a separator parameter to automatically split these values:
{
"source_field": "tribe_name_aliases",
"language": "en",
"separator": ";",
"comment": "Multiple aliases separated by semicolons"
}
This will split "CNO; Cherokee Nation of Oklahoma; Eastern Band" into three separate alias entries. Common separators include:
- ; - Semicolon (recommended for CSV data)
- , - Comma (use with caution in CSV)
- | - Pipe
- \t - Tab
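The splitting behavior described above can be sketched as follows. This is an illustration only: `split_multi_value` is a hypothetical helper, and trimming whitespace and dropping empty parts are assumptions about how the separator option should behave.

```python
def split_multi_value(raw: str, separator: str = ";") -> list:
    """Split a delimited field into trimmed, non-empty values."""
    if not raw:
        return []
    # Assumption: surrounding whitespace is not part of the value,
    # and empty parts (e.g. from a trailing separator) are dropped.
    return [part.strip() for part in raw.split(separator) if part.strip()]


aliases = split_multi_value("CNO; Cherokee Nation of Oklahoma; Eastern Band")
print(aliases)
```

Each resulting string would then become one alias entry for the configured language.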
Proposed Structure (JSON/YAML):
{
"schema": {
"entity_schema_id": "E502",
"entity_type": "Q7840353",
"description": "Federally recognized tribe mapper"
},
"reference_library": {
"stated_in_federal_register": {
"P248": {
"value": "Q127419548",
"datatype": "wikibase-item",
"comment": "Stated in: Federal Register"
},
"P813": {
"value": "current_date",
"datatype": "time",
"comment": "Retrieved date"
}
}
},
"qualifier_library": {
"point_in_time": {
"property": "P585",
"source_field": "point_in_time_date",
"datatype": "time"
}
},
"mappings": {
"labels": [
{
"source_field": "tribe_name",
"language": "en"
}
],
"aliases": [
{
"source_field": "tribe_name_aliases",
"language": "en",
"separator": ";",
"comment": "Multiple aliases in one field, separated by semicolons"
}
],
"claims": [
{
"property": "P31",
"comment": "Instance of: federally recognized tribe",
"value": "Q7840353",
"datatype": "wikibase-item",
"references": ["stated_in_federal_register"]
},
{
"property": "P1705",
"comment": "Native label",
"source_field": "native_name",
"datatype": "monolingualtext",
"references": ["stated_in_federal_register"]
},
{
"property": "P2124",
"comment": "Member count",
"source_field": "member_count",
"datatype": "quantity",
"qualifiers": [
{
"property": "P585",
"source_field": "count_date",
"datatype": "time"
}
],
"references": ["stated_in_federal_register"]
},
{
"property": "P159",
"comment": "Headquarters location",
"source_field": "headquarters_qid",
"datatype": "wikibase-item",
"qualifiers": [
{
"property": "P625",
"source_field": "headquarters_coordinates",
"datatype": "globe-coordinate"
}
]
}
]
},
"validation": {
"country": "Q30"
}
}
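Because claims refer to entries in `reference_library` by name, a typo in a name would otherwise only surface at transformation time. The sketch below shows one way such a check might look; `find_missing_references` is a hypothetical helper, and the trimmed-down config is for illustration only.

```python
def find_missing_references(config: dict) -> list:
    """Return (property, reference_name) pairs that point at no library entry."""
    known = set(config.get("reference_library", {}))
    missing = []
    for claim in config.get("mappings", {}).get("claims", []):
        for name in claim.get("references", []):
            if name not in known:
                missing.append((claim["property"], name))
    return missing


# Trimmed-down config in the proposed structure; the second claim
# deliberately uses a reference name that is not in the library.
example_config = {
    "reference_library": {"stated_in_federal_register": {}},
    "mappings": {
        "claims": [
            {"property": "P31", "references": ["stated_in_federal_register"]},
            {"property": "P2124", "references": ["stated_in_census"]},
        ]
    },
}

print(find_missing_references(example_config))
```

Running a check like this when the config is loaded keeps broken references from reaching the submission step.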
3. Source Data
Source data can come from various formats (CSV, JSON, database, API):
Example CSV:
tribe_name,tribe_name_aliases,official_name_native,member_count,count_date,headquarters_location,source_reference_item
Cherokee Nation,"CNO; Cherokee Nation of Oklahoma",ᏣᎳᎩ ᎠᏰᎵ,400000,2023-01-01,Cherokee,Q123456
Example JSON:
{
"tribe_name": "Cherokee Nation",
"tribe_name_aliases": "CNO; Cherokee Nation of Oklahoma; Eastern Band",
"official_name_native": "ᏣᎳᎩ ᎠᏰᎵ",
"member_count": 400000,
"count_date": "2023-01-01",
"headquarters_location": "Cherokee",
"headquarters_coordinates": {
"latitude": 35.9149,
"longitude": -94.8703
},
"source_reference_item": "Q123456",
"data_source_item": "Q789"
}
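Whatever the source format, the mapper works on plain dictionaries keyed by field name. A minimal sketch of normalizing CSV and JSON input into that shape, using only the standard library (the function names are hypothetical):

```python
import csv
import io
import json


def records_from_csv(text: str) -> list:
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def records_from_json(text: str) -> list:
    """Parse JSON text; a single object is wrapped into a one-item list."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


csv_text = (
    "tribe_name,tribe_name_aliases,member_count\n"
    'Cherokee Nation,"CNO; Cherokee Nation of Oklahoma",400000\n'
)
rows = records_from_csv(csv_text)
print(rows[0]["tribe_name_aliases"])
print(repr(rows[0]["member_count"]))
```

Note that CSV yields every value as a string (`"400000"`), while JSON preserves numbers, so datatype transformations should accept both.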
4. Wikidata JSON Structure
The Wikidata API expects a specific JSON structure for item creation:
wbeditentity API format:
{
"labels": {
"en": {
"language": "en",
"value": "Cherokee Nation"
}
},
"descriptions": {
"en": {
"language": "en",
"value": "Federally recognized tribe in the United States"
}
},
"claims": {
"P31": [
{
"mainsnak": {
"snaktype": "value",
"property": "P31",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 7840353,
"id": "Q7840353"
},
"type": "wikibase-entityid"
}
},
"type": "statement",
"rank": "normal",
"references": [
{
"snaks": {
"P248": [
{
"snaktype": "value",
"property": "P248",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 123456,
"id": "Q123456"
},
"type": "wikibase-entityid"
}
}
],
"P813": [
{
"snaktype": "value",
"property": "P813",
"datavalue": {
"value": {
"time": "+2024-01-15T00:00:00Z",
"timezone": 0,
"before": 0,
"after": 0,
"precision": 11,
"calendarmodel": "http://www.wikidata.org/entity/Q1985727"
},
"type": "time"
}
}
]
},
"snaks-order": ["P248", "P813"]
}
]
}
],
"P2124": [
{
"mainsnak": {
"snaktype": "value",
"property": "P2124",
"datavalue": {
"value": {
"amount": "+400000",
"unit": "1"
},
"type": "quantity"
}
},
"type": "statement",
"rank": "normal",
"qualifiers": {
"P585": [
{
"snaktype": "value",
"property": "P585",
"datavalue": {
"value": {
"time": "+2023-01-01T00:00:00Z",
"timezone": 0,
"before": 0,
"after": 0,
"precision": 11,
"calendarmodel": "http://www.wikidata.org/entity/Q1985727"
},
"type": "time"
}
}
]
},
"qualifiers-order": ["P585"]
}
]
}
}
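The `datavalue` objects in the structure above are repetitive enough to build with small helpers. The sketch below reproduces the three shapes shown (item, time, quantity); the function names are hypothetical, and the time helper assumes day precision on the proleptic Gregorian calendar only.

```python
import re
from datetime import date

GREGORIAN_CALENDAR = "http://www.wikidata.org/entity/Q1985727"


def item_datavalue(qid: str) -> dict:
    """Build a wikibase-entityid datavalue from a QID such as 'Q7840353'."""
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"Not a valid QID: {qid!r}")
    return {
        "value": {"entity-type": "item", "numeric-id": int(qid[1:]), "id": qid},
        "type": "wikibase-entityid",
    }


def time_datavalue(iso_date: str) -> dict:
    """Build a day-precision time datavalue from a 'YYYY-MM-DD' string."""
    day = date.fromisoformat(iso_date).isoformat()  # validates and normalizes
    return {
        "value": {
            "time": f"+{day}T00:00:00Z",
            "timezone": 0,
            "before": 0,
            "after": 0,
            "precision": 11,  # 11 = day
            "calendarmodel": GREGORIAN_CALENDAR,
        },
        "type": "time",
    }


def quantity_datavalue(amount: int) -> dict:
    """Build a unitless quantity datavalue; the amount is a signed string."""
    return {"value": {"amount": f"{amount:+}", "unit": "1"}, "type": "quantity"}


print(item_datavalue("Q7840353")["value"]["numeric-id"])
print(time_datavalue("2023-01-01")["value"]["time"])
print(quantity_datavalue(400000)["value"]["amount"])
```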
Proposed Architecture
Core Classes
1. PropertyMapper
Handles the mapping configuration and transformation logic.
from typing import Any


class PropertyMapper:
    """Manages property mappings from source data to Wikidata format."""

    def __init__(self, mapping_config: dict):
        """Load mapping configuration."""

    def load_source_data(self, data: dict | list[dict]):
        """Load source data to be transformed."""

    def transform_to_wikidata(self, source_record: dict) -> dict:
        """Transform a single source record to Wikidata JSON format."""

    def create_mainsnak(self, property_id: str, value: Any, datatype: str) -> dict:
        """Create a mainsnak (main value) for a claim."""

    def create_qualifier(self, property_id: str, value: Any, datatype: str) -> dict:
        """Create a qualifier for a claim."""

    def create_reference(self, reference_config: dict, source_record: dict) -> dict:
        """Create a reference block."""
2. WikidataItemBuilder
Builds the complete Wikidata JSON structure.
class WikidataItemBuilder:
    """Builds Wikidata item JSON structures."""

    def __init__(self):
        self.item_data = {"labels": {}, "descriptions": {}, "claims": {}}

    def add_label(self, language: str, value: str) -> "WikidataItemBuilder":
        """Add a label in a specific language."""
        return self

    def add_description(self, language: str, value: str) -> "WikidataItemBuilder":
        """Add a description in a specific language."""
        return self

    def add_claim(self, property_id: str, claim_data: dict) -> "WikidataItemBuilder":
        """Add a claim (statement) to the item."""
        return self

    def build(self) -> dict:
        """Return the complete item JSON."""
        return self.item_data
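The skeleton above leaves the method bodies empty. One possible way to fill them in, shown here only as a sketch under the assumption that the builder writes directly into the wbeditentity structure from section 4 (the class is named `SketchItemBuilder` to make clear it is not the proposed implementation):

```python
class SketchItemBuilder:
    """Illustrative fill-in of the proposed builder; not the final design."""

    def __init__(self) -> None:
        self.item_data = {"labels": {}, "descriptions": {}, "claims": {}}

    def add_label(self, language: str, value: str) -> "SketchItemBuilder":
        # Labels are keyed by language code; a later call overwrites an earlier one.
        self.item_data["labels"][language] = {"language": language, "value": value}
        return self

    def add_description(self, language: str, value: str) -> "SketchItemBuilder":
        self.item_data["descriptions"][language] = {"language": language, "value": value}
        return self

    def add_claim(self, property_id: str, claim_data: dict) -> "SketchItemBuilder":
        # Claims are grouped per property; each property holds a list of statements.
        self.item_data["claims"].setdefault(property_id, []).append(claim_data)
        return self

    def build(self) -> dict:
        return self.item_data


item = (
    SketchItemBuilder()
    .add_label("en", "Cherokee Nation")
    .add_description("en", "Federally recognized tribe in the United States")
    .add_claim("P31", {"type": "statement", "rank": "normal"})
    .build()
)
print(sorted(item["claims"]))
```

Returning `self` from each method is what makes the chained call style possible.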
3. ItemCreator
Orchestrates the entire creation process.
class ItemCreator:
    """Creates Wikidata items from source data using ShEx validation."""

    def __init__(
        self,
        auth: WikiverseAuth,
        mapper: PropertyMapper,
        validator: ShExValidator | None = None,
    ):
        """Initialize with authentication, mapper, and optional validator."""

    def create_item(self, source_record: dict, validate: bool = True) -> str:
        """
        Create a new Wikidata item.

        Args:
            source_record: Source data record
            validate: Whether to validate against ShEx before submission

        Returns:
            QID of created item
        """

    def validate_before_submit(self, item_json: dict) -> bool:
        """Validate the constructed item against ShEx schema."""

    def submit_to_wikidata(self, item_json: dict) -> dict:
        """Submit the item to Wikidata via API."""
Workflow
Step 1: Define Mapping Configuration
Create a mapping file that connects source data to Wikidata properties:
from gkc import PropertyMapper

mapping_config = {
    "schema": {"entity_schema_id": "E502"},
    "mappings": {...},  # See mapping structure above
}
mapper = PropertyMapper(mapping_config)
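In the proposed mapping structure, each claim carries either a fixed `value` or a `source_field` to read from the record. A minimal sketch of how a mapper might resolve the two cases follows; `resolve_claim_values` is a hypothetical helper, and skipping claims whose source field is missing or empty is an assumption rather than settled behavior.

```python
from typing import Any, Iterator, Tuple


def resolve_claim_values(
    mapping_config: dict, record: dict
) -> Iterator[Tuple[str, Any, str]]:
    """Yield (property, raw_value, datatype) for each claim that has a value."""
    for claim in mapping_config.get("mappings", {}).get("claims", []):
        if "value" in claim:
            raw = claim["value"]  # fixed value, same for every record
        else:
            raw = record.get(claim.get("source_field"))  # per-record value
        if raw is None or raw == "":
            continue  # assumption: silently skip claims with no data
        yield claim["property"], raw, claim["datatype"]


config = {
    "mappings": {
        "claims": [
            {"property": "P31", "value": "Q7840353", "datatype": "wikibase-item"},
            {"property": "P2124", "source_field": "member_count", "datatype": "quantity"},
            {"property": "P159", "source_field": "headquarters_qid", "datatype": "wikibase-item"},
        ]
    }
}
record = {"tribe_name": "Cherokee Nation", "member_count": 400000}
print(list(resolve_claim_values(config, record)))
```

Here the P159 claim is dropped because the record has no `headquarters_qid` field; a stricter mapper could report that instead of skipping it.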
Step 2: Load Source Data
source_data = [
    {
        "tribe_name": "Cherokee Nation",
        "member_count": 400000,
        # ... more fields
    }
]
Step 3: Transform and Validate
from gkc import ItemCreator, WikiverseAuth, ShExValidator
# ValidationError and WikidataAPIError are assumed to be exception
# classes provided alongside ItemCreator; they are not defined in this proposal.

auth = WikiverseAuth()
auth.login()
validator = ShExValidator(eid="E502")
creator = ItemCreator(auth=auth, mapper=mapper, validator=validator)

# Transform each source record to Wikidata JSON
for record in source_data:
    try:
        # This will:
        # 1. Transform source data to Wikidata JSON
        # 2. Validate against ShEx schema
        # 3. Submit to Wikidata
        qid = creator.create_item(record, validate=True)
        print(f"Created item: {qid}")
    except ValidationError as e:
        print(f"Validation failed: {e}")
    except WikidataAPIError as e:
        print(f"API error: {e}")
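The submission step ultimately comes down to a POST to the MediaWiki API with `action=wbeditentity`. The sketch below only builds the request parameters and reads the response, so it runs without network access; the helper names are hypothetical, and obtaining the CSRF token (presumably through `WikiverseAuth`) is out of scope here.

```python
import json

WIKIDATA_API_URL = "https://www.wikidata.org/w/api.php"


def build_wbeditentity_params(item_json: dict, csrf_token: str, summary: str = "") -> dict:
    """Build the POST parameters for creating a new item."""
    params = {
        "action": "wbeditentity",
        "new": "item",                   # create a new item rather than edit one
        "data": json.dumps(item_json),   # the item structure, serialized as JSON
        "token": csrf_token,
        "format": "json",
    }
    if summary:
        params["summary"] = summary
    return params


def extract_created_qid(response: dict) -> str:
    """Return the new item's QID, or raise if the API reported an error."""
    if "error" in response:
        raise RuntimeError(response["error"].get("info", "unknown API error"))
    return response["entity"]["id"]


item = {"labels": {"en": {"language": "en", "value": "Cherokee Nation"}}}
params = build_wbeditentity_params(item, csrf_token="example-token+\\")
print(params["action"], params["new"])
```

The parameters would be sent with an authenticated session, for example as the form body of a POST to `WIKIDATA_API_URL`.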
Benefits of This Approach
- Schema-Driven: ShEx schemas ensure data quality and consistency
- Flexible Mapping: Supports complex transformations and multiple source formats
- Validation First: Catch errors before submission
- Auditable: Clear mapping from source to Wikidata
- Reusable: Mapping configs can be shared and versioned
- Incremental: Can be extended later for updates and deletions
Next Steps for Implementation
- Phase 1: Implement core datatype transformations
  - String → label/description
  - Item references → wikibase-entityid
  - Numbers → quantity
  - Dates → time
  - Coordinates → globe-coordinate
- Phase 2: Implement mapping system
  - Load/validate mapping configs
  - Transform source data using mappings
  - Handle qualifiers and references
- Phase 3: Integrate with existing auth system
  - Use WikiverseAuth for API calls
  - Implement wbeditentity API wrapper
  - Handle CSRF tokens
- Phase 4: Add ShEx validation integration
  - Validate transformed JSON before submission
  - Convert JSON to RDF for validation
  - Provide meaningful error messages
- Phase 5: Add batch processing and error handling
  - Process multiple records
  - Retry logic
  - Detailed logging
  - Dry-run mode
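The Phase 5 items (multiple records, retries, dry-run) could be combined in a small driver loop. The following is a sketch under simplifying assumptions: `process_batch` is a hypothetical helper, the `create_item` callable stands in for `ItemCreator.create_item`, and every exception is retried, whereas a real implementation would likely retry only transient API errors.

```python
from typing import Callable, Iterable


def process_batch(
    records: Iterable[dict],
    create_item: Callable[[dict], str],
    dry_run: bool = False,
    max_attempts: int = 3,
) -> dict:
    """Process records one by one, collecting successes and failures."""
    results = {"created": [], "failed": [], "skipped": []}
    for index, record in enumerate(records):
        if dry_run:
            # Dry-run: record what would be processed, submit nothing.
            results["skipped"].append(index)
            continue
        last_error = None
        for _attempt in range(max_attempts):
            try:
                results["created"].append((index, create_item(record)))
                break
            except Exception as exc:  # simplification: retry on any error
                last_error = exc
        else:
            # All attempts failed; keep going with the next record.
            results["failed"].append((index, str(last_error)))
    return results


def fake_create(record: dict) -> str:
    """Stand-in for the real creator, used only for this illustration."""
    if not record.get("tribe_name"):
        raise ValueError("missing tribe_name")
    return "Q1"


print(process_batch([{"tribe_name": "Cherokee Nation"}, {}], fake_create))
```

Collecting failures by record index, rather than stopping at the first error, lets a later run resume with only the failed records.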
Example Usage Pattern
import csv

from gkc import WikiverseAuth, PropertyMapper, ItemCreator, ShExValidator

# 1. Setup authentication
auth = WikiverseAuth()
auth.login()

# 2. Load mapping configuration
mapper = PropertyMapper.from_file("mappings/tribe_mapping.json")

# 3. Setup validator (optional but recommended)
validator = ShExValidator(eid="E502")

# 4. Create the item creator
creator = ItemCreator(auth=auth, mapper=mapper, validator=validator)

# 5. Load your source data (UTF-8 so native-language names survive)
with open("tribes.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        qid = creator.create_item(row, validate=True)
        print(f"Created {row['tribe_name']} as {qid}")

# 6. Cleanup
auth.logout()
Open Questions
- ShEx to Mapping: Can we auto-generate initial mapping configs from ShEx schemas?
- Item Lookup: How to handle looking up existing items (e.g., for headquarters location)?
- Duplicate Detection: Should we check for duplicates before creating?
- Batch Submission: Should we support batch uploads via QuickStatements format?
- Error Recovery: How to handle partial failures in batch operations?
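To make the QuickStatements question more concrete, the sketch below renders a label and a few item-valued claims as QuickStatements V1 commands (tab-separated, with `CREATE` followed by `LAST` lines). It is an exploration aid only: `to_quickstatements` is a hypothetical helper, it handles only labels and plain QID values, and it does not escape quotes inside the label.

```python
def to_quickstatements(label: str, claims: list, language: str = "en") -> str:
    """Render one new item as QuickStatements V1 commands."""
    lines = [
        "CREATE",
        f'LAST\tL{language}\t"{label}"',  # e.g. Len for an English label
    ]
    for prop, qid in claims:
        # Item values are written as bare QIDs.
        lines.append(f"LAST\t{prop}\t{qid}")
    return "\n".join(lines)


commands = to_quickstatements(
    "Cherokee Nation",
    [("P31", "Q7840353"), ("P17", "Q30")],
)
print(commands)
```

Qualifiers, references, and non-item datatypes would need additional formatting rules before this could replace direct API submission.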