🔍 Data Discovery & Ingestion
How we discovered Chicago's open data ecosystem and built a robust data ingestion pipeline using SODA APIs for real-time business intelligence.
🎯 Data Source Discovery
Our journey to find the right data sources for tracking small business activity
The Challenge
We needed to find reliable, real-time data sources that could tell us where new businesses were opening across Chicago. The data had to be:
- Geographically granular (neighborhood/ward level)
- Frequently updated (daily/weekly)
- Comprehensive (covering all business types)
- Free and accessible via API
The Discovery Process
After exploring various options, we discovered Chicago's comprehensive open data portal at data.cityofchicago.org. This portal provides access to hundreds of datasets through SODA (Socrata Open Data API).
The key insight was that business licenses, building permits, and transit data could be combined to create a comprehensive view of economic activity.
🔌 SODA API Integration
Building a robust data ingestion pipeline using Chicago's SODA endpoints
What is SODA?
SODA (Socrata Open Data API) is a RESTful API that provides programmatic access to government datasets. It supports SQL-like queries, filtering, and aggregation directly in the API calls.
This means we can do server-side aggregation, reducing data transfer and processing time.
Core API Endpoints
Business Licenses
Dataset ID: r5zr-chrr
https://data.cityofchicago.org/resource/r5zr-chrr.jsonBuilding Permits
Dataset ID: ydr8-5enu
https://data.cityofchicago.org/resource/ydr8-5enu.jsonCTA Boardings
Dataset ID: 6iiy-9s97
https://data.cityofchicago.org/resource/6iiy-9s97.json💻 Code Implementation
The Python code that powers our data ingestion pipeline
SODA Client Implementation
We built a robust SODA client that handles authentication, rate limiting, and error handling. Here's the actual implementation from our GitHub repository:
Robust client with authentication, rate limiting, and error handling
Business Licenses Query
Here's how we query for new business licenses with server-side aggregation:
def get_new_business_licenses(client: SodaClient, days_back: int = 90):
"""Fetch new business licenses with daily aggregation"""
# Calculate date threshold
from datetime import datetime, timedelta
threshold_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d')
# SODA query parameters for server-side aggregation
params = {
'$select': 'community_area_name,community_area,license_description,' +
'date_trunc_ymd(license_term_start_date) AS day,count(1) AS n',
'$where': f"application_type='ISSUE' AND license_term_start_date >= '{threshold_date}'",
'$group': 'community_area_name,community_area,license_description,day',
'$order': 'day',
'$limit': 10000 # Safety limit
}
return client.query('r5zr-chrr', params)Server-side aggregation for efficient data retrieval
Data Quality Checks
We implement several data quality checks to ensure reliable ingestion:
def validate_business_license_data(data: list) -> bool:
"""Validate business license data quality"""
if not data:
return False
# Check required fields
required_fields = ['community_area_name', 'community_area', 'day', 'n']
for record in data:
if not all(field in record for field in required_fields):
return False
# Validate numeric values
for record in data:
try:
count = int(record['n'])
if count < 0:
return False
except (ValueError, TypeError):
return False
return True
def log_data_quality_metrics(data: list, source: str):
"""Log data quality metrics for monitoring"""
total_records = len(data)
unique_areas = len(set(record['community_area'] for record in data))
unique_dates = len(set(record['day'] for record in data))
print(f"{source} Quality Metrics:")
print(f" Total Records: {total_records}")
print(f" Unique Areas: {unique_areas}")
print(f" Unique Dates: {unique_dates}")
print(f" Date Range: {min(record['day'] for record in data)} to {max(record['day'] for record in data)}")Comprehensive data validation and monitoring functions
📋 Data Schema & Fields
Understanding the structure and meaning of our raw data
Business Licenses Schema
| Field | Type | Description | Example |
|---|---|---|---|
| community_area | Integer | Numeric identifier for Chicago community area | 8 |
| community_area_name | String | Human-readable community area name | Near North Side |
| license_description | String | Type of business license | Retail Food Establishment |
| license_term_start_date | Date | When the license becomes effective | 2025-01-15 |
| application_type | String | Type of application (ISSUE, RENEW, etc.) | ISSUE |
Key Insights from Schema
- Geographic Granularity: Community areas provide neighborhood-level insights
- Temporal Precision: Daily license start dates enable trend analysis
- Business Classification: License descriptions can be mapped to business categories
- Application Types: Filtering by 'ISSUE' gives us new businesses only
⚠️ Challenges & Solutions
Real-world problems we encountered and how we solved them
🚨 Rate Limiting
Chicago's SODA API has rate limits that can cause failures during peak usage.
Solution:
- • Implemented exponential backoff retry logic
- • Added request throttling between calls
- • Used app tokens for higher rate limits
⚠️ Data Consistency
Some community areas have missing or inconsistent naming conventions.
Solution:
- • Built community area mapping table
- • Implemented fuzzy matching for names
- • Added data validation checks
🔍 Field Mapping
API field names don't always match business logic requirements.
Solution:
- • Created field mapping configuration
- • Built schema validation at runtime
- • Added field existence checks
📊 Data Volume
Large datasets can cause memory issues and slow processing.
Solution:
- • Used server-side aggregation ($group, $select)
- • Implemented pagination for large queries
- • Added streaming data processing
Ready for the Next Chapter?
Now that we understand our data sources, let's see how we transform this raw data into actionable business insights.