🔍 Data Discovery & Ingestion

How we discovered Chicago's open data ecosystem and built a robust data ingestion pipeline using SODA APIs for real-time business intelligence.

🎯 Data Source Discovery

Our journey to find the right data sources for tracking small business activity

The Challenge

We needed to find reliable, real-time data sources that could tell us where new businesses were opening across Chicago. The data had to be:

  • Geographically granular (neighborhood/ward level)
  • Frequently updated (daily/weekly)
  • Comprehensive (covering all business types)
  • Free and accessible via API

The Discovery Process

After exploring various options, we discovered Chicago's comprehensive open data portal at data.cityofchicago.org. This portal provides access to hundreds of datasets through SODA (Socrata Open Data API).

The key insight was that business licenses, building permits, and transit data could be combined to create a comprehensive view of economic activity.

🔌 SODA API Integration

Building a robust data ingestion pipeline using Chicago's SODA endpoints

What is SODA?

SODA (Socrata Open Data API) is a RESTful API that provides programmatic access to government datasets. It supports SQL-like queries, filtering, and aggregation directly in the API calls.

This means we can do server-side aggregation, reducing data transfer and processing time.
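
For example, a single request can filter and aggregate on the server. Here is an illustrative query against the Business Licenses dataset (introduced below) that counts issued licenses per community area; the specific parameter values are chosen for the example:

import requests

# Illustrative SoQL query: filter and aggregate server-side so only
# summary rows are transferred, not raw license records.
response = requests.get(
    'https://data.cityofchicago.org/resource/r5zr-chrr.json',
    params={
        '$select': 'community_area_name, count(*) AS n',
        '$where': "application_type='ISSUE'",
        '$group': 'community_area_name',
        '$order': 'n DESC',
        '$limit': 5,
    },
    timeout=30,
)
print(response.json())  # five busiest community areas by issued licenses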

Core API Endpoints

Business Licenses

Dataset ID: r5zr-chrr

https://data.cityofchicago.org/resource/r5zr-chrr.json

Building Permits

Dataset ID: ydr8-5enu

https://data.cityofchicago.org/resource/ydr8-5enu.json

CTA Boardings

Dataset ID: 6iiy-9s97

https://data.cityofchicago.org/resource/6iiy-9s97.json

💻 Code Implementation

The Python code that powers our data ingestion pipeline

SODA Client Implementation

We built a robust SODA client that handles authentication, rate limiting, and error handling; the full implementation lives in our GitHub repository.

SODA Client Implementation

Robust client with authentication, rate limiting, and error handling
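
A condensed sketch of that client, assuming the SodaClient name and query() method that the examples below rely on; the base URL and X-App-Token header are standard SODA conventions, while the timing constants and retry counts are illustrative choices rather than the repository's exact values:

import time
from typing import Optional

import requests

class SodaClient:
    """Minimal SODA client: app-token auth, throttling, retry with backoff."""

    BASE_URL = 'https://data.cityofchicago.org/resource'

    def __init__(self, app_token: Optional[str] = None,
                 min_interval: float = 0.5, max_retries: int = 3):
        self.session = requests.Session()
        if app_token:
            # App tokens grant a higher SODA rate limit than anonymous access
            self.session.headers['X-App-Token'] = app_token
        self.min_interval = min_interval   # minimum seconds between requests
        self.max_retries = max_retries
        self._last_request = 0.0

    def query(self, dataset_id: str, params: dict) -> list:
        """Run a SoQL query against a dataset and return the JSON rows."""
        url = f'{self.BASE_URL}/{dataset_id}.json'
        for attempt in range(self.max_retries):
            # Throttle: keep a minimum spacing between consecutive calls
            wait = self.min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
            self._last_request = time.monotonic()

            response = self.session.get(url, params=params, timeout=30)
            if response.status_code == 429:
                # Rate limited: exponential backoff, then retry
                time.sleep(2 ** attempt)
                continue
            response.raise_for_status()
            return response.json()
        raise RuntimeError(f'{dataset_id}: retries exhausted after rate limiting')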

Business Licenses Query

Here's how we query for new business licenses with server-side aggregation:

Business Licenses Query
def get_new_business_licenses(client: SodaClient, days_back: int = 90):
    """Fetch new business licenses with daily aggregation"""

    # Calculate date threshold
    from datetime import datetime, timedelta
    threshold_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d')

    # SODA query parameters for server-side aggregation
    params = {
        '$select': 'community_area_name,community_area,license_description,' +
                   'date_trunc_ymd(license_term_start_date) AS day,count(1) AS n',
        '$where': f"application_type='ISSUE' AND license_term_start_date >= '{threshold_date}'",
        '$group': 'community_area_name,community_area,license_description,day',
        '$order': 'day',
        '$limit': 10000  # Safety limit
    }

    return client.query('r5zr-chrr', params)

Server-side aggregation for efficient data retrieval
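
A hypothetical call site tying the two together, assuming the SodaClient sketch above (the app token value is a placeholder):

client = SodaClient(app_token='YOUR_APP_TOKEN')  # token optional, but recommended
rows = get_new_business_licenses(client, days_back=30)
print(f'Fetched {len(rows)} aggregated daily rows')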

Data Quality Checks

We implement several data quality checks to ensure reliable ingestion:

Data Quality Validation
def validate_business_license_data(data: list) -> bool:
    """Validate business license data quality"""

    if not data:
        return False

    # Check required fields
    required_fields = ['community_area_name', 'community_area', 'day', 'n']
    for record in data:
        if not all(field in record for field in required_fields):
            return False

    # Validate numeric values
    for record in data:
        try:
            count = int(record['n'])
            if count < 0:
                return False
        except (ValueError, TypeError):
            return False

    return True

def log_data_quality_metrics(data: list, source: str):
    """Log data quality metrics for monitoring"""
    total_records = len(data)
    unique_areas = len(set(record['community_area'] for record in data))
    unique_dates = len(set(record['day'] for record in data))

    print(f"{source} Quality Metrics:")
    print(f"   Total Records: {total_records}")
    print(f"   Unique Areas: {unique_areas}")
    print(f"   Unique Dates: {unique_dates}")
    print(f"   Date Range: {min(record['day'] for record in data)} to {max(record['day'] for record in data)}")

Comprehensive data validation and monitoring functions
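
In the pipeline, these checks gate each ingest before anything moves downstream. A sketch of how the pieces fit together, reusing the client and query function from above:

data = get_new_business_licenses(client, days_back=90)
if not validate_business_license_data(data):
    raise ValueError('Business license ingest failed validation')
log_data_quality_metrics(data, source='Business Licenses')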

📋 Data Schema & Fields

Understanding the structure and meaning of our raw data

Business Licenses Schema

Field                   | Type    | Description                                   | Example
------------------------|---------|-----------------------------------------------|---------------------------
community_area          | Integer | Numeric identifier for Chicago community area | 8
community_area_name     | String  | Human-readable community area name            | Near North Side
license_description     | String  | Type of business license                      | Retail Food Establishment
license_term_start_date | Date    | When the license becomes effective            | 2025-01-15
application_type        | String  | Type of application (ISSUE, RENEW, etc.)      | ISSUE

Key Insights from Schema

  • Geographic Granularity: Community areas provide neighborhood-level insights
  • Temporal Precision: Daily license start dates enable trend analysis
  • Business Classification: License descriptions can be mapped to business categories (see the sketch after this list)
  • Application Types: Filtering by 'ISSUE' gives us new businesses only
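
Because license descriptions are free-text labels, mapping them to categories is a lookup problem. A hypothetical, abbreviated mapping; the real table would cover every Chicago license type:

# Hypothetical, abbreviated description-to-category table
LICENSE_CATEGORIES = {
    'Retail Food Establishment': 'Food & Dining',
    'Limited Business License': 'General Retail & Services',
    'Package Goods': 'Food & Dining',
    'Home Occupation': 'Home-Based Business',
}

def categorize(license_description: str) -> str:
    # Unmapped descriptions fall into a catch-all bucket
    return LICENSE_CATEGORIES.get(license_description, 'Other')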

⚠️ Challenges & Solutions

Real-world problems we encountered and how we solved them

🚨 Rate Limiting

Chicago's SODA API has rate limits that can cause failures during peak usage.

Solution:
  • Implemented exponential backoff retry logic
  • Added request throttling between calls
  • Used app tokens for higher rate limits

⚠️ Data Consistency

Some community areas have missing or inconsistent naming conventions.

Solution:
  • Built community area mapping table
  • Implemented fuzzy matching for names (see the sketch below)
  • Added data validation checks
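
For the fuzzy matching, Python's standard library is sufficient. A minimal sketch, where the canonical list stands in for the full 77-entry community area table:

import difflib
from typing import Optional

# Stand-in for the full 77-entry community area mapping table
CANONICAL_AREAS = ['Near North Side', 'Loop', 'Near West Side', 'Lake View']

def normalize_area_name(raw_name: str) -> Optional[str]:
    """Resolve an inconsistently spelled area name to its canonical form."""
    cleaned = ' '.join(raw_name.split()).title()  # collapse whitespace, fix case
    matches = difflib.get_close_matches(cleaned, CANONICAL_AREAS, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(normalize_area_name('near  north side'))  # -> 'Near North Side'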

🔍 Field Mapping

API field names don't always match business logic requirements.

Solution:
  • Created field mapping configuration (sketched below)
  • Built schema validation at runtime
  • Added field existence checks
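
A hypothetical field-mapping configuration with a runtime existence check; the internal names on the right are illustrative:

# Hypothetical mapping from SODA field names to our internal names
FIELD_MAP = {
    'community_area': 'area_id',
    'community_area_name': 'area_name',
    'license_description': 'license_type',
    'day': 'date',
    'n': 'license_count',
}

def remap_record(record: dict) -> dict:
    """Rename API fields to internal names, failing fast on missing fields."""
    missing = [field for field in FIELD_MAP if field not in record]
    if missing:
        raise KeyError(f'API record missing expected fields: {missing}')
    return {internal: record[api] for api, internal in FIELD_MAP.items()}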

📊 Data Volume

Large datasets can cause memory issues and slow processing.

Solution:
  • Used server-side aggregation ($group, $select)
  • Implemented pagination for large queries (see the sketch below)
  • Added streaming data processing
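
SODA pagination uses the $limit and $offset parameters. A minimal sketch of page-wise streaming with the client sketched earlier; the page size is an illustrative choice, and params should include an $order clause so pages are stable:

def fetch_all_pages(client: SodaClient, dataset_id: str, params: dict,
                    page_size: int = 1000):
    """Yield records page by page instead of holding the full result in memory."""
    offset = 0
    while True:
        page = client.query(dataset_id, {
            **params,
            '$limit': page_size,
            '$offset': offset,
        })
        if not page:
            break  # an empty page means we've read everything
        yield from page
        offset += page_size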

Ready for the Next Chapter?

Now that we understand our data sources, let's see how we transform this raw data into actionable business insights.