Step 2: Data Ingestion
Connect to data sources, validate data quality, and establish reliable ingestion patterns.
Chicago Open Data Analysis
Discovery: Chicago maintains 600+ datasets through their Open Data Portal, but only 12 were relevant for small business analysis.
Key Datasets Identified:
- • Business Licenses: 200K+ active licenses, updated daily
- • Building Permits: Construction and renovation permits
- • Business License Applications: New business tracking
- • Zoning Data: Commercial district boundaries
API Integration Strategy
Socrata Open Data API (SODA)
Chicago uses Socrata for their data portal. Key implementation details:
# Basic API exploration
import requests
import pandas as pd

# Business Licenses endpoint
base_url = "https://data.cityofchicago.org/resource/"
business_licenses = "r5kz-chrr.json"

# Test API connection and data structure
response = requests.get(f"{base_url}{business_licenses}",
                        params={"$limit": 10}, timeout=30)
response.raise_for_status()  # stop early on HTTP errors

# Examine data schema (keys of the first record only)
sample_data = response.json()
print(f"Columns: {list(sample_data[0].keys())}")
print(f"Record count: {len(sample_data)}")
✅ API Advantages
- • Real-time data access
- • SQL-like filtering ($where)
- • JSON response format
- • No authentication required
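The SQL-like filtering listed above can be sketched as follows. This is a minimal illustration, not the project's actual query: the column names (`legal_name`, `zip_code`, `license_status`) and the status value `'AAI'` are assumptions and should be checked against the dataset's schema. `$select`, `$where`, and `$limit` are standard SODA query parameters; the helper names `build_license_query` and `build_url` are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofchicago.org/resource/"
BUSINESS_LICENSES = "r5kz-chrr.json"


def build_license_query(zip_code, limit=100):
    """Build SODA query parameters for a filtered request.

    Column names and the 'AAI' status value are assumptions,
    not confirmed against the live schema.
    """
    # Only digits are allowed, so the value is safe to embed in $where.
    if not zip_code.isdigit():
        raise ValueError(f"unexpected ZIP code: {zip_code!r}")
    return {
        "$select": "legal_name, zip_code, license_status",
        "$where": f"zip_code = '{zip_code}' AND license_status = 'AAI'",
        "$limit": limit,
    }


def build_url(params):
    """Return the encoded request URL without sending anything."""
    return f"{BASE_URL}{BUSINESS_LICENSES}?{urlencode(params)}"


params = build_license_query("60614")
print(build_url(params))
```

The same `params` dictionary can be passed to `requests.get(..., params=params)`, so the filter runs on the server and only matching rows are transferred.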
⚠️ API Limitations
- • 1000 record default limit
- • Rate limiting (unclear docs)
- • Inconsistent data types
- • No change logs
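One way to work around the 1000-record default limit is to page through results with `$limit` and `$offset`. The sketch below is an assumption about how that could look, not the pipeline's actual code: `paginate` and `fetch_page_http` are hypothetical helpers, and ordering by `:id` is included on the assumption that a stable sort keeps pages from overlapping. The optional pause is a guess at being polite to an API whose rate limits are not clearly documented.

```python
import time

URL = "https://data.cityofchicago.org/resource/r5kz-chrr.json"


def paginate(fetch_page, page_size=1000, max_pages=None, pause_seconds=0.0):
    """Yield records page by page until a short or empty page comes back.

    fetch_page is any callable that takes a dict of query parameters
    and returns a list of records.
    """
    offset = 0
    pages = 0
    while max_pages is None or pages < max_pages:
        params = {"$limit": page_size, "$offset": offset, "$order": ":id"}
        page = fetch_page(params)
        yield from page
        if len(page) < page_size:
            break  # last page reached
        offset += page_size
        pages += 1
        time.sleep(pause_seconds)


def fetch_page_http(params):
    """Fetch one page over HTTP (hypothetical helper)."""
    import requests  # imported lazily so paginate() can be tested offline

    response = requests.get(URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()
```

A cautious first run might be `rows = list(paginate(fetch_page_http, max_pages=2))`, which caps the download at two pages while the paging logic is being verified.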
Framework Progress
- • 1. Scope & Strategy ✓
- • 2. Data Ingestion ✓
- • 3. Transform & Model
- • 4. Load & Validate
- • 5. Visualize & Report
- • 6. Automate & Scale
Key Lesson
Spending 1 week on data quality assessment prevented 3 weeks of debugging later. Always validate data assumptions before building pipelines.
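A lightweight way to check such assumptions on a sample of API records is a per-column profile of missing values and observed types. The sketch below is illustrative only: `profile_records` is a hypothetical helper, and the sample records are made up to show a column that mixes strings and integers and a record that lacks a field entirely.

```python
from collections import Counter, defaultdict


def profile_records(records):
    """Report, per column, how many records lack a value and which types appear."""
    type_counts = defaultdict(Counter)
    present = Counter()
    for record in records:
        for column, value in record.items():
            if value is None or value == "":
                continue  # treat empty values as missing
            present[column] += 1
            type_counts[column][type(value).__name__] += 1
    total = len(records)
    return {
        column: {
            "missing": total - present[column],
            "types": dict(type_counts[column]),
        }
        for column in sorted(present)
    }


# Made-up sample records, for illustration only.
sample = [
    {"license_id": "1001", "zip_code": "60614"},
    {"license_id": "1002", "zip_code": 60622},
    {"license_id": "1003"},
]
for column, stats in profile_records(sample).items():
    print(column, stats)
```

Columns that report more than one type, or a high missing count, are the ones worth inspecting before any transformation logic depends on them.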
Tools & Libraries
Python Requests, Pandas, Socrata API, Data Profiling