Making Cities Smarter through Graph Theory


Recent advancements in Urban Graph Networks reveal how graph theory can transform the way we move through cities! These networks leverage the connections between various urban elements to optimize travel routes, reduce congestion, and enhance public transport. The potential for smart city initiatives is vast, offering improved efficiency for residents and visitors alike. 🚍

Embracing Graph Theory

Embracing graph theory offers consulting and technology firms a wealth of opportunities. By integrating these innovative approaches into their service portfolios, companies can provide clients with data solutions that not only simplify urban navigation but also contribute to sustainable urban planning. The result? Enhanced quality of life for citizens and a significant competitive advantage for businesses that position themselves at the forefront of these transformative trends. 🌟

Graph theory, a branch of mathematics, studies the relationships between objects. In the context of urban planning, these objects can be anything from intersections and roads to public transport routes and pedestrian pathways. By modeling these elements as nodes and edges in a graph, urban planners can analyze and optimize the flow of traffic, identify bottlenecks, and propose efficient routes. This approach not only improves travel times but also reduces environmental impact by minimizing congestion and emissions. 🌍
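
To make the idea concrete, here is a minimal sketch in Python using the networkx library. The intersections and travel times are invented for illustration; a real model would be built from actual road and transit data.

import networkx as nx

# Hypothetical road network: nodes are intersections, edge weights are travel times in minutes
G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 4), ("B", "C", 3), ("A", "D", 7),
    ("D", "C", 2), ("C", "E", 5), ("B", "E", 9),
])

# Fastest route between two intersections
print(nx.shortest_path(G, "A", "E", weight="weight"))

# Betweenness centrality flags intersections that many shortest routes pass through (potential bottlenecks)
print(nx.betweenness_centrality(G, weight="weight"))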

The Power of Urban Graph Networks

Urban Graph Networks utilize graph neural networks (GNNs) to process and analyze complex urban data. These networks can predict traffic patterns, detect anomalies, and even forecast the impact of new infrastructure projects. For example, a GNN can analyze data from various sources, such as GPS signals, social media check-ins, and public transport schedules, to provide real-time insights into urban mobility. This information can help city planners make informed decisions about where to build new roads, how to optimize public transport routes, and how to manage traffic during peak hours. 🕒

One of the key benefits of Urban Graph Networks is their ability to adapt to changing conditions. As cities grow and evolve, so do the patterns of movement within them. GNNs can continuously learn from new data, ensuring that the models remain accurate and relevant. This adaptability is crucial for creating resilient and future-proof urban environments. 🏙️
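
For readers curious what such a model looks like in code, below is a minimal, illustrative sketch of a graph neural network using PyTorch Geometric. It is not a production traffic model; the input features and the predicted congestion score are assumptions made for the example.

import torch
from torch_geometric.nn import GCNConv

class TrafficGNN(torch.nn.Module):
    # Two-layer graph convolutional network that maps per-intersection features
    # (e.g. recent traffic counts) to a predicted congestion score per node
    def __init__(self, num_features: int, hidden: int = 16):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)
        self.conv2 = GCNConv(hidden, 1)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)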

Real-World Applications of Graph Theory in Urban Planning

To better understand the impact of graph theory on urban planning, let's explore some real-world applications:

  1. Traffic Management: Cities like Singapore and London have implemented intelligent traffic management systems that use graph theory to optimize traffic flow. By analyzing data from traffic cameras, sensors, and GPS devices, these systems can predict congestion and suggest alternative routes to drivers in real-time.

  2. Public Transport Optimization: In cities such as New York and Tokyo, public transport authorities use graph theory to design efficient bus and train routes. By modeling the transport network as a graph, they can identify the most critical nodes (stations) and edges (routes) to ensure maximum coverage and minimal travel time for passengers.

  3. Emergency Response: Graph theory is also used in emergency response planning. For instance, during natural disasters, authorities can use graph models to determine the fastest evacuation routes and the best locations for emergency shelters. This helps in minimizing response times and ensuring the safety of residents.

  4. Urban Development: In rapidly growing cities like Dubai, urban planners use graph theory to plan new infrastructure projects. By analyzing the existing urban network, they can identify areas that need new roads, bridges, or public transport links to support future growth.

Challenges and Future Directions

While the potential of graph theory in urban planning is immense, there are also challenges that need to be addressed:

  1. Data Quality and Availability: The effectiveness of graph-based models depends on the quality and availability of data. In many cities, data collection infrastructure is still lacking, which can limit the accuracy of these models.

  2. Computational Complexity: Analyzing large urban networks can be computationally intensive. Advanced algorithms and high-performance computing resources are required to process and analyze the vast amounts of data involved.

  3. Privacy Concerns: The use of data from sources like GPS signals and social media check-ins raises privacy concerns. Ensuring that data is anonymized and used ethically is crucial for gaining public trust.

Despite these challenges, the future of graph theory in urban planning looks promising. Advances in machine learning and artificial intelligence are expected to further enhance the capabilities of Urban Graph Networks, making cities smarter and more efficient.

The Role of Public Participation

An often overlooked but crucial aspect of implementing graph theory in urban planning is public participation. Engaging citizens in the planning process can provide valuable insights and foster a sense of ownership and cooperation. Here are some ways to involve the public:

  1. Community Workshops: Organize workshops where residents can learn about graph theory and its applications in urban planning. These sessions can also serve as platforms for gathering feedback and suggestions.

  2. Surveys and Polls: Conduct surveys and polls to understand the needs and preferences of the community. This data can be integrated into graph models to ensure that the proposed solutions align with public expectations.

  3. Collaborative Platforms: Develop online platforms where citizens can contribute data, report issues, and suggest improvements. These platforms can facilitate continuous engagement and ensure that the urban environment evolves in line with the needs of its inhabitants.

Let's Discuss!

We'd love to hear from those pondering the effects of AI and Graph Theory on the future of cities! How can your business utilize these technologies for urban innovation? Here are a few steps to get started:

  1. Identify Key Urban Elements: Determine which elements of the urban environment are most critical to your objectives. These could include roads, intersections, public transport routes, and pedestrian pathways.
  2. Collect and Integrate Data: Gather data from various sources, such as GPS signals, social media check-ins, and public transport schedules. Integrate this data into a unified platform for analysis.
  3. Model the Urban Environment: Use graph theory to model the urban environment as a network of nodes and edges. This model will serve as the foundation for analysis and optimization.
  4. Analyze and Optimize: Apply graph neural networks to analyze the data and identify patterns. Use these insights to optimize travel routes, reduce congestion, and enhance public transport.
  5. Implement and Monitor: Implement the proposed changes and continuously monitor their impact. Use real-time data to make adjustments and ensure that the urban environment remains efficient and sustainable.

Dive deeper into this fascinating topic by reading the article here:

👉 Urban Graph Networks

Building a Data Dashboard

Image by Author

With source data from a Postgres database

Using the streamlit Python library

Having worked as a Python data engineer for many years, one area I was never very involved in was the production of data dashboards. That all changed when Python-based libraries such as Streamlit, Gradio and Taipy came along.

With their introduction, Python programmers no longer had an excuse not to craft nice-looking front-ends and dashboards.

Until then, the only other options were to use specialised tools like Tableau or AWS’s Quicksight or—horror of horrors—get your hands dirty with CSS, HTML, and JavaScript.

So, if you’ve never used one of these new Python-based graphical front-end libraries before, this article is for you, as I’ll take you through how to code up a data dashboard using one of the most popular libraries for this purpose: Streamlit.

My intention is that this will be the first part of a series of articles on developing a data dashboard using three of the most popular Python-based GUI libraries. In addition to this one, I also plan to release articles on Gradio and Taipy, so look out for those. As much as possible, I’ll try to replicate the same layout and functionality in each dashboard, and I’ll use the exact same data for all three, albeit in different formats, e.g. a CSV file, a database, etc.

Please also note that I have no connection or affiliation with Streamlit/Snowflake, Postgres or any other company or tool mentioned in this post.

What is Streamlit?

Founded in 2018 by Adrien Treuille, Amanda Kelly, and Thiago Teixeira, Streamlit quickly gained popularity among data scientists and machine learning engineers when it introduced its open-source Python framework to simplify the creation of interactive data applications.

In March 2022, Snowflake, a Data Cloud company, acquired Streamlit and its capabilities were integrated into the Snowflake ecosystem to enhance data application development.

Streamlit’s open-source framework has been widely adopted, with over 8 million downloads and more than 1.5 million applications built using the platform. An active community of developers and contributors continues to play a significant role in its ongoing development and success.

What we’ll develop

We’re going to develop a data dashboard. Our source data for the dashboard will be in a single Postgres database table and contain 100,000 synthetic sales records.

To be honest, the actual source of the data isn’t that important. It could just as easily be a text or CSV file, SQLite, or any database you can connect to. I chose Postgres because I have a copy on my local PC, and it's convenient for me to use.

This is what our final dashboard will look like.

Image by Author

There are four main sections.

  • The top row allows the user to choose specific start and end dates and/or product categories via date pickers and a drop-down list, respectively.
  • The second row — Key metrics — shows a top-level summary of the chosen data.
  • The Visualisation section allows the user to select one of three graphs to display the input data set.
  • The raw data section is exactly what it says. This is a tabular representation of the chosen data, effectively viewing the underlying Postgres database table data.

Using the dashboard is easy. Initially, stats for the whole data set are displayed. The user can then narrow the data focus using the 3 choice fields at the top of the display. The graphs, key metrics and raw data sections dynamically change to reflect what the user has chosen.

The underlying data

As mentioned, the dashboard's source data is contained in a single Postgres database table. The data is a set of 100,000 synthetic sales-related data records. Here is the Postgres table creation script for reference.

CREATE TABLE IF NOT EXISTS public.sales_data
(
    order_id integer NOT NULL,
    order_date date,
    customer_id integer,
    customer_name character varying(255) COLLATE pg_catalog."default",
    product_id integer,
    product_names character varying(255) COLLATE pg_catalog."default",
    categories character varying(100) COLLATE pg_catalog."default",
    quantity integer,
    price numeric(10,2),
    total numeric(10,2)
)

And here is some Python code you can use to generate a data set for yourself. Make sure both the numpy and polars libraries are installed first.

# generate the 100,000 record CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )

    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100

    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days

    # Create random dates as np.array and convert to string format
    order_dates = np.array([(start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d') for _ in range(nrows)])

    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }

    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',', include_header=True)  # Ensure comma is used as the delimiter

# Generate 100,000 rows of data with random order_date and save to CSV
generate(100_000, "/mnt/d/sales_data/sales_data.csv")
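
Once the CSV file exists, you still need to get it into Postgres. One straightforward option is psycopg2's copy_expert with a COPY ... FROM STDIN statement. The sketch below reuses the table, file path and local credentials shown elsewhere in this article; adjust them for your own setup.

import psycopg2

conn = psycopg2.connect(dbname="postgres", user="postgres", password="postgres",
                        host="localhost", port="5432")
# Stream the CSV into the sales_data table; HEADER true skips the header row
with conn, conn.cursor() as cur, open("/mnt/d/sales_data/sales_data.csv") as f:
    cur.copy_expert("COPY public.sales_data FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.close()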

Setting up our development environment

Before we get to the example code, let’s set up a separate development environment. That way, what we do won’t interfere with the versions of libraries and other dependencies we might have on the go for other projects we’re working on.

I use Miniconda for this, but you can use whatever method suits you best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link,

Miniconda - Anaconda documentation

Once the environment is created, switch to it using the activate command, and then pip install our required Python libraries.

#create our test environment
(base) C:\Users\thoma>conda create -n streamlit_test python=3.12 -y
# Now activate it
(base) C:\Users\thoma>conda activate streamlit_test
# Install python libraries, etc ...
(streamlit_test) C:\Users\thoma>pip install streamlit pandas matplotlib psycopg2

The Code

I’ll split the code up into sections and explain each one along the way.

#
# Streamlit equivalent of final Gradio app
#
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import psycopg2
from psycopg2 import sql
from psycopg2 import pool

# Initialize connection pool
try:
    connection_pool = psycopg2.pool.ThreadedConnectionPool(
        minconn=5,
        maxconn=20,
        dbname="postgres",
        user="postgres",
        password="postgres",
        host="localhost",
        port="5432"
    )
except psycopg2.Error as e:
    st.error(f"Error creating connection pool: {e}")

def get_connection():
    try:
        return connection_pool.getconn()
    except psycopg2.Error as e:
        st.error(f"Error getting connection from pool: {e}")
        return None

def release_connection(conn):
    try:
        connection_pool.putconn(conn)
    except psycopg2.Error as e:
        st.error(f"Error releasing connection back to pool: {e}")

We start by importing all the external libraries we’ll need. Next, we set up a ThreadedConnectionPool that allows multiple threads to share a pool of database connections. Two helper functions follow, one to get a database connection and the other to release it. This is overkill for a simple single-user app but essential for handling multiple simultaneous users or threads accessing the database in a web app environment.
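
One optional tweak worth knowing about (not part of the code above): because Streamlit reruns the whole script on every interaction, you may want to create the pool only once per process. Recent Streamlit versions offer st.cache_resource for exactly this purpose; a minimal sketch:

# Optional: create the pool once and reuse it across Streamlit reruns
@st.cache_resource
def init_connection_pool():
    return psycopg2.pool.ThreadedConnectionPool(
        minconn=5, maxconn=20,
        dbname="postgres", user="postgres", password="postgres",
        host="localhost", port="5432",
    )

connection_pool = init_connection_pool()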

def get_date_range():
    conn = get_connection()
    if conn is None:
        return None, None
    try:
        with conn.cursor() as cur:
            query = sql.SQL("SELECT MIN(order_date), MAX(order_date) FROM public.sales_data")
            cur.execute(query)
            return cur.fetchone()
    finally:
        release_connection(conn)

def get_unique_categories():
    conn = get_connection()
    if conn is None:
        return []
    try:
        with conn.cursor() as cur:
            query = sql.SQL("SELECT DISTINCT categories FROM public.sales_data ORDER BY categories")
            cur.execute(query)
            return [row[0].capitalize() for row in cur.fetchall()]
    finally:
        release_connection(conn)

def get_dashboard_stats(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return None
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                WITH category_totals AS (
                    SELECT
                        categories,
                        SUM(price * quantity) as category_revenue
                    FROM public.sales_data
                    WHERE order_date BETWEEN %s AND %s
                    AND (%s = 'All Categories' OR categories = %s)
                    GROUP BY categories
                ),
                top_category AS (
                    SELECT categories
                    FROM category_totals
                    ORDER BY category_revenue DESC
                    LIMIT 1
                ),
                overall_stats AS (
                    SELECT
                        SUM(price * quantity) as total_revenue,
                        COUNT(DISTINCT order_id) as total_orders,
                        SUM(price * quantity) / COUNT(DISTINCT order_id) as avg_order_value
                    FROM public.sales_data
                    WHERE order_date BETWEEN %s AND %s
                    AND (%s = 'All Categories' OR categories = %s)
                )
                SELECT
                    total_revenue,
                    total_orders,
                    avg_order_value,
                    (SELECT categories FROM top_category) as top_category
                FROM overall_stats
            """)
            cur.execute(query, [start_date, end_date, category, category,
                                start_date, end_date, category, category])
            return cur.fetchone()
    finally:
        release_connection(conn)

The get_date_range function executes the SQL query to find the range of dates (MIN and MAX) in the order_date column and returns the two dates as a tuple: (start_date, end_date).

The get_unique_categories function runs an SQL query to fetch unique values from the categories column. It capitalizes the category names (first letter uppercase) before returning them as a list.

The get_dashboard_stats function executes a SQL query with the following parts:

  • category_totals: Calculates total revenue for each category in the given date range.
  • top_category: Finds the category with the highest revenue.
  • overall_stats: Computes overall statistics:
    • Total revenue (SUM(price * quantity)).
    • Total number of unique orders (COUNT(DISTINCT order_id)).
    • Average order value (total revenue divided by total orders).

It returns a single row containing:

  • total_revenue: Total revenue in the specified period.
  • total_orders: Number of distinct orders.
  • avg_order_value: Average revenue per order.
  • top_category: The category with the highest revenue.

def get_plot_data(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT DATE(order_date) as date,
                       SUM(price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                AND (%s = 'All Categories' OR categories = %s)
                GROUP BY DATE(order_date)
                ORDER BY date
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=['date', 'revenue'])
    finally:
        release_connection(conn)

def get_revenue_by_category(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT categories,
                       SUM(price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                AND (%s = 'All Categories' OR categories = %s)
                GROUP BY categories
                ORDER BY revenue DESC
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=['categories', 'revenue'])
    finally:
        release_connection(conn)

def get_top_products(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT product_names,
                       SUM(price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                AND (%s = 'All Categories' OR categories = %s)
                GROUP BY product_names
                ORDER BY revenue DESC
                LIMIT 10
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=['product_names', 'revenue'])
    finally:
        release_connection(conn)

def get_raw_data(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT
                    order_id, order_date, customer_id, customer_name,
                    product_id, product_names, categories, quantity, price,
                    (price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                AND (%s = 'All Categories' OR categories = %s)
                ORDER BY order_date, order_id
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])
    finally:
        release_connection(conn)

def plot_data(data, x_col, y_col, title, xlabel, ylabel, orientation='v'):
    fig, ax = plt.subplots(figsize=(10, 6))
    if not data.empty:
        if orientation == 'v':
            ax.bar(data[x_col], data[y_col])
        else:
            ax.barh(data[x_col], data[y_col])
        ax.set_title(title)
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        plt.xticks(rotation=45)
    else:
        ax.text(0.5, 0.5, "No data available", ha='center', va='center')
    return fig

The get_plot_data function fetches daily revenue within the given date range and category. It retrieves data grouped by the day (DATE(order_date)) and calculates daily revenue (SUM(price * quantity)), then returns a Pandas DataFrame with columns: date (the day) and revenue (total revenue for that day).

The get_revenue_by_category function fetches revenue totals grouped by category within the specified date range. It groups data by categories and calculates revenue for each category (SUM(price * quantity)), orders the results by revenue in descending order and returns a Pandas DataFrame with columns: categories (category name) and revenue (total revenue for the category).

The get_top_products function retrieves the top 10 products by revenue within the given date range and category. It groups data by product_names and calculates revenue for each product (SUM(price * quantity)), orders the products by revenue in descending order and limits results to the top 10 before returning a Pandas DataFrame with columns: product_names (product name) and revenue (total revenue for the product).

The get_raw_data function fetches raw transaction data within the specified date range and category.

The plot_data function takes in some data (in a pandas DataFrame) and the names of the columns you want to plot on the x- and y-axes. It then creates a bar chart — either vertical or horizontal, depending on the chosen orientation — labels the axes, adds a title, and returns the finished chart (a Matplotlib Figure). If the data is empty, it just displays a “No data available” message instead of trying to plot anything.

# Streamlit App
st.title("Sales Performance Dashboard")

# Filters
with st.container():
    col1, col2, col3 = st.columns([1, 1, 2])
    min_date, max_date = get_date_range()
    start_date = col1.date_input("Start Date", min_date)
    end_date = col2.date_input("End Date", max_date)
    categories = get_unique_categories()
    category = col3.selectbox("Category", ["All Categories"] + categories)

# Custom CSS for metrics
st.markdown("""
<style>
.metric-row {
    display: flex;
    justify-content: space-between;
    margin-bottom: 20px;
}
.metric-container {
    flex: 1;
    padding: 10px;
    text-align: center;
    background-color: #f0f2f6;
    border-radius: 5px;
    margin: 0 5px;
}
.metric-label {
    font-size: 14px;
    color: #555;
    margin-bottom: 5px;
}
.metric-value {
    font-size: 18px;
    font-weight: bold;
    color: #0e1117;
}
</style>
""", unsafe_allow_html=True)

# Metrics
st.header("Key Metrics")
stats = get_dashboard_stats(start_date, end_date, category)
if stats:
    total_revenue, total_orders, avg_order_value, top_category = stats
else:
    total_revenue, total_orders, avg_order_value, top_category = 0, 0, 0, "N/A"

# Custom metrics display
metrics_html = f"""
<div class="metric-row">
    <div class="metric-container">
        <div class="metric-label">Total Revenue</div>
        <div class="metric-value">${total_revenue:,.2f}</div>
    </div>
    <div class="metric-container">
        <div class="metric-label">Total Orders</div>
        <div class="metric-value">{total_orders:,}</div>
    </div>
    <div class="metric-container">
        <div class="metric-label">Average Order Value</div>
        <div class="metric-value">${avg_order_value:,.2f}</div>
    </div>
    <div class="metric-container">
        <div class="metric-label">Top Category</div>
        <div class="metric-value">{top_category}</div>
    </div>
</div>
"""
st.markdown(metrics_html, unsafe_allow_html=True)

This code section creates the main structure for displaying the key metrics in the Streamlit dashboard. It:

  1. Sets up the page title: “Sales Performance Dashboard.”
  2. Presents filters for start/end dates and category selection.
  3. Retrieves metrics (such as total revenue, total orders, etc.) for the chosen filters from the database.
  4. Applies custom CSS to style these metrics in a row of boxes with labels and values.
  5. Displays the metrics within an HTML block, ensuring each metric gets its own styled container.

# Visualization Tabs
st.header("Visualizations")
tabs = st.tabs(["Revenue Over Time", "Revenue by Category", "Top Products"])

# Revenue Over Time Tab
with tabs[0]:
    st.subheader("Revenue Over Time")
    revenue_data = get_plot_data(start_date, end_date, category)
    st.pyplot(plot_data(revenue_data, 'date', 'revenue', "Revenue Over Time", "Date", "Revenue"))

# Revenue by Category Tab
with tabs[1]:
    st.subheader("Revenue by Category")
    category_data = get_revenue_by_category(start_date, end_date, category)
    st.pyplot(plot_data(category_data, 'categories', 'revenue', "Revenue by Category", "Category", "Revenue"))

# Top Products Tab
with tabs[2]:
    st.subheader("Top Products")
    top_products_data = get_top_products(start_date, end_date, category)
    st.pyplot(plot_data(top_products_data, 'product_names', 'revenue', "Top Products", "Revenue", "Product Name", orientation='h'))

This section adds a header titled “Visualizations” to this part of the dashboard. It creates three tabs, each of which displays a different graphical representation of the data:

Tab 1: Revenue Over Time

  • Fetches revenue data grouped by date for the given filters using get_plot_data().
  • Calls plot_data() to generate a bar chart of revenue over time, with dates on the x-axis and revenue on the y-axis.
  • Displays the chart in the first tab.

Tab 2: Revenue by Category

  • Fetches revenue grouped by category using get_revenue_by_category().
  • Calls plot_data() to create a bar chart of revenue by category.
  • Displays the chart in the second tab.

Tab 3: Top Products

  • Fetches top 10 products by revenue for the given filters using get_top_products().
  • Calls plot_data() to create a horizontal bar chart (indicated by orientation='h').
  • Displays the chart in the third tab.

st.header("Raw Data")

raw_data = get_raw_data(
    start_date=start_date,
    end_date=end_date,
    category=category
)

# Remove the index by resetting it and dropping the old index
raw_data = raw_data.reset_index(drop=True)

st.dataframe(raw_data, hide_index=True)

# Add spacing
st.write("")

The final section displays the raw data in a dataframe. The user is able to scroll up and down as required to see all records available.

An empty st.write("") is added at the end to provide spacing for better visual alignment.

Running the App

Let’s say you save your code into a file called app.py. You can run it using this from the command line,

(streamlit_test) C:\Users\thoma> python -m streamlit run app.py

If everything works as expected, you will see this after you run the above command.


You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://192.168.0.59:8501

Click on the Local URL shown, and a browser screen should appear with the Streamlit app running.

Summary

In this article, I’ve attempted to provide a comprehensive guide to building an interactive sales performance dashboard using Streamlit with a Postgres database table as its source data.

Streamlit is a modern, Python-based open-source framework that simplifies the creation of data-driven dashboards and applications. The dashboard I developed allows users to filter data by date ranges and product categories, view key metrics such as total revenue and top-performing categories, explore visualizations like revenue trends and top products, and scroll through the underlying raw data.

This guide includes a complete implementation, from setting up a Postgres database with sample data to creating Python functions for querying data, generating plots, and handling user input. This step-by-step approach demonstrates how to leverage Streamlit’s capabilities to create user-friendly and dynamic dashboards, making it ideal for data engineers and scientists who want to build interactive data applications.

Although I used Postgres for my data, it should be straightforward to modify the code to use a CSV file or any other relational database management system (RDBMS), such as SQLite, as your data source.
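
As a rough sketch of that swap, the two connection helpers could be replaced with SQLite equivalents like the ones below. The database file name is an assumption, and the SQL placeholders elsewhere in the code would also need changing from Postgres-style %s to SQLite-style ?.

import sqlite3

DB_PATH = "sales_data.db"  # hypothetical SQLite file containing a sales_data table

def get_connection():
    # No connection pool needed for a small SQLite-backed app; open a connection per request
    return sqlite3.connect(DB_PATH)

def release_connection(conn):
    conn.close()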

That’s all from me for now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content.



Building a Data Dashboard was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.




Why Generative-AI Apps’ Quality Often Sucks and What to Do About It

How to get from PoCs to tested high-quality applications in production

Image licensed from elements.envato.com, edit by Marcel Müller, 2025

The generative AI hype has rolled through the business world in the past two years. This technology can make business process executions more efficient, reduce wait time, and reduce process defects. Some interfaces like ChatGPT make interacting with an LLM easy and accessible. Anyone with experience using a chat application can effortlessly type a query, and ChatGPT will always generate a response. Yet the quality of the generated content, and its suitability for the intended use, may vary. This is especially true for enterprises that want to use generative AI technology in their business operations.

I have spoken to countless managers and entrepreneurs who failed in their endeavors because they could not get high-quality generative AI applications into production or get reproducible results from a non-deterministic model. On the other hand, I have also built more than three dozen AI applications and have noticed one common misconception when people think about quality for generative AI applications: they think it is all about how powerful the underlying model is. But this is only 30% of the full story.

But there are dozens of techniques, patterns, and architectures that help create impactful LLM-based applications of the quality that businesses desire. Different foundation models, fine-tuned models, architectures with retrieval augmented generation (RAG) and advanced processing pipelines are just the tip of the iceberg.

This article shows how we can qualitatively and quantitatively evaluate generative AI applications in the context of concrete business processes. We will not stop at generic benchmarks but introduce approaches to evaluating applications with generative AI. After a quick analysis of generative AI applications and their business processes, we will look into the following questions:

  • In what context do we need to evaluate generative AI applications to assess their end-to-end quality and utility in enterprise applications?
  • When in the development life cycle of applications with generative AI, do we use different approaches for evaluation, and what are the objectives?
  • How do we use different metrics in isolation and production to select, monitor and improve the quality of generative AI applications?

This overview will give us an end-to-end evaluation framework for generative AI applications in enterprise scenarios that I call the PEEL (performance evaluation for enterprise LLM applications). Based on the conceptual framework created in this article, we will introduce an implementation concept as an addition to the entAIngine Test Bed module as part of the entAIngine platform.

1. Background: Business Processes and Generative AI

An organization lives by its business processes. Everything in a company can be a business process, such as customer support, software development, and operations. Generative AI can improve our business processes by making them faster and more efficient, reducing wait time and improving outcome quality. Yet, we can break each process activity that uses generative AI down even further.

Processes for generative AI applications. © 2025, Marcel Müller

The illustration shows the start of a simple business process that a telecommunications company's customer support agent must go through. Every time a new customer support request comes in, the customer support agent has to assign it a priority level. Once a request reaches the top of their work list, the agent must find the correct answer and write an answer email. Afterward, they send the email to the customer and wait for a reply, and they iterate until the request is solved.

We can use a generative AI workflow to make the “find and write answer” activity more efficient. Yet, this activity is often not a single call to ChatGPT or another LLM but a collection of different tasks. In our example, the telco company has built a pipeline using the entAIngine process platform that consists of the following steps.

  • Extract the question and generate a query to the vector database. The example company has a vector database as knowledge for retrieval augmented generation (RAG). We need to extract the essence of the customer’s question from their request email to have the best query and find the sections in the knowledge base that are semantically as close as possible to the question.
  • Find context in the knowledge base. The semantic search activity is the next step in our process. Retrieval-reranking structures are often used to get the top k context chunks relevant to the query and sort them with an LLM. This step aims to retrieve the correct context information to generate the best answer possible.
  • Use context to generate an answer. This step orchestrates a large language model using a prompt and the selected context as input to the prompt.
  • Write an answer email. The final step transforms the pre-formulated answer into a formal email with the correct intro and ending to the message in the company’s desired tone and complexity.
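
In plain Python, such a four-step orchestration might be sketched as follows. All function names, prompts and the vector-store client are hypothetical placeholders for illustration, not entAIngine APIs.

def answer_support_email(inquiry_email: str, vector_store, llm) -> str:
    # Step 1: extract the core question from the customer's email
    question = llm.generate(f"Extract the customer's question from this email:\n{inquiry_email}")

    # Step 2: semantic search over the knowledge base for the most relevant context chunks
    context_chunks = vector_store.search(question, top_k=5)

    # Step 3: generate a draft answer grounded only in the retrieved context
    draft = llm.generate(f"Answer the question using only this context:\n{context_chunks}\n\nQuestion: {question}")

    # Step 4: turn the draft into a formal reply in the company's tone
    return llm.generate(f"Rewrite this as a polite customer support email:\n{draft}")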

The execution of processes like this is called the orchestration of an advanced LLM workflow. There are dozens of other orchestration architectures in enterprise contexts. Using a chat interface that uses the current prompt and the chat history is also a simple type of orchestration. Yet, for reproducible enterprise workflows with sensitive company data, using a simple chat orchestration is not enough in many cases, and advanced workflows like those shown above are needed.

Thus, when we evaluate complex processes for generative AI orchestrations in enterprise scenarios, looking purely at the capabilities of a foundational (or fine-tuned) model is, in many cases, just the start. The following section will dive deeper into what context and orchestration we need to evaluate generative AI applications.

2. Concept

The following sections introduce the core concepts for our approach.

My team has built the entAIngine platform, which is quite unique in that it enables low-code generation of applications with generative AI tasks that are not necessarily chatbot applications. We have also implemented the following approach on entAIngine. If you want to try it out, message me. Or, if you want to build your own test bed functionality, feel free to take inspiration from the concept below.

2.1. Context and Orchestration of Performance Evaluation for Generative AI Applications

When evaluating the performance of generative AI applications in their orchestrations, we have the following choices: We can evaluate a foundational model in isolation, a fine-tuned model or either of those options as part of a larger orchestration, including several calls to different models and RAG. This has the following implications.

Context and orchestration for LLM-based applications. © Marcel Müller, 2025

Publicly available generative AI models like (for LLMs) GPT-4o, Llama 3.2 and many others were trained on the "public wisdom of the internet." Their training sets included a large corpus of knowledge from books, world literature, Wikipedia articles, and other internet crawls from forums and blog posts. There is no company-internal knowledge encoded in foundational models. Thus, when we evaluate a foundational model in isolation, we can only evaluate its general capabilities for answering queries. The extensiveness of company-specific knowledge, that is, "how much the model knows" about the business, cannot be judged. Company-specific knowledge only enters the picture with advanced orchestration that inserts company-specific context.

For example, with a free account from ChatGPT, anyone can ask, “How did Goethe die?” The model will provide an answer because the key information about Goethe’s life and death is in the model’s knowledge base. Yet, the question “How much revenue did our company make last year in Q3 in EMEA?” will most likely lead to a heavily hallucinated answer which will seem plausible to inexperienced users. However, we can still evaluate the form and representation of the answers, including style and tone, as well as language capabilities and skills concerning reasoning and logical deduction. Synthetic benchmarks such as ARC, HellaSwag, and MMLU provide comparative metrics for those dimensions. We will take a deeper look into those benchmarks in a later section.

Fine-tuned models build on foundational models. They use additional data sets to add knowledge to the model that was not there before by further training the underlying machine learning model. Fine-tuned models have more context-specific knowledge. Suppose we orchestrate them in isolation without any other ingested data. In that case, we can evaluate the knowledge base with respect to its suitability for real-world scenarios in a given business process. Fine-tuning is often used to add domain-specific vocabulary and sentence structures to a foundational model.

Suppose we train a model on a corpus of legal court rulings. In that case, a fine-tuned model will start using the vocabulary and reproducing the sentence structure that is common in the legal domain. The model can combine some excerpts from old cases but fails to quote the right sources.

Orchestrating foundational models or fine-tuned models with retrieval augmented generation (RAG) produces highly context-dependent results. However, this also requires a more complex orchestration pipeline.

For example, a telco company, like in our example above, can use a language model to create embeddings of their customer support knowledge base and store them in a vector store. We can now efficiently query this knowledge base in a vector store with semantic search. By keeping track of the text segments that are retrieved, we can very precisely show the source of the retrieved text chunk and use it as context in a call to a large language model. This lets us answer our question end-to-end.

We can evaluate how well our application serves its intended purpose end-to-end for such large orchestrations with different data processing pipeline steps.

Evaluating those different types of setups gives us different insights that we can use in the development process of generative AI applications. We will look deeper into this aspect in the next section.

2.2 Evaluation of Generative AI Applications in the Development Lifecycle

We develop generative AI applications in different stages: 1) before building, 2) during build and testing, and 3) in production. With an agile approach, these stages are not executed in a linear sequence but iteratively. Yet, the goals and methods of evaluation in the different stages remain the same regardless of their order.

Before building, we need to evaluate which foundational model to choose or whether to create a new one from scratch. Therefore, we must first define our expectations and requirements, especially w.r.t. execution time, efficiency, price and quality. Currently, only very few companies decide to build their own foundational models from scratch due to cost and updating efforts. Fine-tuning and retrieval augmented generation are the standard tools to build highly personalized pipelines with traceable internal knowledge that leads to reproducible outputs. In this stage, synthetic benchmarks are the go-to approaches to achieve comparability. For example, if we want to build an application that helps lawyers prepare their cases, we need a model that is good at logical argumentation and understanding of a specific language.

During building, our evaluation needs to focus on satisfying the quality and performance requirements of the application’s example cases. In the case of building an application for lawyers, we need to make a limited, representative selection of old cases. Those cases are the basis for defining the standard scenarios on which we implement the application. For example, if the lawyer specializes in financial law and taxation, we would select a few of this lawyer's standard cases to create scenarios from. Every building and evaluation activity that we do in this phase has a limited view of representative scenarios and does not cover every instance. Yet, we need to evaluate the scenarios in the ongoing steps of application development.

In production, our evaluation approach focuses on quantitatively evaluating the real-world usage of our application with the expectations of live users. In production, we will find scenarios that are not covered in our building scenarios. The goal of the evaluation in this phase is to discover those scenarios and gather feedback from live users to improve the application further.

The production phase should always feed back into the development phase to improve the application iteratively. Hence, the three phases are not in a linear sequence, but interleaving.

2.3. Benchmark Metrics for Evaluation

With the “what” and “when” of the evaluation covered, we have to ask “how” we are going to evaluate our generative AI applications. For this, we have three different methods: synthetic benchmarks, limited scenarios and feedback loop evaluation in production.

For synthetic benchmarks, we will look into the most commonly used approaches and compare them.

The AI2 Reasoning Challenge (ARC) tests an LLM’s knowledge and reasoning using a dataset of 7787 multiple-choice science questions. These questions range from 3rd to 9th grade and are divided into Easy and Challenge sets. ARC is useful for evaluating diverse knowledge types and pushing models to integrate information from multiple sentences. Its main benefit is comprehensive reasoning assessment, but it’s limited to scientific questions.

HellaSwag tests commonsense reasoning and natural language inference through sentence completion exercises based on real-world scenarios. Each exercise includes a video caption context and four possible endings. This benchmark measures an LLM’s understanding of everyday scenarios. Its main benefit is the complexity added by adversarial filtering, but it primarily focuses on general knowledge, limiting specialized domain testing.

The MMLU (Massive Multitask Language Understanding) benchmark measures an LLM’s natural language understanding across 57 tasks covering various subjects, from STEM to humanities. It includes 15,908 questions from elementary to advanced levels. MMLU is ideal for comprehensive knowledge assessment. Its broad coverage helps identify deficiencies, but limited construction details and errors may affect reliability.

TruthfulQA evaluates an LLM’s ability to generate truthful answers, addressing hallucinations in language models. It measures how accurately an LLM can respond, especially when training data is insufficient or low quality. This benchmark is useful for assessing accuracy and truthfulness, with the main benefit of focusing on factually correct answers. However, its general knowledge dataset may not reflect truthfulness in specialized domains.

The RAGAS framework is designed to evaluate Retrieval Augmented Generation (RAG) pipelines. It is especially useful for the category of LLM applications that utilize external data to enhance the LLM’s context. The framework introduces metrics for faithfulness, answer relevancy, context recall, context precision, context relevancy, context entity recall and summarization score that can be used to assess the quality of the retrieved outputs in a differentiated way.

WinoGrande tests an LLM’s commonsense reasoning through pronoun resolution problems based on the Winograd Schema Challenge. It presents near-identical sentences with different answers based on a trigger word. This benchmark is beneficial for resolving ambiguities in pronoun references, featuring a large dataset and reduced bias. However, annotation artifacts remain a limitation.

The GSM8K benchmark measures an LLM’s multi-step mathematical reasoning using around 8,500 grade-school-level math problems. Each problem requires multiple steps involving basic arithmetic operations. This benchmark highlights weaknesses in mathematical reasoning, featuring diverse problem framing. However, the simplicity of problems may limit their long-term relevance.

SuperGLUE enhances the GLUE benchmark by testing an LLM’s NLU capabilities across eight diverse subtasks, including Boolean Questions and the Winograd Schema Challenge. It provides a thorough assessment of linguistic and commonsense knowledge. SuperGLUE is ideal for broad NLU evaluation, with comprehensive tasks offering detailed insights. However, fewer models are tested compared to benchmarks similar to MMLU.

HumanEval measures an LLM’s ability to generate functionally correct code through coding challenges and unit tests. It includes 164 coding problems with several unit tests per problem. This benchmark assesses coding and problem-solving capabilities, focusing on functional correctness similar to human evaluation. However, it only covers some practical coding tasks, limiting its comprehensiveness.

MT-Bench evaluates an LLM’s capability in multi-turn dialogues by simulating real-life conversational scenarios. It measures how effectively chatbots engage in conversations, following a natural dialogue flow. With a carefully curated dataset, MT-Bench is useful for assessing conversational abilities. However, its small dataset and the difficulty of faithfully simulating real conversations remain limitations.

All those metrics are synthetic and aim to provide a relative comparison between different LLMs. However, their concrete impact for a use case in a company depends on how well the challenge in the scenario maps to the benchmark. For example, in use cases for tax accounting where a lot of math is needed, GSM8K would be a good candidate to evaluate that capability. HumanEval is the initial tool of choice for the use of an LLM in a coding-related scenario.

2.4. Real-life Scenario-based Evaluation

However, the impact of those benchmarks is rather abstract and only gives an indication of their performance in an enterprise use case. This is where working with real-life scenarios is needed.

Real-life scenarios consist of the following components:

  • case-specific context data (input),
  • case-independent context data,
  • a sequence of tasks to complete and
  • the expected output.

With real-life test scenarios, we can model different situations, like

  • multi-step chat interactions with several questions and answers,
  • complex automation tasks with multiple AI interactions,
  • processes that involve RAG and
  • multi-modal process interactions.

In other words, it does not help anyone to have the best model in the world if the RAG pipeline always returns mediocre results because your chunking strategy is not good. Also, if you do not have the right data to answer your queries, you will always get some hallucinations that may or may not be close to the truth. In the same way, your results will vary based on the hyperparameters of your chosen models (temperature, frequency penalty, etc.). And we cannot use the most powerful model for every use case, if this is an expensive model.

Standard benchmarks focus on the individual models rather than on the big picture. That is why we introduce the PEEL framework for performance evaluation of enterprise LLM applications, which gives us an end-to-end view.

The core concept of PEEL is the evaluation scenario. We distinguish between an evaluation scenario definition and an evaluation scenario execution. The conceptual illustration shows the overall concepts in black, an example definition in blue and the outcome of one instance of an execution in green.

The concept of evaluation scenarios as introduced by the PEEL framework © Marcel Müller

An evaluation scenario definition consists of input definitions, an orchestration definition and an expected output definition.

For the input, we distinguish between case-specific and case-independent context data. Case-specific context data changes from case to case. For example, in the customer support use case, the question that a customer asks is different from customer case to customer case. In our example evaluation execution, we depicted one case where the email inquiry reads as follows:

“Dear customer support,

my name is […]. How do I reset my router when I move to a different apartment?

Kind regards, […] “

Yet, the knowledge base where the answers to the question are located in large documents is case-independent. In our example, we have a knowledge base with the pdf manuals for the routers AR83, AR93, AR94 and BD77 stored in a vector store.

An evaluation scenario definition has an orchestration. An orchestration consists of a series of n >= 1 steps that are executed in sequence during the evaluation scenario execution. Each step takes its inputs from any of the previous steps or from the input to the scenario execution. Steps can be interactions with LLMs (or other models), context retrieval tasks (for example, from a vector db) or other calls to data sources. For each step, we distinguish between the prompt / request and the execution parameters. The execution parameters include the model or method that needs to be executed and its hyperparameters. The prompt / request is a collection of different static or dynamic data pieces that get concatenated (see illustration).
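
To make this more tangible, a single step definition could be represented by a small data structure along these lines (a sketch; the field names are made up and are not the PEEL or entAIngine schema):

from dataclasses import dataclass, field

@dataclass
class StepDefinition:
    name: str                    # e.g. "extract question"
    method: str                  # model or retrieval method to execute, e.g. an LLM name or "semantic_search"
    prompt_parts: list[str]      # static text plus references to scenario inputs or earlier step outputs
    hyperparameters: dict = field(default_factory=dict)  # e.g. {"temperature": 0.2}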

In our example, we have a three-step orchestration. In step 1, we extract a single question from the case-specific input context (the customer’s email inquiry). We use this question in step 2 to create a semantic search query in our vector database using the cosine similarity metric. The last step takes the search results and formulates an email using an LLM.

In an evaluation scenario definition, we have an expected output and an evaluation method. Here, we define for every scenario how we want to evaluate the actual outcome vs. the expected outcome. We have the following options:

  • Exact match/regex match: We check for the occurrence of a specific series of terms/concepts and give as an answer a boolean where 0 means that the defined terms did not appear in the output of the execution and 1 means they did appear. For example, the core concept of installing a router at a new location is pressing the reset button for 3 seconds. If the terms “reset button” and “3 seconds” are not part of the answer, we would evaluate it as a failure.
  • Semantic match: We check if the text is semantically close to what our expected answer is. Therefore, we use an LLM and task it to judge with a rational number between 0 and 1 how well the answer matches the expected answer.
  • Manual match: Humans evaluate the output on a scale between 0 and 1.

An evaluation scenario should be executed many times because LLMs are non-deterministic models. We want to have a reasonable number of executions so we can aggregate the scores and have a statistically significant output.
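
As a rough illustration of scoring and aggregating repeated executions (a sketch, not the PEEL implementation; run_orchestration and the required terms are hypothetical):

import re
import statistics

def exact_match(output: str, required_terms: list[str]) -> float:
    # 1.0 only if every required term (e.g. "reset button", "3 seconds") appears in the output
    return float(all(re.search(re.escape(term), output, re.IGNORECASE) for term in required_terms))

def evaluate_scenario(run_orchestration, required_terms, n_runs: int = 100) -> dict:
    # Execute the same scenario many times because LLM outputs are non-deterministic, then aggregate
    scores = [exact_match(run_orchestration(), required_terms) for _ in range(n_runs)]
    return {
        "mean_score": statistics.mean(scores),
        "share_below_0_3": sum(s < 0.3 for s in scores) / n_runs,
    }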

The benefit of using such scenarios is that we can use them while building and debugging our orchestrations. When we see that 80 out of 100 executions of the same prompt score less than 0.3, we use this input to tweak our prompts or to add other data to our fine-tuning before orchestration.

2.5. Feedback Collection and Adjustment in Production

The principle for collecting feedback in production is analogous to the scenario approach. We map each user interaction to a scenario. If the user has larger degrees of freedom of interaction, we might need to create new scenarios that we did not anticipate during the building phase.

The user gets a slider between 0 and 1 where they can indicate how satisfied they were with the result. From a user experience perspective, this number can also be presented in different ways, for example as a happy, neutral or sad smiley. Thus, this evaluation is the manual match method that we introduced before.

In production, we have to create the same aggregations and metrics as before, just with live users and a potentially larger amount of data.

3. Example Implementation as Part of entAIngine Test Bed

Together with the entAIngine team, we have implemented this functionality on the platform. This section shows how things could be done and should give you some inspiration. Or, if you want to use what we have implemented, feel free to.

We take our concepts for evaluation scenarios and evaluation scenario definitions and map them to classic concepts of software testing. The starting point for any interaction to create a new test is the entAIngine application dashboard.

entAIngine dashboard © Marcel Müller

In entAIngine, users can create many different applications. Each application is a set of processes that define workflows in a no-code interface. Processes consist of input templates (variables), RAG components, calls to LLMs, TTS, Image and Audio modules, integration with documents and OCR. With these components, we build reusable processes that can be integrated via an API, used as chat flows, used in a text editor as a dynamic text-generating block, or in a knowledge management search interface that shows the sources of answers. This functionality is already completely implemented in the entAIngine platform and can be used as SaaS or deployed 100% on-premise. It integrates with existing gateways, data sources and models via API. We will use the process template generator to create evaluation scenario definitions.

When the user wants to create a new test, they go to “test bed” and “tests”.

On the tests screen, the user can create new evaluation scenarios or edit existing ones. When creating a new evaluation scenario, the orchestration (an entAIngine process template) and a set of metrics must be defined. We assume we have a customer support scenario where we need to retrieve data with RAG to answer a question in the first step and then formulate an answer email in the second step. Then, we use the new module to name the test, define / select a process template and pick an evaluator that will create a score for every individual test case.

Test definition © Marcel Müller, 2025
Test case (process template) definition © Marcel Müller, 2025

The metrics are as defined above: regex match, semantic match and manual match. The screen with the process definition already exists and is functional, together with the orchestration. The functionality to define tests in bulk, as seen below, is new.

Test and test cases © Marcel Müller, 2025

In the test editor, we work on an evaluation scenario definition (“evaluate how good our customer support answering RAG is”) and define different test cases in this scenario. A test case assigns data values to the variables in the test. We can try 50 or 100 different test case instances and evaluate and aggregate them. For example, if we evaluate our customer support answering, we can define 100 different customer support requests, define our expected outcomes, and then execute them and analyze how good the answers were. Once we have designed a set of test cases, we can execute their scenarios with the right variables using the existing orchestration engine and evaluate them.

Metrics and evaluation © Marcel Müller, 2025

This testing happens during the building phase. An additional screen lets us evaluate real user feedback in the production phase; its contents are collected through our engine and API.

The metrics available in the live feedback section are collected from users through a star rating.

Conclusion: Testing and Quality

In this article, we have looked into advanced testing and quality engineering concepts for generative AI applications, especially those that are more complex than simple chatbots. The introduced PEEL framework is a new approach for scenario-based testing that is closer to the implementation level than the generic benchmarks with which we test models. For good applications, it is important to test the model not only in isolation but also within its orchestration.

Get in touch with me

In my day job, I work on real-world applications with generative AI, especially in the enterprise. If you want to connect, feel free to add me or send a message on LinkedIn.






Designing, Building & Deploying an AI Chat App from Scratch (Part 2)

Cloud Deployment and Scaling

Photo by Alex wong on Unsplash

1. Introduction

In the previous post, we built an AI-powered chat application on our local computer using microservices. Our stack included FastAPI, Docker, Postgres, Nginx and llama.cpp. The goal of this post is to learn more about the fundamentals of cloud deployment and scaling by deploying our app to Azure, making it available to real users. We’ll use Azure because they offer a free education account, but the process is similar for other platforms like AWS and GCP.

You can check a live demo of the app at chat.jorisbaan.nl. Now, obviously, this demo isn’t very large-scale, because the costs ramp up very quickly. With the tight scaling limits I configured, I reckon it can handle about 10–40 concurrent users before I run out of Azure credits. However, I do hope it demonstrates the principles behind a scalable production system. We could easily configure it to scale to many more users with a higher budget.

I give a complete breakdown of our infrastructure and the costs at the end. The codebase is at https://github.com/jsbaan/ai-app-from-scratch.

A quick demo of the app at chat.jorisbaan.nl. We start a new chat, come back to that same chat, and start another chat.

1.1. Recap: local application

Let’s recap how we built our local app: A user can start or continue a chat with a language model by sending an HTTP request to http://localhost. An Nginx reverse proxy receives and forwards the request to a UI over a private Docker network. The UI stores a session cookie to identify the user, and sends requests to the backend: the language model API that generates text, and the database API that queries the database server.

Local architecture of the app. See part 1 for more details. Made by author in draw.io.

Table of contents

  1. Introduction
    1.1 Recap: local application
  2. Cloud architecture
    2.1 Scaling
    2.2 Kubernetes Concepts
    2.3 Azure Container Apps
    2.4 Azure architecture: putting it all together
  3. Deployment
    3.1 Setting up
    3.2 PostgreSQL server deployment
    3.3 Azure Container App Environment deployment
    3.4 Azure Container Apps deployment
    3.5 Scaling our Container Apps
    3.6 Custom domain name & HTTPS
  4. Resources & costs overview
  5. Roadmap
  6. Final thoughts
    Acknowledgements
    AI usage

2. Cloud architecture

Conceptually, our cloud architecture will not be too different from our local application: a bunch of containers in a private network with a gateway to the outside world, our users.

However, instead of running containers on our local computer with Docker Compose, we will deploy them to a computing environment that automatically scales across virtual or physical machines to many concurrent users.

2.1 Scaling

Scaling is a central concept in cloud architectures. It means being able to dynamically handle varying numbers of users (i.e., HTTP requests). Uvicorn, the web server running our UI and database API, can already handle about 40 concurrent requests. It’s even possible to use another web server called Gunicorn as a process manager that employs multiple Uvicorn workers in the same container, further increasing concurrency.
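As a minimal sketch, assuming our FastAPI app is exposed as app.main:app (the actual module path may differ), running several Uvicorn workers under Gunicorn inside one container could look like this:

# Hypothetical: 4 Uvicorn workers behind Gunicorn in a single container
gunicorn app.main:app \
--workers 4 \
--worker-class uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:80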

Now, if we want to support even more concurrent requests, we could give each container more resources, like CPUs or memory (vertical scaling). However, a more reliable approach is to dynamically create copies (replicas) of a container based on the number of incoming HTTP requests or memory/CPU usage, and distribute the incoming traffic across replicas (horizontal scaling). Each replica container will be assigned an IP address, so we also need to think about networking: how to centrally receive all requests and distribute them over the container replicas.

This “prism” pattern is important: requests arrive centrally in some server (a load balancer) and fan out for parallel processing to multiple other servers (e.g., several identical UI containers).

Photo of two prisms by Fernando @cferdophotography on Unsplash

2.2 Kubernetes Concepts

Kubernetes is the industry standard system for automating deployment, scaling and management of containerized applications. Its core concepts are crucial to understand modern cloud architectures, including ours, so let’s quickly review the basics.

  • Node: A physical or virtual machine that runs containerized apps or manages the cluster.
  • Cluster: A set of Nodes managed by Kubernetes.
  • Pod: The smallest deployable unit in Kubernetes. Runs one main app container with optional secondary containers that share storage and networking.
  • Deployment: An abstraction that manages the desired state of a set of Pod replicas by deploying, scaling and updating them.
  • Service: An abstraction that manages a stable entrypoint (the service’s DNS name) to expose a set of Pods by distributing incoming traffic over the various dynamic Pod IP addresses. A Service has multiple types:
    - A ClusterIP Service exposes Pods within the Cluster
    - A LoadBalancer Service exposes Pods to outside the Cluster. It triggers the cloud provider to provision an external public IP and load balancer outside the cluster that can be used to reach the cluster. These external requests are then routed via the Service to individual Pods.
  • Ingress: An abstraction that defines more complex rules for a cluster’s entrypoint. It can route traffic to multiple Services; give Services externally-reachable URLs; load balance traffic; and handle secure HTTPS.
  • Ingress Controller: Implements the Ingress rules. For example, an Nginx-based controller runs an Nginx server (like in our local app) under the hood that is dynamically configured to route traffic according to Ingress rules. To expose the Ingress Controller itself to the outside world, you can use a LoadBalancer Service. This architecture is often used.
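To make these abstractions a bit more concrete, here is a minimal, hypothetical kubectl sketch (not part of our actual deployment; the image name is a placeholder) that creates a Deployment with three replicas and exposes it through a LoadBalancer Service:

# Hypothetical sketch: run a container image as a Deployment and expose it
kubectl create deployment chat-ui --image=myregistry.azurecr.io/chat-ui --replicas=3
kubectl expose deployment chat-ui --port=80 --target-port=80 --type=LoadBalancer
kubectl get service chat-ui # the cloud provider provisions an external IP here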

2.3 Azure Container Apps

Armed with these concepts, instead of deploying our app with Kubernetes directly, I wanted to experiment a little by using Azure Container Apps (ACA). This is a serverless platform built on top of Kubernetes that abstracts away some of its complexity.

With a single command, we can create a Container App Environment, which, under the hood, is an invisible Kubernetes Cluster managed by Azure. Within this Environment, we can run a container as a Container App that Azure internally manages as Kubernetes Deployments, Services, and Pods. See article 1 and article 2 for detailed comparisons.

A Container App Environment also auto-creates:

  1. An invisible Envoy Ingress Controller that routes requests to internal Apps and handles HTTPS and App auto-scaling based on request volume.
  2. An external Public IP address and Azure Load Balancer that routes external traffic to the Ingress Controller that in turn routes it to Apps (sounds similar to a Kubernetes LoadBalancer Service, eh?).
  3. An Azure-generated URL for each Container App that is publicly accessible over the internet or internal, based on its Ingress config.

This gives us everything we need to run our containers at scale. The only thing missing is a database. We will use an Azure-managed PostgreSQL server instead of deploying our own container, because it’s easier, more reliable and scalable. Our local Nginx reverse proxy container is also obsolete because ACA automatically deploys an Envoy Ingress Controller.

It’s interesting to note that we literally don’t have to change a single line of code in our local application; we can just treat it as a bunch of containers!

2.4 Azure architecture: putting it all together

Here is a diagram of the full cloud architecture for our chat application that contains all our Azure resources. Let’s take a high level look at how a user request flows through the system.

Azure architecture diagram. Made by author in draw.io.
  1. User sends HTTPS request to chat.jorisbaan.nl.
  2. A Public DNS server like Google DNS resolves this domain name to an Azure Public IP address.
  3. The Azure Load Balancer on this IP address routes the request to the (for us invisible) Envoy Ingress Controller.
  4. The Ingress Controller routes the request to the UI Container App, which routes it to one of its Replicas, where a UI web server is running.
  5. The UI web server makes requests to the database API and language model API Apps, which both route them to one of their Replicas.
  6. A database API replica queries the PostgreSQL server by hostname. The Azure Private DNS Zone resolves the hostname to the PostgreSQL server’s IP address.

3. Deployment

So, how do we actually create all this? Rather than clicking around in the Azure Portal, infrastructure-as-code tools like Terraform are best to create and manage cloud resources. However, for simplicity, I will instead use the Azure CLI to create a bash script that deploys our entire application step by step. You can find the full deployment script including environment variables here 🤖. We will go through it step by step now.

3.1 Setting up

We need an Azure account (I’m using a free education account), a clone of the https://github.com/jsbaan/ai-app-from-scratch repo, Docker to build and push the container images, the downloaded model, and the Azure CLI to start creating cloud resources.

We first create a resource group so our resources are easier to find, manage and delete. The --location parameter refers to the physical datacenter we’ll use to deploy our app’s infrastructure. Ideally, it is close to our users. We then create a private virtual network with 256 IP addresses to isolate, secure and connect our database server and Container Apps.

brew update && brew install azure-cli # for macos

echo "Create resource group"
az group create \
--name $RESOURCE_GROUP \
--location "$LOCATION"

echo "Create VNET with 256 IP addresses"
az network vnet create \
--resource-group $RESOURCE_GROUP \
--name $VNET \
--address-prefix 10.0.0.0/24 \
--location $LOCATION

3.2 PostgreSQL server deployment

Depending on the hardware, an Azure-managed PostgreSQL database server costs about $13 to $7000 a month. To communicate with Container Apps, we put the DB server within the same private virtual network but in its own subnet. A subnet is a dedicated range of IP addresses that can have its own security and routing rules.

We create the Azure PostgreSQL Flexible Server with private access. This means only resources within the same virtual network can reach it. Azure automatically creates a Private DNS Zone that manages a hostname for the database that resolves to its IP address. The database API will later use this hostname to connect to the database server.

We will randomly generate the database credentials and store them in a secure place: Azure KeyVault.

echo "Create subnet for DB with 128 IP addresses"
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--name $DB_SUBNET \
--vnet-name $VNET \
--address-prefix 10.0.0.128/25

echo "Create a key vault to securely store and retrieve secrets, \
like the db password"
az keyvault create \
--name $KEYVAULT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION

echo "Give myself access to the key vault so I can store and retrieve \
the db password"
az role assignment create \
--role "Key Vault Secrets Officer" \
--assignee $EMAIL \
--scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.KeyVault/vaults/$KEYVAULT"

echo "Store random db username and password in the key vault"
az keyvault secret set \
--name postgres-username \
--vault-name $KEYVAULT \
--value $(openssl rand -base64 12 | tr -dc 'a-zA-Z' | head -c 12)
az keyvault secret set \
--name postgres-password \
--vault-name $KEYVAULT \
--value $(openssl rand -base64 16)

echo "While we're at it, let's already store a secret session key for the UI"
az keyvault secret set \
--name session-key \
--vault-name $KEYVAULT \
--value $(openssl rand -base64 16)

echo "Create PostgreSQL flexible server in our VNET in its own subnet. \
Auto-creates a Private DNS Zone."
POSTGRES_USERNAME=$(az keyvault secret show --name postgres-username --vault-name $KEYVAULT --query "value" --output tsv)
POSTGRES_PASSWORD=$(az keyvault secret show --name postgres-password --vault-name $KEYVAULT --query "value" --output tsv)
az postgres flexible-server create \
--resource-group $RESOURCE_GROUP \
--name $DB_SERVER \
--vnet $VNET \
--subnet $DB_SUBNET \
--location $LOCATION \
--admin-user $POSTGRES_USERNAME \
--admin-password $POSTGRES_PASSWORD \
--sku-name Standard_B1ms \
--tier Burstable \
--storage-size 32 \
--version 16 \
--yes

3.3 Azure Container App Environment deployment

With the network and database in place, let’s deploy the infrastructure to run containers — the Container App Environment (recall, this is a Kubernetes cluster under the hood).

We create another subnet with 128 IP addresses and delegate its management to the Container App Environment. The subnet should be big enough for every ten new replicas to get a new IP address in the subrange. We can then create the Environment. This is just a single command without much configuration.

echo "Create subnet for ACA with 128 IP addresses."
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--name $ACA_SUBNET \
--vnet-name $VNET \
--address-prefix 10.0.0.0/25

echo "Delegate the subnet to ACA"
az network vnet subnet update \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET \
--name $ACA_SUBNET \
--delegations Microsoft.App/environments

echo "Obtain the ID of our subnet"
ACA_SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--name $ACA_SUBNET \
--vnet-name $VNET \
--query id --output tsv)

echo "Create Container Apps Environment in our custom subnet.\\
By default, it has a Workload profile with Consumption plan."
az containerapp env create \
--resource-group $RESOURCE_GROUP \
--name $ACA_ENVIRONMENT \
--infrastructure-subnet-resource-id $ACA_SUBNET_ID \
--location $LOCATION

3.4 Azure Container Apps deployment

Each Container App needs a Docker image to run. Let’s first set up a Container Registry, then build all our images locally and push them to the registry. Note that we simply copied the model file into the language model image using its Dockerfile, so we don’t need to mount external storage like we did for local deployment in part 1.

echo "Create container registry (ACR)"
az acr create \
--resource-group $RESOURCE_GROUP \
--name $ACR \
--sku Standard \
--admin-enabled true

echo "Login to ACR and push local images"
az acr login --name $ACR
docker build --tag $ACR.azurecr.io/$DB_API $DB_API
docker push $ACR.azurecr.io/$DB_API
docker build --tag $ACR.azurecr.io/$LM_API $LM_API
docker push $ACR.azurecr.io/$LM_API
docker build --tag $ACR.azurecr.io/$UI $UI
docker push $ACR.azurecr.io/$UI

Now, onto deployment. To create Container Apps we specify their Environment, container registry, image, and the port they will listen to for requests. The ingress parameter regulates whether Container Apps can be reached from the outside world. Our two APIs are internal and therefore completely isolated, with no public URL and no traffic ever routed from the Envoy Ingress Controller. The UI is external and has a public URL, but sends internal HTTP requests over the virtual network to our APIs. We pass these internal hostnames and db credentials as environment variables.

echo "Deploy DB API on Container Apps with the db credentials from the key \
vault as env vars. More secure is to use a managed identity that allows the \
container itself to retrieve them from the key vault. But for simplicity we \
simply fetch it ourselves using the CLI."
POSTGRES_USERNAME=$(az keyvault secret show --name postgres-username --vault-name $KEYVAULT --query "value" --output tsv)
POSTGRES_PASSWORD=$(az keyvault secret show --name postgres-password --vault-name $KEYVAULT --query "value" --output tsv)
az containerapp create --name $DB_API \
--resource-group $RESOURCE_GROUP \
--environment $ACA_ENVIRONMENT \
--registry-server $ACR.azurecr.io \
--image $ACR.azurecr.io/$DB_API \
--target-port 80 \
--ingress internal \
--env-vars "POSTGRES_HOST=$DB_SERVER.postgres.database.azure.com" "POSTGRES_USERNAME=$POSTGRES_USERNAME" "POSTGRES_PASSWORD=$POSTGRES_PASSWORD" \
--min-replicas 1 \
--max-replicas 5 \
--cpu 0.5 \
--memory 1

echo "Deploy UI on Container Apps, and retrieve the secret random session \
key the UI uses to encrypt session cookies"
SESSION_KEY=$(az keyvault secret show --name session-key --vault-name $KEYVAULT --query "value" --output tsv)
az containerapp create --name $UI \
--resource-group $RESOURCE_GROUP \
--environment $ACA_ENVIRONMENT \
--registry-server $ACR.azurecr.io \
--image $ACR.azurecr.io/$UI \
--target-port 80 \
--ingress external \
--env-vars "db_api_url=http://$DB_API" "lm_api_url=http://$LM_API" "session_key=$SESSION_KEY" \
--min-replicas 1 \
--max-replicas 5 \
--cpu 0.5 \
--memory 1

echo "Deploy LM API on Container Apps"
az containerapp create --name $LM_API \
--resource-group $RESOURCE_GROUP \
--environment $ACA_ENVIRONMENT \
--registry-server $ACR.azurecr.io \
--image $ACR.azurecr.io/$LM_API \
--target-port 80 \
--ingress internal \
--min-replicas 1 \
--max-replicas 5 \
--cpu 2 \
--memory 4 \
--scale-rule-name my-http-rule \
--scale-rule-http-concurrency 2

3.5 Scaling our Container Apps

Let’s take a look at how our Container Apps scale. Container Apps can scale to zero, which means they have zero replicas and stop running (and stop incurring costs). This is a feature of the serverless paradigm, where infrastructure is provisioned on demand. The invisible Envoy proxy handles scaling based on triggers, like concurrent HTTP requests. Spawning new replicas may take some time, which is called a cold start. We set the minimum number of replicas to 1 to avoid cold starts and the resulting timeout errors for first requests.

The default scaling rule creates a new replica whenever an existing replica receives 10 concurrent HTTP requests. This applies to the UI and the database API. To test whether this scaling rule makes sense, we would have to perform load testing to simulate real user traffic and see what each Container App replica can handle individually. My guess is that they can handle a lot more than 10 concurrent requests, and we could relax the rule.
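If load testing showed that a single replica comfortably handles more traffic, relaxing the rule could look roughly like this (a sketch; the concurrency value of 50 is an assumption, and flags may vary slightly across CLI versions):

echo "Relax the HTTP scale rule on the UI App (hypothetical values)"
az containerapp update --name $UI \
--resource-group $RESOURCE_GROUP \
--min-replicas 1 \
--max-replicas 5 \
--scale-rule-name my-http-rule \
--scale-rule-type http \
--scale-rule-http-concurrency 50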

3.5.1 Scaling language model inference

Even with our small, quantized language model, inference requires much more compute than a simple FastAPI app. The inference server handles incoming requests sequentially, and the default Container App resources of 0.5 virtual CPU cores and 1GB memory result in very slow response times: up to 30 seconds for generating 128 tokens with a context window of 1024 (these parameters are defined in the LM API’s Dockerfile).

Increasing vCPU to 2 and memory to 4GB gives much better inference speed, and handles about 10 requests within 30 seconds. I configured the http scaling rule very tightly at 2 concurrent requests, so whenever 2 users chat at the same time, the LM API will scale out.

With 5 maximum replicas, I think this will allow for roughly 10–40 concurrent users, depending on the length of the chat histories. Now, obviously, this isn’t very large-scale, but with a higher budget, we could increase vCPUs, memory and the number of replicas. Ultimately we would need to move to GPU-based inference. More on that later.

3.6 Custom domain name & HTTPS

The automatically generated URL from the UI App looks like https://chat-ui.purplepebble-ac46ada4.germanywestcentral.azurecontainerapps.io/. This isn’t very memorable, so I want to make our app available as subdomain on my website: chat.jorisbaan.nl.

I simply add two DNS records on my domain registrar portal (like GoDaddy): a CNAME record that links my chat subdomain to the UI’s URL, and a TXT record to prove ownership of the subdomain to Azure and obtain a TLS certificate.

# Obtain UI URL and verification code
URL=$(az containerapp show -n $UI -g $RESOURCE_GROUP -o tsv --query "properties.configuration.ingress.fqdn")
VERIFICATION_CODE=$(az containerapp show -n $UI -g $RESOURCE_GROUP -o tsv --query "properties.customDomainVerificationId")

# Add a CNAME record with the URL and a TXT record with the verification code to domain registrar
# (Do this manually)

# Add custom domain name to UI App
az containerapp hostname add --hostname chat.jorisbaan.nl -g $RESOURCE_GROUP -n $UI
# Configure managed certificate for HTTPS
az containerapp hostname bind --hostname chat.jorisbaan.nl -g $RESOURCE_GROUP -n $UI --environment $ACA_ENVIRONMENT --validation-method CNAME

Container Apps manages a free TLS certificate for my subdomain as long as the CNAME record points directly to the container’s domain name.

The public URL for the UI changes whenever I tear down and redeploy an Environment. We could use a fancier service like Azure Front Door or Application Gateway to get a stable URL and act as a reverse proxy with additional security, global availability, and edge caching.

4. Resources & costs overview

Now that the app is deployed, let’s look at an overview of all the Azure resources the app uses. We created most of them ourselves, but Azure also automatically created a Load balancer, Public IP, Private DNS Zone, Network Watcher and Log Analytics workspace.

Screenshot of all resources from Azure Portal.
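For a quick CLI view of everything in the resource group, including the auto-created resources, something like this should work:

az resource list \
--resource-group $RESOURCE_GROUP \
--output table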

Some resources are free, others are free up to a certain time or compute budget, which is part of the reason I chose them. The following resources incur the highest costs:

  • Load Balancer (standard Tier): free for 1 month, then $18/month.
  • Container Registry (standard Tier): free for 12 months, then $19/month.
  • PostgreSQL Flexible Server (Burstable B1MS Compute Tier): free for 12 months, then at least $13/month.
  • Container App: Free for 50 CPU hours/month or 2M requests/month, then $10/month for an App with a single replica, 0.5 vCPUs and 1GB memory. The LM API with 2 vCPUs and 4GB memory costs about $50 per month for a single replica.

You can see that the costs of this small (but scalable) app can quickly add up to hundreds of dollars per month, even without a GPU server to run a stronger language model! That’s the reason why the app probably won’t be up when you’re reading this.
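When you are done experimenting, the simplest way to stop all costs is to delete the entire resource group, which tears down every resource inside it:

echo "Tear down all resources to stop incurring costs"
az group delete --name $RESOURCE_GROUP --yes --no-wait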

It also becomes clear that Azure Container Apps is more expensive than I initially thought: it requires a standard-tier Load Balancer for automatic external ingress, HTTPS and auto-scaling. We could get around this by disabling external ingress and deploying a cheaper alternative — like a VM with a custom reverse proxy, or a basic-tier Load Balancer. Still, a standard-tier Kubernetes cluster would have cost at least $150/month, so ACA can be cheaper at small scale.

5. Roadmap

Now, before we wrap up, let’s look at just a few of the many directions to improve this deployment.

Continuous Integration & Continuous Deployment. I would set up a CI/CD pipeline that runs unit and integration tests and redeploys the app upon code changes. It might be triggered by a new git commit or merged pull request. This will also make it easier to see when a service isn’t deployed properly. I would also set up monitoring and alerting to be aware of issues quickly (like a crashing Container App instance).

Lower latency: the language model server. I would load test the whole app — simulating real-world user traffic — with something like Locust or Azure Load Testing. Even without load testing, we have an obvious bottleneck: the LM server. Small and quantized as it is, it can still take quite a while to generate lengthy answers, and it handles requests with no concurrency. For more users it would be faster and more efficient to run a GPU inference server with a batching mechanism that collects multiple generation requests in a queue — perhaps with Kafka — and runs batch inference on chunks.
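For the load test itself, a minimal headless Locust run could look like this (a sketch; it assumes a locustfile.py that simulates users starting and continuing chats):

# Hypothetical: 50 simulated users, spawned at 5 per second, for 5 minutes
locust -f locustfile.py --headless -u 50 -r 5 --run-time 5m --host https://chat.jorisbaan.nl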

With even more users, we might want several GPU-based LM servers that consume from the same queue. For GPU infrastructure I’d look into Azure Virtual Machines or something more fancy like Azure Machine Learning.

The llama.cpp inference engine is good for single-user CPU-based inference. When moving to a GPU server, I would look into inference engines more suitable for batch inference, like vLLM or Huggingface TGI. And, obviously, a better (bigger) model for increased response quality — depending on the use case.
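If we did move to vLLM on a GPU machine, serving a model behind an OpenAI-compatible API could look roughly like this (a sketch; the model name is a placeholder and flags may differ per version):

pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--port 8000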

6. Final thoughts

I hope this project offers a glimpse of what an AI-powered web app in production may look like. I tried to balance realistic engineering with cutting about every corner to keep it simple, cheap, understandable, and limit my time and compute budget. Sadly, I cannot keep the app live for long since it would quickly cost hundreds of dollars per month. If someone can help with Azure credits to keep the app running, let me know!

Some closing thoughts about using managed services: although Azure Container Apps abstracts away some of the Kubernetes complexity, it’s still extremely useful to understand the lower-level Kubernetes concepts. The automatically created invisible infrastructure, like Public IPs, Load Balancers and Ingress Controllers, adds unforeseen costs and makes it difficult to understand what’s going on. Also, ACA documentation is limited compared to Kubernetes. However, if you know what you’re doing, you can set something up very quickly.

Acknowledgements

I heavily relied on the Azure docs, and the ACA docs in particular. Thanks to Dennis Ulmer for proofreading and Lucas de Haas for useful discussion.

AI usage

I experimented a bit more with AI tools compared to part 1. I used PyCharm’s Copilot plugin for code completion and had quite some back-and-forth with ChatGPT to learn about the Azure and Kubernetes ecosystems, and to spar about bugs. I double-checked everything in the docs and most of the information was solid. Like part 1, I did not use AI to write this post, though I did use ChatGPT to paraphrase some awkward sentences.


Designing, Building & Deploying an AI Chat App from Scratch (Part 2) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


