Building a Data Dashboard

Image by Author

With source data from a Postgres database

Using the streamlit Python library

As a Python data engineer for many years, one area I was not very involved in was the production of data dashboards. That all changed when Python-based libraries such as Streamlit, Gradio and Taipy came along.

With their introduction, Python programmers no longer had an excuse not to craft nice-looking front ends and dashboards.

Until then, the only other options were to use specialised tools like Tableau or AWS’s Quicksight or—horror of horrors—get your hands dirty with CSS, HTML, and Javascript.

So, if you’ve never used one of these Python-based graphical front-end libraries before, this article is for you. I’ll take you through how to code up a data dashboard using Streamlit, one of the most popular libraries for this purpose.

My intention is that this will be the first part of a series of articles on developing a data dashboard using three of the most popular Python-based GUI libraries. In addition to this one, I also plan to release articles on Gradio and Taipy, so look out for those. As much as possible, I’ll try to replicate the same layout and functionality in each dashboard. I’ll use the exact same data for all three too, albeit in different formats, e.g. a CSV file, a database, etc.
Please also note that I have no connection or affiliation with Streamlit/Snowflake, Postgres or any other company or tool mentioned in this post.

What is Streamlit?

Founded in 2018 by Adrien Treuille, Amanda Kelly, and Thiago Teixeira, Streamlit quickly gained popularity among data scientists and machine learning engineers when it introduced its open-source Python framework to simplify the creation of interactive data applications.

In March 2022, Snowflake, a Data Cloud company, acquired Streamlit, and its capabilities were integrated into the Snowflake ecosystem to enhance data application development.

Streamlit’s open-source framework has been widely adopted, with over 8 million downloads and more than 1.5 million applications built using the platform. An active community of developers and contributors continues to play a significant role in its ongoing development and success.

What we’ll develop

We’re going to develop a data dashboard. Our source data for the dashboard will be in a single Postgres database table and contain 100,000 synthetic sales records.

To be honest, the actual source of the data isn’t that important. It could just as easily be a text or CSV file, SQLite, or any database you can connect to. I chose Postgres because I have a copy on my local PC, and it's convenient for me to use.

This is what our final dashboard will look like.

Image by Author

There are four main sections.

  • The top row allows the user to choose specific start and end dates and/or product categories via date pickers and a drop-down list, respectively.
  • The second row — Key metrics — shows a top-level summary of the chosen data.
  • The Visualisation section allows the user to select one of three graphs to display the input data set.
  • The raw data section is exactly what it says: a tabular representation of the chosen data, effectively a view of the underlying Postgres database table.

Using the dashboard is easy. Initially, stats for the whole data set are displayed. The user can then narrow the data focus using the 3 choice fields at the top of the display. The graphs, key metrics and raw data sections dynamically change to reflect what the user has chosen.

The underlying data

As mentioned, the dashboard's source data is contained in a single Postgres database table. The data is a set of 100,000 synthetic sales-related data records. Here is the Postgres table creation script for reference.

CREATE TABLE IF NOT EXISTS public.sales_data
(
    order_id integer NOT NULL,
    order_date date,
    customer_id integer,
    customer_name character varying(255) COLLATE pg_catalog."default",
    product_id integer,
    product_names character varying(255) COLLATE pg_catalog."default",
    categories character varying(100) COLLATE pg_catalog."default",
    quantity integer,
    price numeric(10,2),
    total numeric(10,2)
);

And here is some Python code you can use to generate a data set for yourself. Make sure the NumPy and Polars libraries are installed first.

# Generate the 100,000-record sales data CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )

    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100

    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days

    # Create random dates as np.array and convert to string format
    order_dates = np.array(
        [
            (start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d')
            for _ in range(nrows)
        ]
    )

    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }

    # Create Polars DataFrame and write to CSV with an explicit comma delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',', include_header=True)

# Generate 100,000 rows of data with random order_date and save to CSV
generate(100_000, "/mnt/d/sales_data/sales_data.csv")
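
The rest of the article assumes this CSV has been loaded into the sales_data table. One way to do that with psycopg2 is sketched below. This is a hedged example, assuming the table from the creation script above already exists and that the connection details match the ones used later in this article.

import psycopg2

# Assumes the sales_data table already exists and these connection details
# match your local Postgres instance (both are assumptions, adjust as needed).
conn = psycopg2.connect(
    dbname="postgres",
    user="postgres",
    password="postgres",
    host="localhost",
    port="5432",
)
try:
    with conn.cursor() as cur, open("/mnt/d/sales_data/sales_data.csv", "r") as f:
        # COPY ... FROM STDIN streams the CSV straight into the table
        cur.copy_expert(
            "COPY public.sales_data FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )
    conn.commit()
finally:
    conn.close()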

Setting up our development environment

Before we get to the example code, let’s set up a separate development environment. That way, what we do won’t interfere with the library versions or code of other projects we’re working on.

I use Miniconda for this, but you can use whatever method suits you best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. You can get it here:

Miniconda - Anaconda documentation

Once the environment is created, switch to it using the activate command, and then use pip to install the required Python libraries.

#create our test environment
(base) C:\Users\thoma>conda create -n streamlit_test python=3.12 -y
# Now activate it
(base) C:\Users\thoma>conda activate streamlit_test
# Install python libraries, etc ...
(streamlit_test) C:\Users\thoma>pip install streamlit pandas matplotlib psycopg2

The Code

I’ll split the code up into sections and explain each one along the way.

#
# Streamlit equivalent of final Gradio app
#
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import psycopg2
from psycopg2 import sql
from psycopg2 import pool

# Initialize connection pool
try:
    connection_pool = psycopg2.pool.ThreadedConnectionPool(
        minconn=5,
        maxconn=20,
        dbname="postgres",
        user="postgres",
        password="postgres",
        host="localhost",
        port="5432"
    )
except psycopg2.Error as e:
    st.error(f"Error creating connection pool: {e}")

def get_connection():
    try:
        return connection_pool.getconn()
    except psycopg2.Error as e:
        st.error(f"Error getting connection from pool: {e}")
        return None

def release_connection(conn):
    try:
        connection_pool.putconn(conn)
    except psycopg2.Error as e:
        st.error(f"Error releasing connection back to pool: {e}")
We start by importing all the external libraries we’ll need. Next, we set up a ThreadedConnectionPool that allows multiple threads to share a pool of database connections. Two helper functions follow, one to get a database connection and the other to release it. This is overkill for a simple single-user app but essential for handling multiple simultaneous users or threads accessing the database in a web app environment.

def get_date_range():
    conn = get_connection()
    if conn is None:
        return None, None
    try:
        with conn.cursor() as cur:
            query = sql.SQL("SELECT MIN(order_date), MAX(order_date) FROM public.sales_data")
            cur.execute(query)
            return cur.fetchone()
    finally:
        release_connection(conn)

def get_unique_categories():
    conn = get_connection()
    if conn is None:
        return []
    try:
        with conn.cursor() as cur:
            query = sql.SQL("SELECT DISTINCT categories FROM public.sales_data ORDER BY categories")
            cur.execute(query)
            return [row[0].capitalize() for row in cur.fetchall()]
    finally:
        release_connection(conn)

def get_dashboard_stats(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return None
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                WITH category_totals AS (
                    SELECT
                        categories,
                        SUM(price * quantity) as category_revenue
                    FROM public.sales_data
                    WHERE order_date BETWEEN %s AND %s
                    AND (%s = 'All Categories' OR categories = %s)
                    GROUP BY categories
                ),
                top_category AS (
                    SELECT categories
                    FROM category_totals
                    ORDER BY category_revenue DESC
                    LIMIT 1
                ),
                overall_stats AS (
                    SELECT
                        SUM(price * quantity) as total_revenue,
                        COUNT(DISTINCT order_id) as total_orders,
                        SUM(price * quantity) / COUNT(DISTINCT order_id) as avg_order_value
                    FROM public.sales_data
                    WHERE order_date BETWEEN %s AND %s
                    AND (%s = 'All Categories' OR categories = %s)
                )
                SELECT
                    total_revenue,
                    total_orders,
                    avg_order_value,
                    (SELECT categories FROM top_category) as top_category
                FROM overall_stats
            """)
            cur.execute(query, [start_date, end_date, category, category,
                                start_date, end_date, category, category])
            return cur.fetchone()
    finally:
        release_connection(conn)

The get_date_range function executes the SQL query to find the range of dates (MIN and MAX) in the order_date column and returns the two dates as a tuple: (start_date, end_date).

The get_unique_categories function runs an SQL query to fetch unique values from the categories column. It capitalizes the category names (first letter uppercase) before returning them as a list.

The get_dashboard_stats function executes a single SQL query. Note the repeated condition (%s = 'All Categories' OR categories = %s): when the user selects "All Categories", the first comparison is always true, so the category filter is effectively disabled; otherwise, only the chosen category is included. The query is built from the following parts:

  • category_totals: Calculates total revenue for each category in the given date range.
  • top_category: Finds the category with the highest revenue.
  • overall_stats: Computes overall statistics:
    - Total revenue (SUM(price * quantity)).
    - Total number of unique orders (COUNT(DISTINCT order_id)).
    - Average order value (total revenue divided by total orders).

It returns a single row containing:

  • total_revenue: Total revenue in the specified period.
  • total_orders: Number of distinct orders.
  • avg_order_value: Average revenue per order.
  • top_category: The category with the highest revenue.

def get_plot_data(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT DATE(order_date) as date,
                       SUM(price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                AND (%s = 'All Categories' OR categories = %s)
                GROUP BY DATE(order_date)
                ORDER BY date
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=['date', 'revenue'])
    finally:
        release_connection(conn)

def get_revenue_by_category(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT categories,
                       SUM(price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                AND (%s = 'All Categories' OR categories = %s)
                GROUP BY categories
                ORDER BY revenue DESC
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=['categories', 'revenue'])
    finally:
        release_connection(conn)

def get_top_products(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT product_names,
                       SUM(price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                AND (%s = 'All Categories' OR categories = %s)
                GROUP BY product_names
                ORDER BY revenue DESC
                LIMIT 10
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=['product_names', 'revenue'])
    finally:
        release_connection(conn)

def get_raw_data(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT
                    order_id, order_date, customer_id, customer_name,
                    product_id, product_names, categories, quantity, price,
                    (price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                AND (%s = 'All Categories' OR categories = %s)
                ORDER BY order_date, order_id
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])
    finally:
        release_connection(conn)

def plot_data(data, x_col, y_col, title, xlabel, ylabel, orientation='v'):
    fig, ax = plt.subplots(figsize=(10, 6))
    if not data.empty:
        if orientation == 'v':
            ax.bar(data[x_col], data[y_col])
        else:
            ax.barh(data[x_col], data[y_col])
        ax.set_title(title)
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        plt.xticks(rotation=45)
    else:
        ax.text(0.5, 0.5, "No data available", ha='center', va='center')
    return fig

The get_plot_data function fetches daily revenue within the given date range and category. It retrieves data grouped by the day (DATE(order_date)) and calculates daily revenue (SUM(price * quantity)), then returns a Pandas DataFrame with columns: date (the day) and revenue (total revenue for that day).

The get_revenue_by_category function fetches revenue totals grouped by category within the specified date range. It groups data by categories and calculates revenue for each category (SUM(price * quantity)), orders the results by revenue in descending order and returns a Pandas DataFrame with columns: categories (category name) and revenue (total revenue for the category).

The get_top_products function retrieves the top 10 products by revenue within the given date range and category. It groups data by product_names and calculates revenue for each product (SUM(price * quantity)), orders the products by revenue in descending order and limits results to the top 10 before returning a Pandas DataFrame with columns: product_names (product name) and revenue (total revenue for the product).

The get_raw_data function fetches raw transaction data within the specified date range and category.

The plot_data function takes in some data (in a pandas DataFrame) and the names of the columns you want to plot on the x- and y-axes. It then creates a bar chart — either vertical or horizontal, depending on the chosen orientation — labels the axes, adds a title, and returns the finished chart (a Matplotlib Figure). If the data is empty, it just displays a “No data available” message instead of trying to plot anything.

# Streamlit App
st.title("Sales Performance Dashboard")

# Filters
with st.container():
    col1, col2, col3 = st.columns([1, 1, 2])
    min_date, max_date = get_date_range()
    start_date = col1.date_input("Start Date", min_date)
    end_date = col2.date_input("End Date", max_date)
    categories = get_unique_categories()
    category = col3.selectbox("Category", ["All Categories"] + categories)

# Custom CSS for metrics
st.markdown("""
    <style>
    .metric-row {
        display: flex;
        justify-content: space-between;
        margin-bottom: 20px;
    }
    .metric-container {
        flex: 1;
        padding: 10px;
        text-align: center;
        background-color: #f0f2f6;
        border-radius: 5px;
        margin: 0 5px;
    }
    .metric-label {
        font-size: 14px;
        color: #555;
        margin-bottom: 5px;
    }
    .metric-value {
        font-size: 18px;
        font-weight: bold;
        color: #0e1117;
    }
    </style>
""", unsafe_allow_html=True)

# Metrics
st.header("Key Metrics")
stats = get_dashboard_stats(start_date, end_date, category)
if stats:
    total_revenue, total_orders, avg_order_value, top_category = stats
else:
    total_revenue, total_orders, avg_order_value, top_category = 0, 0, 0, "N/A"

# Custom metrics display
metrics_html = f"""
<div class="metric-row">
    <div class="metric-container">
        <div class="metric-label">Total Revenue</div>
        <div class="metric-value">${total_revenue:,.2f}</div>
    </div>
    <div class="metric-container">
        <div class="metric-label">Total Orders</div>
        <div class="metric-value">{total_orders:,}</div>
    </div>
    <div class="metric-container">
        <div class="metric-label">Average Order Value</div>
        <div class="metric-value">${avg_order_value:,.2f}</div>
    </div>
    <div class="metric-container">
        <div class="metric-label">Top Category</div>
        <div class="metric-value">{top_category}</div>
    </div>
</div>
"""
st.markdown(metrics_html, unsafe_allow_html=True)

This code section creates the main structure for displaying the key metrics in the Streamlit dashboard. It:

  1. Sets up the page title: “Sales Performance Dashboard.”
  2. Presents filters for start/end dates and category selection.
  3. Retrieves metrics (such as total revenue, total orders, etc.) for the chosen filters from the database.
  4. Applies custom CSS to style these metrics in a row of boxes with labels and values.
  5. Displays the metrics within an HTML block, ensuring each metric gets its own styled container.

# Visualization Tabs
st.header("Visualizations")
tabs = st.tabs(["Revenue Over Time", "Revenue by Category", "Top Products"])

# Revenue Over Time Tab
with tabs[0]:
    st.subheader("Revenue Over Time")
    revenue_data = get_plot_data(start_date, end_date, category)
    st.pyplot(plot_data(revenue_data, 'date', 'revenue', "Revenue Over Time", "Date", "Revenue"))

# Revenue by Category Tab
with tabs[1]:
    st.subheader("Revenue by Category")
    category_data = get_revenue_by_category(start_date, end_date, category)
    st.pyplot(plot_data(category_data, 'categories', 'revenue', "Revenue by Category", "Category", "Revenue"))

# Top Products Tab
with tabs[2]:
    st.subheader("Top Products")
    top_products_data = get_top_products(start_date, end_date, category)
    st.pyplot(plot_data(top_products_data, 'product_names', 'revenue', "Top Products", "Revenue", "Product Name", orientation='h'))

This section adds a header titled “Visualizations” to this part of the dashboard. It creates three tabs, each of which displays a different graphical representation of the data:

Tab 1: Revenue Over Time

  • Fetches revenue data grouped by date for the given filters using get_plot_data().
  • Calls plot_data() to generate a bar chart of revenue over time, with dates on the x-axis and revenue on the y-axis.
  • Displays the chart in the first tab.

Tab 2: Revenue by Category

  • Fetches revenue grouped by category using get_revenue_by_category().
  • Calls plot_data() to create a bar chart of revenue by category.
  • Displays the chart in the second tab.

Tab 3: Top Products

  • Fetches top 10 products by revenue for the given filters using get_top_products().
  • Calls plot_data() to create a horizontal bar chart (indicated by orientation='h').
  • Displays the chart in the third tab.

st.header("Raw Data")

raw_data = get_raw_data(
start_date=start_date,
end_date=end_date,
category=category
)

# Remove the index by resetting it and dropping the old index
raw_data = raw_data.reset_index(drop=True)

st.dataframe(raw_data,hide_index=True)

# Add spacing
st.write("")

The final section displays the raw data in a dataframe. The user is able to scroll up and down as required to see all records available.

An empty st.write("") is added at the end to provide spacing for better visual alignment.

Running the App

Let’s say you save your code into a file called app.py. You can then run it from the command line like this:

(streamlit_test) C:\Users\thoma> python -m streamlit run app.py

If everything works as expected, you will see this after you run the above command.


You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://192.168.0.59:8501

Click on the Local URL shown, and a browser tab should open with the Streamlit app running.

Summary

In this article, I’ve attempted to provide a comprehensive guide to building an interactive sales performance dashboard using Streamlit with a Postgres database table as its source data.

Streamlit is a modern, Python-based open-source framework that simplifies the creation of data-driven dashboards and applications. The dashboard I developed allows users to filter data by date range and product category, view key metrics such as total revenue and top-performing category, explore visualizations like revenue trends and top products, and scroll through the raw underlying data.

This guide includes a complete implementation, from setting up a Postgres database with sample data to creating Python functions for querying data, generating plots, and handling user input. This step-by-step approach demonstrates how to leverage Streamlit’s capabilities to create user-friendly and dynamic dashboards, making it ideal for data engineers and scientists who want to build interactive data applications.

Although I used Postgres for my data, it should be straightforward to modify the code to use a CSV file or any other relational database management system (RDBMS), such as SQLite, as your data source.
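
For instance, swapping Postgres for SQLite mainly means replacing the connection helpers. Below is a hedged sketch assuming the same sales_data table has been loaded into a hypothetical local sales.db file; the query functions would also need small tweaks (drop the public. schema prefix, use ? placeholders instead of %s, and plain SQL strings instead of psycopg2's sql.SQL).

import sqlite3

DB_PATH = "sales.db"  # hypothetical SQLite file containing the sales_data table

def get_connection():
    # SQLite needs no connection pool for a small app; open a connection per request.
    return sqlite3.connect(DB_PATH)

def release_connection(conn):
    # Simply close the connection instead of returning it to a pool.
    conn.close()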

That’s all from me for now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content.


Why Generative-AI Apps’ Quality Often Sucks and What to Do About It

How to get from PoCs to tested high-quality applications in production

Image licensed from elements.envato.com, edit by Marcel Müller, 2025

The generative AI hype has rolled through the business world in the past two years. This technology can make business process executions more efficient, reduce wait time, and reduce process defects. Some interfaces, like ChatGPT, make interacting with an LLM easy and accessible. Anyone with experience using a chat application can effortlessly type a query, and ChatGPT will always generate a response. Yet the quality of the generated content, and its suitability for its intended use, may vary. This is especially true for enterprises that want to use generative AI technology in their business operations.

I have spoken to countless managers and entrepreneurs who failed in their endeavors because they could not get high-quality generative AI applications into production or get reproducible results from a non-deterministic model. On the other hand, I have also built more than three dozen AI applications, and I keep running into one common misconception about quality in generative AI applications: people think it is all about how powerful the underlying model is. But this is only 30% of the full story.

There are dozens of techniques, patterns, and architectures that help create impactful LLM-based applications of the quality that businesses desire. Different foundation models, fine-tuned models, architectures with retrieval augmented generation (RAG) and advanced processing pipelines are just the tip of the iceberg.

This article shows how we can qualitatively and quantitatively evaluate generative AI applications in the context of concrete business processes. We will not stop at generic benchmarks but introduce approaches for evaluating complete applications end to end. After a quick analysis of generative AI applications and the business processes they support, we will look into the following questions:

  • In what context do we need to evaluate generative AI applications to assess their end-to-end quality and utility in enterprise applications?
  • When in the development life cycle of a generative AI application do we use which evaluation approach, and what are the objectives?
  • How do we use different metrics, in isolation and in production, to select, monitor and improve the quality of generative AI applications?

This overview will give us an end-to-end evaluation framework for generative AI applications in enterprise scenarios that I call the PEEL (performance evaluation for enterprise LLM applications). Based on the conceptual framework created in this article, we will introduce an implementation concept as an addition to the entAIngine Test Bed module as part of the entAIngine platform.

1. Background: Business Processes and Generative AI

An organization lives by its business processes. Everything in a company can be a business process, such as customer support, software development, and operations processes. Generative AI can improve our business processes by making them faster and more efficient, reducing wait time and improving outcome quality. Yet each process activity that uses generative AI can itself be broken down further.

Processes for generative AI applications. © 2025, Marcel Müller

The illustration shows the start of a simple business process that a telecommunications company's customer support agent goes through. Every time a new customer support request comes in, the agent has to assign it a priority level. When that request reaches the top of their work list, the agent must find the correct answer and write an answer email. Afterward, they send the email to the customer, wait for a reply, and iterate until the request is solved.

We can use a generative AI workflow to make the “find and write answer” activity more efficient. Yet, this activity is often not a single call to ChatGPT or another LLM but a collection of different tasks. In our example, the telco company has built a pipeline using the entAIngine process platform that consists of the following steps.

  • Extract the question and generate a query to the vector database. The example company has a vector database as a knowledge base for retrieval augmented generation (RAG). We need to extract the essence of the customer’s question from their request email so that we can build the best query and find the sections in the knowledge base that are semantically as close as possible to the question.
  • Find context in the knowledge base. The semantic search activity is the next step in our process. Retrieval-reranking structures are often used to get the top k context chunks relevant to the query and sort them with an LLM. This step aims to retrieve the correct context information to generate the best answer possible.
  • Use context to generate an answer. This step orchestrates a large language model using a prompt and the selected context as input to the prompt.
  • Write an answer email. The final step transforms the pre-formulated answer into a formal email with the correct intro and ending to the message in the company’s desired tone and complexity.

The execution of processes like this is called the orchestration of an advanced LLM workflow. There are dozens of other orchestration architectures in enterprise contexts. Using a chat interface that uses the current prompt and the chat history is also a simple type of orchestration. Yet, for reproducible enterprise workflows with sensitive company data, using a simple chat orchestration is not enough in many cases, and advanced workflows like those shown above are needed.
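
To make the idea concrete, here is a minimal Python sketch of such a four-step orchestration. Everything in it is illustrative: call_llm and search_knowledge_base are hypothetical stand-ins for a model call and a vector-store query, not entAIngine APIs.

from typing import Dict, List

# Hypothetical stand-ins for an LLM call and a vector-store query; in a real
# system these would wrap your model provider and vector database clients.
def call_llm(prompt: str) -> str:
    return f"<LLM output for: {prompt[:40]}...>"

def search_knowledge_base(query: str, top_k: int = 5) -> List[Dict]:
    return [{"source": "AR93 manual, p. 12", "text": "Hold the reset button for 3 seconds."}][:top_k]

def answer_support_email(email_text: str) -> str:
    # Step 1: extract the core question from the customer's email
    question = call_llm(f"Extract the customer's question from this email:\n{email_text}")
    # Step 2: semantic search over the knowledge base
    chunks = search_knowledge_base(query=question, top_k=5)
    # Step 3: generate an answer grounded in the retrieved context
    context = "\n\n".join(c["text"] for c in chunks)
    answer = call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
    # Step 4: rewrite the draft as a formal support email in the company's tone
    return call_llm(f"Rewrite this answer as a polite, formal support email:\n{answer}")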

Thus, when we evaluate complex generative AI orchestrations in enterprise scenarios, looking purely at the capabilities of a foundational (or fine-tuned) model is, in many cases, just the start. The following section dives deeper into the context and orchestration in which we need to evaluate generative AI applications.

2. Concept

The following sections introduce the core concepts for our approach.

My team has built the entAIngine platform, which is quite unique in that it enables low-code creation of generative AI applications that are not necessarily chatbots. We have also implemented the following approach on entAIngine. If you want to try it out, message me; or, if you want to build your own test bed functionality, feel free to take inspiration from the concept below.

2.1. Context and Orchestration of Performance Evaluation for Generative AI Applications

When evaluating the performance of generative AI applications in their orchestrations, we have the following choices: we can evaluate a foundational model in isolation, a fine-tuned model in isolation, or either of those as part of a larger orchestration, including several calls to different models and RAG. This has the following implications.

Context and orchestration for LLM-based applications. © Marcel Müller, 2025

Publicly available generative AI models like (for LLMs) GPT-4o, Llama 3.2 and many others were trained on the “public wisdom of the internet.” Their training sets included a large corpus of knowledge from books, world literature, Wikipedia articles, and other internet crawls from forums and blog posts. No company-internal knowledge is encoded in foundational models. Thus, when we evaluate a foundational model in isolation, we can only assess its general ability to answer queries. We cannot judge the extent of company-specific knowledge, i.e. “how much the model knows” about the company; that knowledge only comes into play with advanced orchestration that inserts company-specific context.

For example, with a free account from ChatGPT, anyone can ask, “How did Goethe die?” The model will provide an answer because the key information about Goethe’s life and death is in the model’s knowledge base. Yet, the question “How much revenue did our company make last year in Q3 in EMEA?” will most likely lead to a heavily hallucinated answer which will seem plausible to inexperienced users. However, we can still evaluate the form and representation of the answers, including style and tone, as well as language capabilities and skills concerning reasoning and logical deduction. Synthetic benchmarks such as ARC, HellaSwag, and MMLU provide comparative metrics for those dimensions. We will take a deeper look into those benchmarks in a later section.

Fine-tuned models build on foundational models. They use additional data sets to add knowledge to a model that was not there before, by further training the underlying machine learning model. Fine-tuned models have more context-specific knowledge. If we orchestrate them in isolation, without any other ingested data, we can evaluate how suitable their knowledge base is for real-world scenarios in a given business process. Fine-tuning is often used to add domain-specific vocabulary and sentence structures to a foundational model.

Suppose we train a model on a corpus of legal court rulings. The fine-tuned model will start using the vocabulary and reproducing the sentence structures common in the legal domain. It can combine excerpts from old cases, but it will typically fail to quote the right sources.

Orchestrating foundational models or fine-tuned models with retrieval augmented generation (RAG) produces highly context-dependent results. However, this also requires a more complex orchestration pipeline.

For example, a telco company, like in our example above, can use a language model to create embeddings of their customer support knowledge base and store them in a vector store. We can now efficiently query this knowledge base in a vector store with semantic search. By keeping track of the text segments that are retrieved, we can very precisely show the source of the retrieved text chunk and use it as context in a call to a large language model. This lets us answer our question end-to-end.
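
To illustrate the retrieval part, here is a minimal sketch of semantic search over pre-computed embeddings with cosine similarity. The embed function and the stored chunks are hypothetical placeholders; a real system would use an embedding model and a vector database, but the point stands: keeping the source alongside each chunk keeps every answer traceable.

import hashlib
import numpy as np

# Hypothetical embedding function; a real system would call an embedding model.
def embed(text: str) -> np.ndarray:
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=384)
    return v / np.linalg.norm(v)

# Knowledge base chunks with their sources, embedded ahead of time.
chunks = [
    {"source": "AR93 manual, p. 12", "text": "Hold the reset button for 3 seconds."},
    {"source": "BD77 manual, p. 4", "text": "Connect the fiber cable to port 1."},
]
chunk_vectors = np.stack([embed(c["text"]) for c in chunks])

def semantic_search(query: str, top_k: int = 2):
    q = embed(query)
    scores = chunk_vectors @ q  # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    # Return each chunk together with its source so the answer stays traceable.
    return [(chunks[i]["source"], chunks[i]["text"], float(scores[i])) for i in best]

print(semantic_search("How do I reset my router?"))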

We can evaluate how well our application serves its intended purpose end-to-end for such large orchestrations with different data processing pipeline steps.

Evaluating those different types of setups gives us different insights that we can use in the development process of generative AI applications. We will look deeper into this aspect in the next section.

2.2 Evaluation of Generative AI Applications in the Development Lifecycle

We develop generative AI applications in different stages: 1) before building, 2) during build and testing, and 3) in production. With an agile approach, these stages are not executed in a linear sequence but iteratively. Yet, the goals and methods of evaluation in the different stages remain the same regardless of their order.

Before building, we need to evaluate which foundational model to choose or whether to create a new one from scratch. Therefore, we must first define our expectations and requirements, especially w.r.t. execution time, efficiency, price and quality. Currently, only very few companies decide to build their own foundational models from scratch due to cost and updating efforts. Fine-tuning and retrieval augmented generation are the standard tools to build highly personalized pipelines with traceable internal knowledge that leads to reproducible outputs. In this stage, synthetic benchmarks are the go-to approaches to achieve comparability. For example, if we want to build an application that helps lawyers prepare their cases, we need a model that is good at logical argumentation and understanding of a specific language.

During building, our evaluation needs to focus on satisfying the quality and performance requirements of the application’s example cases. When building an application for lawyers, for instance, we need a representative selection of a limited number of old cases. Those cases are the basis for defining the standard scenarios against which we implement the application. If the lawyer specializes in financial law and taxation, we would select a few standard cases of that type and turn them into scenarios. Every building and evaluation activity in this phase covers only these representative scenarios, not every possible instance. Still, we need to evaluate against the scenarios continuously throughout development.

In production, our evaluation approach focuses on quantitatively evaluating the real-world usage of our application with the expectations of live users. In production, we will find scenarios that are not covered in our building scenarios. The goal of the evaluation in this phase is to discover those scenarios and gather feedback from live users to improve the application further.

The production phase should always feed back into the development phase to improve the application iteratively. Hence, the three phases are not in a linear sequence, but interleaving.

2.3. Benchmark Metrics for Evaluation

With the “what” and “when” of the evaluation covered, we have to ask “how” we are going to evaluate our generative AI applications. For this, we have three different methods: synthetic benchmarks, limited scenarios, and feedback-loop evaluation in production.

For synthetic benchmarks, we will look into the most commonly used approaches and compare them.

The AI2 Reasoning Challenge (ARC) tests an LLM’s knowledge and reasoning using a dataset of 7787 multiple-choice science questions. These questions range from 3rd to 9th grade and are divided into Easy and Challenge sets. ARC is useful for evaluating diverse knowledge types and pushing models to integrate information from multiple sentences. Its main benefit is comprehensive reasoning assessment, but it’s limited to scientific questions.

HellaSwag tests commonsense reasoning and natural language inference through sentence completion exercises based on real-world scenarios. Each exercise includes a video caption context and four possible endings. This benchmark measures an LLM’s understanding of everyday scenarios. Its main benefit is the complexity added by adversarial filtering, but it primarily focuses on general knowledge, limiting specialized domain testing.

The MMLU (Massive Multitask Language Understanding) benchmark measures an LLM’s natural language understanding across 57 tasks covering various subjects, from STEM to humanities. It includes 15,908 questions from elementary to advanced levels. MMLU is ideal for comprehensive knowledge assessment. Its broad coverage helps identify deficiencies, but limited construction details and errors may affect reliability.

TruthfulQA evaluates an LLM’s ability to generate truthful answers, addressing hallucinations in language models. It measures how accurately an LLM can respond, especially when training data is insufficient or low quality. This benchmark is useful for assessing accuracy and truthfulness, with the main benefit of focusing on factually correct answers. However, its general knowledge dataset may not reflect truthfulness in specialized domains.

The RAGAS framework is designed to evaluate Retrieval Augmented Generation (RAG) pipelines. It is especially useful for the category of LLM applications that utilize external data to enhance the LLM’s context. The framework introduces metrics for faithfulness, answer relevancy, context recall, context precision, context relevancy, context entity recall and summarization score that can be used to assess the quality of the retrieved and generated outputs in a differentiated way.

WinoGrande tests an LLM’s commonsense reasoning through pronoun resolution problems based on the Winograd Schema Challenge. It presents near-identical sentences with different answers based on a trigger word. This benchmark is beneficial for resolving ambiguities in pronoun references, featuring a large dataset and reduced bias. However, annotation artifacts remain a limitation.

The GSM8K benchmark measures an LLM’s multi-step mathematical reasoning using around 8,500 grade-school-level math problems. Each problem requires multiple steps involving basic arithmetic operations. This benchmark highlights weaknesses in mathematical reasoning, featuring diverse problem framing. However, the simplicity of problems may limit their long-term relevance.

SuperGLUE enhances the GLUE benchmark by testing an LLM’s NLU capabilities across eight diverse subtasks, including Boolean Questions and the Winograd Schema Challenge. It provides a thorough assessment of linguistic and commonsense knowledge. SuperGLUE is ideal for broad NLU evaluation, with comprehensive tasks offering detailed insights. However, fewer models are tested compared to benchmarks similar to MMLU.

HumanEval measures an LLM’s ability to generate functionally correct code through coding challenges and unit tests. It includes 164 coding problems with several unit tests per problem. This benchmark assesses coding and problem-solving capabilities, focusing on functional correctness similar to human evaluation. However, it only covers some practical coding tasks, limiting its comprehensiveness.

MT-Bench evaluates an LLM’s capability in multi-turn dialogues by simulating real-life conversational scenarios. It measures how effectively chatbots engage in conversations, following a natural dialogue flow. With a carefully curated dataset, MT-Bench is useful for assessing conversational abilities. However, its small dataset and the difficulty of simulating real conversations remain limitations.

All those metrics are synthetic and aim to provide a relative comparison between different LLMs. However, their concrete relevance for a company’s use case depends on how well the challenge in that scenario maps to the benchmark. For example, in tax accounting use cases where a lot of math is needed, GSM8K would be a good candidate to evaluate that capability. HumanEval is the natural first choice when using an LLM in a coding-related scenario.

2.4. Real-life Scenario-based Evaluation

However, the impact of those benchmarks is rather abstract and only gives an indication of a model’s performance in an enterprise use case. This is where working with real-life scenarios comes in.

Real-life scenarios consist of the following components:

  • case-specific context data (input),
  • case-independent context data,
  • a sequence of tasks to complete and
  • the expected output.

With real-life test scenarios, we can model different situations, like

  • multi-step chat interactions with several questions and answers,
  • complex automation tasks with multiple AI interactions,
  • processes that involve RAG and
  • multi-modal process interactions.

In other words, it does not help anyone to have the best model in the world if the RAG pipeline always returns mediocre results because your chunking strategy is not good. Also, if you do not have the right data to answer your queries, you will always get some hallucinations that may or may not be close to the truth. In the same way, your results will vary based on the hyperparameters of your chosen models (temperature, frequency penalty, etc.). And we cannot simply use the most powerful model for every use case if it is also an expensive one.

Standard benchmarks focus on the individual models rather than on the big picture. That is why we introduce the PEEL framework for performance evaluation of enterprise LLM applications, which gives us an end-to-end view.

The core concept of PEEL is the evaluation scenario. We distinguish between an evaluation scenario definition and an evaluation scenario execution. The conceptual illustration shows the overall concepts in black, an example definition in blue and the outcome of one instance of an execution in green.

The concept of evaluation scenarios as introduced by the PEEL framework © Marcel Müller

An evaluation scenario definition consists of input definitions, an orchestration definition and an expected output definition.

For the input, we distinguish between case-specific and case-independent context data. Case-specific context data changes from case to case. For example, in the customer support use case, the question a customer asks differs from case to case. In our example evaluation execution, we depict one case where the email inquiry reads as follows:

“Dear customer support,

my name is […]. How do I reset my router when I move to a different apartment?

Kind regards, […] “

Yet the knowledge base, the large documents in which the answers to such questions are located, is case-independent. In our example, we have a knowledge base with the PDF manuals for the routers AR83, AR93, AR94 and BD77 stored in a vector store.

An evaluation scenario definition has an orchestration. An orchestration consists of a series of n >= 1 steps that are executed in sequence during the evaluation scenario execution. Each step takes its inputs from any of the previous steps or from the input to the scenario execution. Steps can be interactions with LLMs (or other models), context retrieval tasks (for example, from a vector database) or other calls to data sources. For each step, we distinguish between the prompt/request and the execution parameters. The execution parameters include the model or method to execute and its hyperparameters. The prompt/request is a collection of static or dynamic data pieces that are concatenated (see illustration).

In our example, we have a three-step orchestration. In step 1, we extract a single question from the case-specific input context (the customer’s email inquiry). We use this question in step 2 to create a semantic search query in our vector database using the cosine similarity metric. The last step takes the search results and formulates an email using an LLM.
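
As a rough illustration, the pieces of such a scenario definition could be modeled with plain Python data classes like below. The class and field names are my own shorthand, not the PEEL or entAIngine data model.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Step:
    name: str                      # e.g. "extract question"
    model_or_method: str           # e.g. "gpt-4o" or "cosine_similarity_search"
    hyperparameters: Dict          # e.g. {"temperature": 0.2}
    prompt_parts: List[str]        # static text plus references to earlier outputs

@dataclass
class EvaluationScenarioDefinition:
    case_specific_input: Dict      # e.g. {"email": "How do I reset my router?"}
    case_independent_context: str  # e.g. "vector store with router manuals"
    orchestration: List[Step]      # n >= 1 steps executed in sequence
    expected_output: str           # what a good final answer should contain
    evaluation_method: str         # "exact_match", "semantic_match" or "manual_match"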

In an evaluation scenario definition, we have an expected output and an evaluation method. Here, we define for every scenario how we want to evaluate the actual outcome vs. the expected outcome. We have the following options:

  • Exact match/regex match: We check for the occurrence of a specific series of terms/concepts and give as an answer a boolean where 0 means that the defined terms did not appear in the output of the execution and 1 means they did appear. For example, the core concept of installing a router at a new location is pressing the reset button for 3 seconds. If the terms “reset button” and “3 seconds” are not part of the answer, we would evaluate it as a failure.
  • Semantic match: We check if the text is semantically close to what our expected answer is. Therefore, we use an LLM and task it to judge with a rational number between 0 and 1 how well the answer matches the expected answer.
  • Manual match: Humans evaluate the output on a scale between 0 and 1.

An evaluation scenario should be executed many times because LLMs are non-deterministic models. We want to have a reasonable number of executions so we can aggregate the scores and have a statistically significant output.

The benefit of using such scenarios is that we can use them while building and debugging our orchestrations. When we see that 80 out of 100 executions of the same prompt score less than 0.3, we use this signal to tweak our prompts or to add other data to our fine-tuning before orchestration.
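
A minimal sketch of how the three match methods and the score aggregation could look is below. The run_scenario callable and llm_judge function are hypothetical placeholders for executing one orchestration and for an LLM-as-a-judge call.

import re
import statistics
from typing import Callable, List

def llm_judge(prompt: str) -> float:
    # Hypothetical LLM-as-a-judge call; a real implementation would prompt a
    # model and parse its numeric answer. Here it is only a placeholder.
    return 0.5

def exact_match(output: str, required_terms: List[str]) -> float:
    # 1.0 only if every required term/regex appears in the output, else 0.0
    return float(all(re.search(term, output, re.IGNORECASE) for term in required_terms))

def semantic_match(output: str, expected: str) -> float:
    # Ask an LLM to judge similarity between expected and actual answers (0 to 1).
    return llm_judge(f"Score 0-1 how well this answer matches the expectation.\n"
                     f"Expected: {expected}\nActual: {output}")

def evaluate_scenario(run_scenario: Callable[[], str],
                      scorer: Callable[[str], float],
                      n_runs: int = 100) -> dict:
    # LLMs are non-deterministic, so run the same scenario many times and aggregate.
    scores = [scorer(run_scenario()) for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if n_runs > 1 else 0.0,
        "share_below_0.3": sum(s < 0.3 for s in scores) / n_runs,
    }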

2.5. Feedback Collection and Adjustment in Production

The principle for collecting feedback in production is analogous to the scenario approach. We map each user interaction to a scenario. If the user has larger degrees of freedom of interaction, we might need to create new scenarios that we did not anticipate during the building phase.

The user gets a slider between 0 and 1 on which they can indicate how satisfied they were with the output. From a user experience perspective, this number can also be presented differently, for example as a laughing, neutral or sad smiley. This evaluation is therefore the manual match method that we introduced before.

In production, we have to create the same aggregations and metrics as before, just with live users and a potentially larger amount of data.

3. Example Implementation as Part of entAIngine Test Bed

Together with the entAIngine team, we have implemented this functionality on the platform. This section shows how things could be done and is meant to give you inspiration. And if you want to use what we have implemented, feel free to.

We map our concepts of evaluation scenarios and evaluation scenario definitions to classic concepts of software testing. The starting point for creating a new test is the entAIngine application dashboard.

entAIngine dashboard © Marcel Müller

In entAIngine, users can create many different applications. Each application is a set of processes that define workflows in a no-code interface. Processes consist of input templates (variables), RAG components, calls to LLMs, TTS, image and audio modules, and integration with documents and OCR. With these components, we build reusable processes that can be integrated via an API, used as chat flows, used in a text editor as a dynamic text-generating block, or used in a knowledge management search interface that shows the sources of answers. This functionality is already completely implemented in the entAIngine platform and can be used as SaaS or deployed 100% on-premise. It integrates with existing gateways, data sources and models via API. We will use the process template generator to create evaluation scenario definitions.

When the user wants to create a new test, they go to “test bed” and “tests”.

On the tests screen, the user can create new evaluation scenarios or edit existing ones. When creating a new evaluation scenario, the orchestration (an entAIngine process template) and a set of metrics must be defined. We assume we have a customer support scenario where we need to retrieve data with RAG to answer a question in the first step and then formulate an answer email in the second step. Then, we use the new module to name the test, define or select a process template, and pick an evaluator that will create a score for every individual test case.

Test definition © Marcel Müller, 2025
Test case (process template) definition © Marcel Müller, 2025

The metrics are as defined above: regex match, semantic match and manual match. The screen with the process definition already exists and is functional, together with the orchestration. The functionality to define tests in bulk, as seen below, is new.

Test and test cases © Marcel Müller, 2025

In the test editor, we work on an evaluation scenario definition (“evaluate how good our customer support answering RAG is”) and define different test cases within that scenario. A test case assigns data values to the variables in the test. We can try 50 or 100 different test cases, evaluate them, and aggregate the results. For example, to evaluate our customer support answering, we can define 100 different customer support requests, define the expected outcomes, then execute them and analyze how good the answers were. Once we have designed a set of test cases, we can execute their scenarios with the right variables using the existing orchestration engine and evaluate them.

Metrics and evaluation © Marcel Müller, 2025

This testing happens during the building phase. We have an additional screen for evaluating real user feedback in the production phase. Its contents are collected from real user feedback (through our engine and API).

The metrics that we have available in the live feedback section are collected from a user through a star rating.

Conclusion: Testing and Quality

In this article, we have looked into advanced testing and quality engineering concepts for generative AI applications, especially those that are more complex than simple chatbots. The PEEL framework introduced here is a new approach to scenario-based testing that is closer to the implementation level than the generic benchmarks with which we test models. For good applications, it is important to test the model not only in isolation, but also in its orchestration.

Get in touch with me

In my day-to-day work, I build real-world applications with generative AI, especially in the enterprise. If you want to connect, feel free to add me or send a message on LinkedIn.



Designing, Building & Deploying an AI Chat App from Scratch (Part 2)

Cloud Deployment and Scaling

Photo by Alex wong on Unsplash

1. Introduction

In the previous post, we built an AI-powered chat application on our local computer using microservices. Our stack included FastAPI, Docker, Postgres, Nginx and llama.cpp. The goal of this post is to learn more about the fundamentals of cloud deployment and scaling by deploying our app to Azure, making it available to real users. We’ll use Azure because they offer a free education account, but the process is similar for other platforms like AWS and GCP.

You can check out a live demo of the app at chat.jorisbaan.nl. Now, obviously, this demo isn’t very large-scale, because the costs ramp up very quickly. With the tight scaling limits I configured, I reckon it can handle about 10–40 concurrent users before I run out of Azure credits. However, I do hope it demonstrates the principles behind a scalable production system. We could easily configure it to scale to many more users with a higher budget.

I give a complete breakdown of our infrastructure and the costs at the end. The codebase is at https://github.com/jsbaan/ai-app-from-scratch.

A quick demo of the app at chat.jorisbaan.nl. We start a new chat, come back to that same chat, and start another chat.

1.1. Recap: local application

Let’s recap how we built our local app: A user can start or continue a chat with a language model by sending an HTTP request to http://localhost. An Nginx reverse proxy receives and forwards the request to a UI over a private Docker network. The UI stores a session cookie to identify the user, and sends requests to the backend: the language model API that generates text, and the database API that queries the database server.

Local architecture of the app. See part 1 for more details. Made by author in draw.io.

Table of contents

  1. Introduction
    1.1 Recap: local application
  2. Cloud architecture
    2.1 Scaling
    2.2 Kubernetes Concepts
    2.3 Azure Container Apps
    2.4 Azure architecture: putting it all together
  3. Deployment
    3.1 Setting up
    3.2 PostgreSQL server deployment
    3.3 Azure Container App Environment deployment
    3.4 Azure Container Apps deployment
    3.5 Scaling our Container Apps
    3.6 Custom domain name & HTTPS
  4. Resources & costs overview
  5. Roadmap
  6. Final thoughts
    Acknowledgements
    AI usage

2. Cloud architecture

Conceptually, our cloud architecture will not be too different from our local application: a bunch of containers in a private network with a gateway to the outside world, our users.

However, instead of running containers on our local computer with Docker Compose, we will deploy them to a computing environment that automatically scales across virtual or physical machines to many concurrent users.

2.1 Scaling

Scaling is a central concept in cloud architectures. It means being able to dynamically handle varying numbers of users (i.e., HTTP requests). Uvicorn, the web server running our UI and database API, can already handle about 40 concurrent requests. It’s even possible to use another web server called Gunicorn as a process manager that employs multiple Uvicorn workers in the same container, further increasing concurrency.
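
A Gunicorn configuration file is itself Python, so a minimal sketch of running several Uvicorn workers inside one container could look like this. The app.main:app module path is hypothetical; adjust it to wherever the FastAPI app object from Part 1 actually lives.

# gunicorn.conf.py - a minimal sketch, assuming a FastAPI app importable as app.main:app.
# Start with: gunicorn app.main:app -c gunicorn.conf.py

bind = "0.0.0.0:8000"                            # listen on all interfaces inside the container
workers = 4                                      # number of Uvicorn worker processes
worker_class = "uvicorn.workers.UvicornWorker"   # run each worker as a Uvicorn ASGI server
timeout = 60                                     # recycle workers that are stuck for more than 60 seconds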

Now, if we want to support even more concurrent requests, we could give each container more resources, like CPUs or memory (vertical scaling). However, a more reliable approach is to dynamically create copies (replicas) of a container based on the number of incoming HTTP requests or memory/CPU usage, and distribute the incoming traffic across replicas (horizontal scaling). Each replica container will be assigned an IP address, so we also need to think about networking: how to centrally receive all requests and distribute them over the container replicas.

This “prism” pattern is important: requests arrive centrally in some server (a load balancer) and fan out for parallel processing to multiple other servers (e.g., several identical UI containers).

Photo of two prisms by Fernando @cferdophotography on Unsplash

2.2 Kubernetes Concepts

Kubernetes is the industry standard system for automating deployment, scaling and management of containerized applications. Its core concepts are crucial to understand modern cloud architectures, including ours, so let’s quickly review the basics.

  • Node: A physical or virtual machine to run containerized app or manage the cluster.
  • Cluster: A set of Nodes managed by Kubernetes.
  • Pod: The smallest deployable unit in Kubernetes. Runs one main app container with optional secondary containers that share storage and networking.
  • Deployment: An abstraction that manages the desired state of a set of Pod replicas by deploying, scaling and updating them.
  • Service: An abstraction that manages a stable entrypoint (the service’s DNS name) to expose a set of Pods by distributing incoming traffic over the various dynamic Pod IP addresses. A Service has multiple types:
    - A ClusterIP Service exposes Pods within the Cluster
    - A LoadBalancer Service exposes Pods to outside the Cluster. It triggers the cloud provider to provision an external public IP and load balancer outside the cluster that can be used to reach the cluster. These external requests are then routed via the Service to individual Pods.
  • Ingress: An abstraction that defines more complex rules for a cluster’s entrypoint. It can route traffic to multiple Services; give Services externally-reachable URLs; load balance traffic; and handle secure HTTPS.
  • Ingress Controller: Implements the Ingress rules. For example, an Nginx-based controller runs an Nginx server (like in our local app) under the hood that is dynamically configured to route traffic according to Ingress rules. To expose the Ingress Controller itself to the outside world, you can use a LoadBalancer Service. This architecture is often used.

2.3 Azure Container Apps

Armed with these concepts, instead of deploying our app with Kubernetes directly, I wanted to experiment a little by using Azure Container Apps (ACA). This is a serverless platform built on top of Kubernetes that abstracts away some of its complexity.

With a single command, we can create a Container App Environment, which, under the hood, is an invisible Kubernetes Cluster managed by Azure. Within this Environment, we can run a container as a Container App that Azure internally manages as Kubernetes Deployments, Services, and Pods. See article 1 and article 2 for detailed comparisons.

A Container App Environment also auto-creates:

  1. An invisible Envoy Ingress Controller that routes requests to internal Apps and handles HTTPS and App auto-scaling based on request volume.
  2. An external Public IP address and Azure Load Balancer that routes external traffic to the Ingress Controller that in turn routes it to Apps (sounds similar to a Kubernetes LoadBalancer Service, eh?).
  3. An Azure-generated URL for each Container App that is publicly accessible over the internet or internal, based on its Ingress config.

This gives us everything we need to run our containers at scale. The only thing missing is a database. We will use an Azure-managed PostgreSQL server instead of deploying our own container, because it’s easier, more reliable and scalable. Our local Nginx reverse proxy container is also obsolete because ACA automatically deploys an Envoy Ingress Controller.

It’s interesting to note that we literally don’t have to change a single line of code in our local application; we can just treat it as a bunch of containers!

2.4 Azure architecture: putting it all together

Here is a diagram of the full cloud architecture for our chat application that contains all our Azure resources. Let’s take a high level look at how a user request flows through the system.

Azure architecture diagram. Made by author in draw.io.
  1. User sends HTTPS request to chat.jorisbaan.nl.
  2. A Public DNS server like Google DNS resolves this domain name to an Azure Public IP address.
  3. The Azure Load Balancer on this IP address routes the request to the (for us invisible) Envoy Ingress Controller.
  4. The Ingress Controller routes the request to the UI Container App, which routes it to one of its Replicas, where a UI web server is running.
  5. The UI web server makes requests to the database API and language model API Apps, which both route them to one of their Replicas.
  6. A database API replica connects to the PostgreSQL server by its hostname; the Azure Private DNS Zone resolves this hostname to the PostgreSQL server’s IP address.

3. Deployment

So, how do we actually create all this? Rather than clicking around in the Azure Portal, infrastructure-as-code tools like Terraform are the standard way to create and manage cloud resources. However, for simplicity, I will instead write a bash script of Azure CLI commands that deploys our entire application. You can find the full deployment script, including environment variables, here 🤖. We will now go through it step by step.

3.1 Setting up

We need an Azure account (I’m using a free education account), a clone of the https://github.com/jsbaan/ai-app-from-scratch repo, Docker to build and push the container images, the downloaded model, and the Azure CLI to start creating cloud resources.

We first create a resource group so our resources are easier to find, manage and delete. The --location parameter refers to the physical datacenter we’ll use to deploy our app’s infrastructure. Ideally, it is close to our users. We then create a private virtual network with 256 IP addresses to isolate, secure and connect our database server and Container Apps.

brew update && brew install azure-cli # for macos

echo "Create resource group"
az group create \
--name $RESOURCE_GROUP \
--location "$LOCATION"

echo "Create VNET with 256 IP addresses"
az network vnet create \
--resource-group $RESOURCE_GROUP \
--name $VNET \
--address-prefix 10.0.0.0/24 \
--location $LOCATION

3.2 PostgreSQL server deployment

Depending on the hardware, an Azure-managed PostgreSQL database server costs about $13 to $7000 a month. To communicate with Container Apps, we put the DB server within the same private virtual network but in its own subnet. A subnet is a dedicated range of IP addresses that can have its own security and routing rules.

We create the Azure PostgreSQL Flexible Server with private access. This means only resources within the same virtual network can reach it. Azure automatically creates a Private DNS Zone that manages a hostname for the database that resolves to its IP address. The database API will later use this hostname to connect to the database server.

We will randomly generate the database credentials and store them in a secure place: Azure KeyVault.

echo "Create subnet for DB with 128 IP addresses"
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--name $DB_SUBNET \
--vnet-name $VNET \
--address-prefix 10.0.0.128/25

echo "Create a key vault to securely store and retrieve secrets, \
like the db password"
az keyvault create \
--name $KEYVAULT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION

echo "Give myself access to the key vault so I can store and retrieve \
the db password"
az role assignment create \
--role "Key Vault Secrets Officer" \
--assignee $EMAIL \
--scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.KeyVault/vaults/$KEYVAULT"

echo "Store random db username and password in the key vault"
az keyvault secret set \
--name postgres-username \
--vault-name $KEYVAULT \
--value $(openssl rand -base64 12 | tr -dc 'a-zA-Z' | head -c 12)
az keyvault secret set \
--name postgres-password \
--vault-name $KEYVAULT \
--value $(openssl rand -base64 16)

echo "While we're at it, let's already store a secret session key for the UI"
az keyvault secret set \
--name session-key \
--vault-name $KEYVAULT \
--value $(openssl rand -base64 16)

echo "Create PostgreSQL flexible server in our VNET in its own subnet. \
Auto-creates Private DNS Zone."
POSTGRES_USERNAME=$(az keyvault secret show --name postgres-username --vault-name $KEYVAULT --query "value" --output tsv)
POSTGRES_PASSWORD=$(az keyvault secret show --name postgres-password --vault-name $KEYVAULT --query "value" --output tsv)
az postgres flexible-server create \
--resource-group $RESOURCE_GROUP \
--name $DB_SERVER \
--vnet $VNET \
--subnet $DB_SUBNET \
--location $LOCATION \
--admin-user $POSTGRES_USERNAME \
--admin-password $POSTGRES_PASSWORD \
--sku-name Standard_B1ms \
--tier Burstable \
--storage-size 32 \
--version 16 \
--yes

3.3 Azure Container App Environment deployment

With the network and database in place, let’s deploy the infrastructure to run containers — the Container App Environment (recall, this is a Kubernetes cluster under the hood).

We create another subnet with 128 IP addresses and delegate its management to the Container App Environment. The subnet should be big enough for every ten new replicas to get a new IP address in the subrange. We can then create the Environment. This is just a single command without much configuration.

echo "Create subnet for ACA with 128 IP addresses."
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--name $ACA_SUBNET \
--vnet-name $VNET \
--address-prefix 10.0.0.0/25

echo "Delegate the subnet to ACA"
az network vnet subnet update \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET \
--name $ACA_SUBNET \
--delegations Microsoft.App/environments

echo "Obtain the ID of our subnet"
ACA_SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--name $ACA_SUBNET \
--vnet-name $VNET \
--query id --output tsv)

echo "Create Container Apps Environment in our custom subnet.\\
By default, it has a Workload profile with Consumption plan."
az containerapp env create \
--resource-group $RESOURCE_GROUP \
--name $ACA_ENVIRONMENT \
--infrastructure-subnet-resource-id $ACA_SUBNET_ID \
--location $LOCATION

3.4 Azure Container Apps deployment

Each Container App needs a Docker image to run. Let’s first set up a Container Registry, and then build all our images locally and push them to the registry. Note that we simply copied the model file into the language model image using its Dockerfile, so we don’t need to mount external storage like we did for local deployment in part 1.

echo "Create container registry (ACR)"
az acr create \
--resource-group $RESOURCE_GROUP \
--name $ACR \
--sku Standard \
--admin-enabled true

echo "Login to ACR and push local images"
az acr login --name $ACR
docker build --tag $ACR.azurecr.io/$DB_API $DB_API
docker push $ACR.azurecr.io/$DB_API
docker build --tag $ACR.azurecr.io/$LM_API $LM_API
docker push $ACR.azurecr.io/$LM_API
docker build --tag $ACR.azurecr.io/$UI $UI
docker push $ACR.azurecr.io/$UI

Now, onto deployment. To create Container Apps we specify their Environment, container registry, image, and the port they will listen to for requests. The ingress parameter regulates whether Container Apps can be reached from the outside world. Our two APIs are internal and therefore completely isolated, with no public URL and no traffic ever routed from the Envoy Ingress Controller. The UI is external and has a public URL, but sends internal HTTP requests over the virtual network to our APIs. We pass these internal hostnames and db credentials as environment variables.

echo "Deploy DB API on Container Apps with the db credentials from the key \
vault as env vars. More secure is to use a managed identity that allows the \
container itself to retrieve them from the key vault. But for simplicity we \
simply fetch it ourselves using the CLI."
POSTGRES_USERNAME=$(az keyvault secret show --name postgres-username --vault-name $KEYVAULT --query "value" --output tsv)
POSTGRES_PASSWORD=$(az keyvault secret show --name postgres-password --vault-name $KEYVAULT --query "value" --output tsv)
az containerapp create --name $DB_API \
--resource-group $RESOURCE_GROUP \
--environment $ACA_ENVIRONMENT \
--registry-server $ACR.azurecr.io \
--image $ACR.azurecr.io/$DB_API \
--target-port 80 \
--ingress internal \
--env-vars "POSTGRES_HOST=$DB_SERVER.postgres.database.azure.com" "POSTGRES_USERNAME=$POSTGRES_USERNAME" "POSTGRES_PASSWORD=$POSTGRES_PASSWORD" \
--min-replicas 1 \
--max-replicas 5 \
--cpu 0.5 \
--memory 1

echo "Deploy UI on Container Apps, and retrieve the secret random session \
key the UI uses to encrypt session cookies"
SESSION_KEY=$(az keyvault secret show --name session-key --vault-name $KEYVAULT --query "value" --output tsv)
az containerapp create --name $UI \
--resource-group $RESOURCE_GROUP \
--environment $ACA_ENVIRONMENT \
--registry-server $ACR.azurecr.io \
--image $ACR.azurecr.io/$UI \
--target-port 80 \
--ingress external \
--env-vars "db_api_url=http://$DB_API" "lm_api_url=http://$LM_API" "session_key=$SESSION_KEY" \
--min-replicas 1 \
--max-replicas 5 \
--cpu 0.5 \
--memory 1

echo "Deploy LM API on Container Apps"
az containerapp create --name $LM_API \
--resource-group $RESOURCE_GROUP \
--environment $ACA_ENVIRONMENT \
--registry-server $ACR.azurecr.io \
--image $ACR.azurecr.io/$LM_API \
--target-port 80 \
--ingress internal \
--min-replicas 1 \
--max-replicas 5 \
--cpu 2 \
--memory 4 \
--scale-rule-name my-http-rule \
--scale-rule-http-concurrency 2

3.5 Scaling our Container Apps

Let’s take a look at how our Container Apps scale. Container Apps can scale to zero, which means they have zero replicas and stop running (and stop incurring costs). This is a feature of the serverless paradigm, where infrastructure is provisioned on demand. The invisible Envoy proxy handles scaling based on triggers, like concurrent HTTP requests. Spawning new replicas may take some time, which is called a cold start. We set the minimum number of replicas to 1 to avoid cold starts and the resulting timeout errors for first requests.

The default scaling rule creates a new replica whenever an existing replica receives 10 concurrent HTTP requests. This applies to the UI and the database API. To test whether this scaling rule makes sense, we would have to perform load testing to simulate real user traffic and see what each Container App replica can handle individually. My guess is that they can handle a lot more concurrent requests than 10, and we could relax the rule.

3.5.1 Scaling language model inference

Even with our small, quantized language model, inference requires much more compute than a simple FastAPI app. The inference server handles incoming requests sequentially, and the default Container App resources of 0.5 virtual CPU cores and 1GB memory result in very slow response times: up to 30 seconds for generating 128 tokens with a context window of 1024 (these parameters are defined in the LM API’s Dockerfile).

Increasing vCPU to 2 and memory to 4GB gives much better inference speed, and handles about 10 requests within 30 seconds. I configured the http scaling rule very tightly at 2 concurrent requests, so whenever 2 users chat at the same time, the LM API will scale out.

With 5 maximum replicas, I think this will allow for roughly 10–40 concurrent users, depending on the length of the chat histories. Now, obviously, this isn’t very large-scale, but with a higher budget, we could increase vCPUs, memory and the number of replicas. Ultimately we would need to move to GPU-based inference. More on that later.

3.6 Custom domain name & HTTPS

The automatically generated URL for the UI App looks like https://chat-ui.purplepebble-ac46ada4.germanywestcentral.azurecontainerapps.io/. This isn’t very memorable, so I want to make our app available as a subdomain on my website: chat.jorisbaan.nl.

I simply add two DNS records on my domain registrar’s portal (like GoDaddy): a CNAME record that links my chat subdomain to the UI’s URL, and a TXT record to prove ownership of the subdomain to Azure and obtain a TLS certificate.

# Obtain UI URL and verification code
URL=$(az containerapp show -n $UI -g $RESOURCE_GROUP -o tsv --query "properties.configuration.ingress.fqdn")
VERIFICATION_CODE=$(az containerapp show -n $UI -g $RESOURCE_GROUP -o tsv --query "properties.customDomainVerificationId")

# Add a CNAME record with the URL and a TXT record with the verification code to domain registrar
# (Do this manually)

# Add custom domain name to UI App
az containerapp hostname add --hostname chat.jorisbaan.nl -g $RESOURCE_GROUP -n $UI
# Configure managed certificate for HTTPS
az containerapp hostname bind --hostname chat.jorisbaan.nl -g $RESOURCE_GROUP -n $UI --environment $ACA_ENVIRONMENT --validation-method CNAME

Container Apps manages a free TLS certificate for my subdomain as long as the CNAME record points directly to the container’s domain name.

The public URL for the UI changes whenever I tear down and redeploy an Environment. We could use a fancier service like Azure Front Door or Application Gateway to get a stable URL and act as reverse proxy with additional security, global availability, and edge caching.

4. Resources & costs overview

Now that the app is deployed, let’s look at an overview of all the Azure resources the app uses. We created most of them ourselves, but Azure also automatically created a Load Balancer, Public IP, Private DNS Zone, Network Watcher and Log Analytics workspace.

Screenshot of all resources from Azure Portal.

Some resources are free, others are free up to a certain time or compute budget, which is part of the reason I chose them. The following resources incur the highest costs:

  • Load Balancer (Standard tier): free for 1 month, then $18/month.
  • Container Registry (Standard tier): free for 12 months, then $19/month.
  • PostgreSQL Flexible Server (Burstable B1ms compute tier): free for 12 months, then at least $13/month.
  • Container App: free for 50 CPU hours/month or 2M requests/month, then $10/month for an App with a single replica, 0.5 vCPUs and 1GB memory. The LM API with 2 vCPUs and 4GB memory costs about $50 per month for a single replica.

You can see that the costs of this small (but scalable) app can quickly add up to hundreds of dollars per month, even without a GPU server to run a stronger language model! That’s the reason why the app probably won’t be up when you’re reading this.

It also becomes clear that Azure Container Apps is more expensive than I initially thought: it requires a Standard-tier Load Balancer for automatic external ingress, HTTPS and auto-scaling. We could get around this by disabling external ingress and deploying a cheaper alternative — like a VM with a custom reverse proxy, or a Basic-tier Load Balancer. Still, a Standard-tier Kubernetes cluster would have cost at least $150/month, so ACA can be cheaper at small scale.

5. Roadmap

Now, before we wrap up, let’s look at just a few of the many directions to improve this deployment.

Continuous Integration & Continuous Deployment. I would set up a CI/CD pipeline that runs unit and integration tests and redeploys the app upon code changes. It might be triggered by a new git commit or merged pull request. This will also make it easier to see when a service isn’t deployed properly. I would also set up monitoring and alerting to be aware of issues quickly (like a crashing Container App instance).

Lower latency: the language model server. I would load test the whole app — simulating real-world user traffic — with something like Locust (see the sketch at the end of this section) or Azure Load Testing. Even without load testing, we have an obvious bottleneck: the LM server. Small and quantized as it is, it can still take quite a while to generate lengthy answers, with no concurrency. For more users it would be faster and more efficient to run a GPU inference server with a batching mechanism that collects multiple generation requests in a queue — perhaps with Kafka — and runs batch inference on chunks.

With even more users, we might want several GPU-based LM servers that consume from the same queue. For GPU infrastructure I’d look into Azure Virtual Machines or something more fancy like Azure Machine Learning.

The llama.cpp inference engine is good for single-user CPU-based inference. When moving to a GPU-server, I would look into inference engines more suitable to batch inference, like vLLM or Huggingface TGI. And, obviously, a better (bigger) model for increased response quality — depending on the use case.
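Coming back to the load-testing idea: a minimal Locust script could look like the sketch below. The endpoint path and form field are assumptions based on the UI described in part 1, not the project's actual test code.

# locustfile.py -- a minimal load-testing sketch (hypothetical endpoint and form field)
from locust import HttpUser, task, between

class ChatUser(HttpUser):
    # Each simulated user waits 1-5 seconds between actions
    wait_time = between(1, 5)

    @task
    def start_chat(self):
        # Load the homepage, then start a chat via the UI's form endpoint
        self.client.get("/")
        self.client.post("/chats", data={"username": "load-test-user"})

Running this against the deployed URL with a growing number of simulated users would show when replicas start scaling out and where response times degrade.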

6. Final thoughts

I hope this project offers a glimpse of what an AI-powered web app in production may look like. I tried to balance realistic engineering with cutting about every corner to keep it simple, cheap, understandable, and limit my time and compute budget. Sadly, I cannot keep the app live for long since it would quickly cost hundreds of dollars per month. If someone can help with Azure credits to keep the app running, let me know!

Some closing thoughts about using managed services: Although Azure Container Apps abstracts away some of the Kubernetes complexity, it’s still extremely useful to have an understanding of the lower-level Kubernetes concepts. The automatically created invisible infrastructure like Public IPs, Load balancers and ingress controllers add unforeseen costs and make it difficult to understand what’s going on. Also, ACA documentation is limited compared to Kubernetes. However, if you know what you’re doing, you can set something up very quickly.

Acknowledgements

I heavily relied on the Azure docs, and the ACA docs in particular. Thanks to Dennis Ulmer for proofreading and Lucas de Haas for useful discussion.

AI usage

I experimented a bit more with AI tools compared to part 1. I used Pycharm’s CoPilot plugin for code completion and had quite some back-and-forth with ChatGPT to learn about the Azure or Kubernetes ecosystem, and to spar about bugs. I double-checked everything in the docs and most of the information was solid. Like part 1, I did not use AI to write this post, though I did use ChatGPT to paraphrase some bad-running sentences.


Designing, Building & Deploying an AI Chat App from Scratch (Part 2) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.




Designing, Building & Deploying an AI Chat App from Scratch (Part 1)

Microservices Architecture and Local Development

Photo by Danist Soh on Unsplash

Introduction

The aim of this project is to learn about the fundamentals of modern, scalable web applications by designing, building and deploying an AI-powered chat app from scratch. We won’t use fancy frameworks or commercial platforms like ChatGPT. This will provide a better understanding of how real-world systems may work under the hood, and give us full control over the language model, infrastructure, data and costs. The focus will be on engineering, backend and cloud deployment, rather than the language model or a fancy frontend.

This is part 1. We will design and build a cloud-native app with several APIs, a database, private network, reverse proxy, and simple user interface with sessions. Everything runs on our local computer. In part 2, we will deploy our application to a cloud platform like AWS, GCP or Azure with a focus on scalability so actual users can reach it over the internet.

A quick demo of the app. We start a new chat, come back to that same chat, and start another chat. We will now build this app locally and make it available at localhost.

You can find the codebase at https://github.com/jsbaan/ai-app-from-scratch. Throughout this post I will link to specific lines of code with this hyperlink robot 🤖 (try it!)

Microservices and APIs

Modern web applications are often built using microservices — small, independent software components with a specific role. Each service runs in its own Docker container — an isolated environment independent of the underlying operating system and hardware. Services communicate with each other over a network using REST APIs.

You can think of a REST API as the interface that defines how to interact with a service by defining endpoints — specific URLs that represent the possible resources or actions, formatted like http://hostname:port/endpoint-name. Endpoints, also called paths or routes, are accessed with HTTP requests of various types, like GET to retrieve data or POST to create data. Parameters can be passed in the URL itself or in the request body or header.
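To make this concrete, here is a tiny, hypothetical example using Python’s httpx library; the hostname and endpoints are made up for illustration.

import httpx

# GET request: retrieve a resource, passing a parameter in the URL itself
response = httpx.get("http://localhost:8001/chats/123")
print(response.status_code, response.json())

# POST request: create a resource, passing parameters in the JSON request body
response = httpx.post("http://localhost:8001/chats", json={"username": "alice"})
print(response.json())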

Architecture

Let’s make this more concrete. We want a web page where users can chat with a language model and come back to their previous chats. Our architecture will look like this:

Local architecture of the app. Each service runs in its own Docker container and communicates over a private network. Made by author in draw.io.

The above architecture diagram shows how a user’s HTTP request to localhost on the left flows through the system. We will discuss and set up each individual service, starting with the backend services on the right. Finally, we discuss communication, networking and container orchestration.

The structure of this post follows the components in our architecture (click to jump to the section):

  1. Language model API. A llama.cpp language model inference server running the quantized Qwen2.5–0.5B-Instruct model 🤖.
  2. PostgreSQL database server. A database that stores chats and messages 🤖.
  3. Database API. A FastAPI and Uvicorn Python server that queries the PostgreSQL database 🤖.
  4. User interface. A FastAPI and Uvicorn Python server that serves HTML and supports session-based authentication 🤖.
  5. Private Docker network. For communication between microservices 🤖.
  6. Nginx reverse proxy. A gateway between the outside world and network-isolated services 🤖.
  7. Docker Compose. A container orchestration tool to easily run and manage our services together 🤖.

1. Language model API

Setting up the actual language model is pretty easy, nicely demonstrating that ML engineering is usually more about engineering than ML. Since I want our app to run on a laptop, model inference should be fast and CPU-based with low memory.

I looked at several inference engines, like Fastchat with vLLM or Huggingface TGI, but went with llama.cpp because it’s popular, fast, lightweight and supports CPU-based inference. Llama.cpp is written in C/C++ and conveniently provides a Docker image with its inference engine and a simple web server that implements the popular OpenAI API specification. It comes with a basic UI for experimenting, but we’ll build our own UI shortly.

As for the actual language model, I chose the quantized Qwen2.5–0.5B-Instruct model from Alibaba Cloud, whose responses are surprisingly coherent given how small it is.

1.1 Running the language model API

The beauty of containerized applications is that, given a Docker image, we can have it running in seconds without installing any packages. The docker run command below pulls the llama.cpp server image, mounts the model file that we downloaded earlier to the container’s filesystem, and runs a container with the llama.cpp server listening for HTTP requests at port 80. It uses flash attention and has a max generation length of 512 tokens.

# publishing port 8000:80 makes the API accessible on localhost:8000
docker run \
--name lm-api \
--volume $PROJECT_PATH/lm-api/gguf_models:/models \
--publish 8000:80 \
ghcr.io/ggerganov/llama.cpp:server \
-m /models/qwen2-0_5b-instruct-q5_k_m.gguf --port 80 --host 0.0.0.0 --predict 512 --flash-attn

Ultimately we will use Docker Compose to run this container together with the others 🤖.

1.2 Accessing the container

Since Docker containers are completely isolated from everything else on their host machine, i.e., our computer, we can’t reach our language model API yet.

However, we can break through a bit of networking isolation by publishing the container’s port 80 to our host machine’s port 8000 with --publish 8000:80 in the docker run command. This makes the llama.cpp server available at http://localhost:8000.

The hostname localhost resolves to the loopback IP address 127.0.0.1 and is part of the loopback network interface that allows a computer to communicate with itself. When we visit http://localhost:8000, our browser sends an HTTP GET request to our own computer on port 8000, which gets forwarded to the llama.cpp container listening at port 80.

1.3 Testing the language model API

Let’s test the language model server by sending a POST request with a short chat history.

curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "assistant", "content": "Hello, how can I assist you today?"},
{"role": "user", "content": "Hi, what is an API?"}
],
"max_tokens": 10
}'

The response is JSON and the generated text is under choices[0].message.content: “An API (Application Programming Interface) is a specification…”.

Perfect! Ultimately, our UI service will be the one to send requests to the language model API, and define the system prompt and opening message 🤖.

2. PostgreSQL database server

Next, let’s look into storing chats and messages. PostgreSQL is a powerful, open-source relational database, and running a PostgreSQL server locally is just another docker run command using its official image. We’ll pass some extra environment variables to configure the database username and password.

docker run --name db --publish 5432:5432 --env POSTGRES_USER=myuser --env POSTGRES_PASSWORD=mypassword postgres

After publishing port 5432, the database server is available on localhost:5432. PostgreSQL uses its own protocol for communication and doesn’t understand HTTP requests, so we use a PostgreSQL client utility like pg_isready (or psql) to test the connection.

pg_isready -U myuser -h localhost -d postgres
> localhost:5432 - accepting connections

When we deploy our application in part 2, we will use a database managed by a cloud provider to make our lives easier and add more security, reliability and scalability. However, setting one up locally like this is useful for local development and, perhaps later on, integration tests.

3. Database API

Databases often have a separate API server sitting in front to control access, enforce extra security, and provide a simple, standardized interface that abstracts away the database’s complexity.

We will build this API from scratch with FastAPI, a modern framework for building fast, production-ready Python APIs. We will run the API with Uvicorn, a high-performance Python web server that handles things like network communication and simultaneous requests.

3.1 Quick FastAPI example

Let’s quickly get a feeling for FastAPI and look at a minimal example app with a single GET endpoint /hello.

from fastapi import FastAPI

# FastAPI app object that the Uvicorn web server will load and serve
my_app = FastAPI()

# Decorator telling FastAPI that function below handles GET requests to /hello
@my_app.get("/hello")
def read_hello():
    # Define this endpoint's response
    return {"Hello": "World"}

We can serve our app at http://localhost:8080 by running the Uvicorn server.

uvicorn main:my_app --host 0.0.0.0 --port 8080

If we now send a GET request to our endpoint by visiting http://localhost:8080/hello in our browser, we receive the JSON response {"Hello": "World"} !

3.2 Connecting to the database

On to the actual database API. We define four endpoints in main.py 🤖 for creating or fetching chats and messages. You get a nice visual summary of these in the auto-generated docs, see below. The UI will call these endpoints to process user data.

A cool feature of FastAPI is that it automatically generates interactive documentation according to the OpenAPI specification with Swagger. If the Uvicorn server is running we can find it at http://hostname:port/docs. This is a screenshot of the doc web page.

The first thing we need to do is to connect the database API to the database server. We use SQLAlchemy, a popular Python SQL toolkit and Object-Relational Mapper (ORM) that abstracts away writing manual SQL queries.

We establish this connection in database.py 🤖 by creating the SQLAlchemy engine with a connection URL that includes the database hostname, username and password (remember, we configured these by passing them as environment variables to the PostgreSQL server). We also create a session factory that creates a new database session for each request to the database API.
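In essence, database.py boils down to something like the sketch below. This is a simplified version; the exact code is in the linked file, and the database name at the end of the URL is an assumption.

import os

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Build the connection URL from the environment variables passed to the container
DATABASE_URL = (
    f"postgresql://{os.environ['POSTGRES_USERNAME']}:{os.environ['POSTGRES_PASSWORD']}"
    f"@{os.environ['POSTGRES_HOST']}/postgres"
)

# The engine manages the actual connections to the PostgreSQL server
engine = create_engine(DATABASE_URL)

# Session factory: each request to the database API gets its own session
SessionLocal = sessionmaker(bind=engine, autoflush=False)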

3.3 Interacting with the database

Now let’s design our database. We define two SQLAlchemy data models in models.py 🤖 that will be mapped to actual database tables. The first is a Message model 🤖 with an id, content, speaker role, owner_id, and session_id (more on this later). The second is a Chat model 🤖, which I’ll show here to get a better feeling for SQLAlchemy models:

class Chat(Base):
    __tablename__ = "chats"

    # Unique identifier for each chat that will be generated automatically.
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)

    # Username associated with the chat. Index is created for faster lookups.
    username = Column(String, index=True)

    # Session ID associated with the chat.
    # Used to "scope" chats, i.e., users can only access chats from their session.
    session_id = Column(String, index=True)

    # The relationship function links the Chat model to the Message model.
    # The back_populates flag creates a bidirectional relationship.
    messages = relationship("Message", back_populates="owner")

Database tables are typically created using migration tools like Alembic, but we’ll simply ask SQLAlchemy to create them in main.py 🤖.

Next, we define the CRUD (Create, Read, Update, Delete) methods in crud.py 🤖. These methods use a fresh database session from our factory to query the database and create new rows in our tables. The endpoints in main.py will import and use these CRUD methods 🤖.
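As an illustration, the create function for chats might look roughly like this; it is a simplified sketch, see crud.py for the real implementation.

from sqlalchemy.orm import Session

from . import models, schemas

def create_chat(db: Session, chat: schemas.ChatCreate) -> models.Chat:
    # Map the validated request body onto a SQLAlchemy model and persist it
    db_chat = models.Chat(username=chat.username, session_id=chat.session_id)
    db.add(db_chat)
    db.commit()
    db.refresh(db_chat)  # reload generated fields such as the UUID primary key
    return db_chat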

3.4 Endpoint request and response validation

FastAPI is heavily based on Python’s type annotations and the data validation library Pydantic. For each endpoint, we can define a request and response schema that defines the input/output format we expect. Each request to or response from an endpoint is automatically validated and converted to the right data type and included in our API’s automatically generated documentation. If something about a request or response is missing or wrong, an informative error is thrown.

We define the Pydantic schemas for the database-api in schemas.py 🤖 and use them in the endpoint definitions in main.py 🤖, too. For example, this is the endpoint to create a new chat:

@app.post("/chats", response_model=schemas.Chat)
async def create_chat(chat: schemas.ChatCreate, db: Session = Depends(get_db)):
    db_chat = crud.create_chat(db, chat)
    return db_chat

We can see that it expects a ChatCreate request body and Chat response body. FastAPI verifies and converts the request and response bodies according to these schemas 🤖:

class ChatCreate(BaseModel):
    username: str
    messages: List[MessageCreate] = []
    session_id: str

class Chat(ChatCreate):
    id: UUID
    messages: List[Message] = []

Note: our SQLAlchemy models for the database should not be confused with these Pydantic schemas for endpoint input/output validation.

3.5 Running the database API

We can serve the database API using Uvicorn, making it available at http://localhost:8001.

cd $PROJECT_PATH/db-api
uvicorn app.main:app --host 0.0.0.0 --port 8001

To run the Uvicorn server in its own Docker container, we create a Dockerfile 🤖 that specifies how to incrementally build the Docker image. We can then build the image and run the container, again making the database API available at http://localhost:8001 after publishing the container’s port 80 to host port 8001. We pass the database credentials and hostname as environment variables.

docker build --tag db-api-image $PROJECT_PATH/db-api
docker run --name db-api --publish 8001:80 --env POSTGRES_USERNAME=<username> --env POSTGRES_PASSWORD=<password> --env POSTGRES_HOST=<hostname> db-api-image

4. User interface

With the backend in place, let’s build the frontend. A web interface typically consists of HTML for structure, CSS for styling and Javascript for interactivity. Frameworks like React, Vue, and Angular use higher-level abstractions like JSX that can ultimately be transformed into HTML, CSS, and JS files to be bundled and served by a web server like Nginx.

Since I want to focus on the backend, I hacked together a simple UI with FastAPI. Instead of JSON responses, its endpoints now return HTML based on template files that are rendered by Jinja, a templating engine that replaces variables in the template with real data like chat messages.
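For instance, rendering a chat page with Jinja in FastAPI looks roughly like the sketch below; the template name and the empty message list are placeholders, not the repo’s exact code.

from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="app/templates")

@app.get("/chats/{chat_id}")
async def chat_page(request: Request, chat_id: str):
    messages = []  # in the real app, fetched from the database API
    # Jinja fills the template's placeholders with these values and returns HTML
    return templates.TemplateResponse(
        "chat.html", {"request": request, "chat_id": chat_id, "messages": messages}
    )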

To handle user input and interact with the backend (e.g., retrieve chat history from the database API or generate a reply via the language model API), I’ve avoided JavaScript altogether by using HTML forms 🤖 that trigger internal POST endpoints. These endpoints then simply use Python’s httpx library to make HTTP requests 🤖.

Endpoints are defined in main.py 🤖, HTML templates are in the app/templates directory 🤖, and the static CSS file for styling the pages is in the app/static directory 🤖. FastAPI serves the CSS file at http://hostname/static/style.css so the browser can find it.

Screenshot of the UI’s interactive documentation.

4.1 Homepage

The homepage allows users to enter their name to start or return to a chat 🤖. The submit button triggers a POST request to the internal /chats endpoint with username as form parameter, which calls the database API to create a new chat and then redirects to the Chat Page 🤖.

4.2 Chat Page

The chat page calls the database API to retrieve the chat history 🤖. Users can then enter a message that triggers a POST request to the internal /generate/{chat_id} endpoint with the message as form parameter 🤖.

The generate endpoint calls the database API to add the user’s message to the chat history, and then the language model API with the full chat history to generate a reply 🤖. After adding the reply to the chat history, the endpoint redirects to the chat page, which again retrieves and displays the latest chat history. We send the POST request to the LM API using httpx, but we could use a more standardized LM API package like langchain to invoke its completion endpoint.
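Simplified, the call to the language model API is just an OpenAI-style chat completion request over the private network. The function below is an illustrative sketch, not the repo’s exact code.

import httpx

LM_API_URL = "http://lm-api"  # container name resolved by Docker DNS on the private network

async def generate_reply(chat_history: list[dict]) -> str:
    async with httpx.AsyncClient(base_url=LM_API_URL, timeout=60.0) as client:
        response = await client.post(
            "/v1/chat/completions",
            json={"messages": chat_history, "max_tokens": 256},
        )
    # The generated text sits in the first element of the "choices" list
    return response.json()["choices"][0]["message"]["content"]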

4.3 Authentication & user sessions

So far, all users can access all endpoints and all data. This means anyone can see your chat given your username or chat id. To remedy that, we will use session-based authentication and authorization.

We will store a first-party, GDPR-compliant signed session cookie in the user’s browser 🤖. This is essentially a dict-like object, signed to prevent tampering, that travels in the request/response headers. The user’s browser sends that session cookie with each request to our hostname, so that we can identify and verify a user and show them only their own chats.

As an extra layer of security, we “scope” the database API such that each chat row and each message row in the database contains a session id. For each request to the database API, we include the current user’s session id in the request header and query the database with both the chat id (or username) AND that session id. This way, the database can only ever return chats for the current user with its unique session id 🤖.
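The mechanics look roughly like this with Starlette’s SessionMiddleware, which FastAPI supports out of the box; the secret key and endpoint below are placeholders rather than the project’s actual code.

import uuid

from fastapi import FastAPI, Request
from starlette.middleware.sessions import SessionMiddleware

app = FastAPI()
# Stores a signed (tamper-proof) session cookie in the user's browser; requires the itsdangerous package
app.add_middleware(SessionMiddleware, secret_key="change-me")  # key would come from configuration

@app.get("/")
async def homepage(request: Request):
    # Assign a random session id on the first visit; the browser sends it back on every request
    if "session_id" not in request.session:
        request.session["session_id"] = str(uuid.uuid4())
    return {"session_id": request.session["session_id"]}

The UI would then forward this session id on every call to the database API so that queries are scoped to the current user’s session.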

4.4 Running the UI

To run the UI in a Docker container, we follow the same recipe as the database API, adding the hostnames of the database API and language model API as environment variables.

docker build --tag chat-ui-image $PROJECT_PATH/chat-ui
docker run --name chat-ui --publish 8002:80 --env LM_API_URL=<hostname1> --env DB_API_URL=<hostname2> chat-ui-image

How do we know the hostnames of the two APIs? We will look at networking and communication next.

5. Private Docker network

Let’s zoom out and take a look at our architecture again. By now, we have four containers: the UI, DB API, LM API, and PostgreSQL database. What’s missing is the network, reverse proxy and container orchestration.

Our app’s microservices architecture. Made by author in draw.io.

Until now, we used our computer’s localhost loopback network to send requests to an individual container. This was possible because we published their ports to our localhost. However, for containers to communicate with each other, they must be connected to the same network and know each other’s hostname/IP address and port.

We will create a user-defined bridge Docker network that provides automatic DNS resolution. This means that container names are resolved to the containers’ dynamic IP addresses. The network also provides isolation, and therefore security: you have to be on the same network to reach our containers.

docker network create --driver bridge chat-net

We connect all containers to it by adding --network chat-net to their docker run command. Now, the database API can reach the database at db:5432, and the UI can reach the database API at http://db-api and the language model API at http://lm-api. Port 80 is the default for HTTP, so we can omit it.

6. Nginx reverse proxy

Now, how do we — the user — reach our network-isolated containers? During development we published the UI container port to our localhost, but in a realistic scenario you typically use a reverse proxy. This is another web server that acts like a gateway, forwarding HTTP requests to containers in their private network, enforcing security and isolation.

Nginx is a web server often used as a reverse proxy. We can easily run it using its official Docker image. We also mount a configuration file 🤖 in which we specify how Nginx should route incoming requests. As an example, the simplest possible configuration forwards all requests (location /) from the Nginx container’s port 80 to the UI container at http://chat-ui.

events {}  # nginx requires an (empty) events block
http {
  server {
    listen 80;
    location / {
      proxy_pass http://chat-ui;
    }
  }
}

Since the Nginx container is in the same private network, we can’t reach it either. However, we can publish its port so it becomes the only access point of our entire app 🤖. A request to localhost now goes to the Nginx container, which forwards it to the UI and sends the UI’s response back to us.

docker run --network chat-net --publish 80:80 --volume $PROJECT_PATH/nginx.conf:/etc/nginx/nginx.conf nginx

In part 2 we will see that these gateway servers can also distribute incoming requests over copies of the same containers (load balancing) for scalability; enable secure HTTPS traffic; and do advanced routing and caching. We will use an Azure-managed reverse proxy rather than this Nginx container, but I think it’s very useful to understand how they work and how to set one up yourself. Running your own can also be significantly cheaper than a managed reverse proxy.

7. Docker Compose

Let’s put everything together. Throughout this post we manually pulled or built each image and ran its container. However, in the codebase I’m actually using Docker Compose: a tool designed to define, run and stop multi-container applications on a single host like our computer.

To use Docker Compose, we simply specify a compose.yml file 🤖 with build and run instructions for each service. A cool feature is that it automatically creates a user-defined bridge network to connect our services. Docker DNS will resolve the service names to container IP addresses.

Inside the project directory we can start all services with a single command:

docker compose up --build

Final thoughts

That wraps it up! We built an AI-powered chat web application that runs on our local computer, learning about microservices, REST APIs, FastAPI, Docker (Compose), reverse proxies, PostgreSQL databases, SQLAlchemy, and llama.cpp. We’ve built it with a cloud-native architecture in mind so we can deploy our app without changing a single line of code.

We will discuss deployment in part 2 and cover Kubernetes, the industry-standard container orchestration tool for large-scale applications across multiple hosts; Azure Container Apps, a serverless platform that abstracts away some of Kubernetes’ complexities; and concepts like load balancing, horizontal scaling, HTTPS, etc.

Roadmap

There is a lot we could do to improve this app. Here are some things I would work on given more time.

Language model. We now use a very general instruction-tuned language model as virtual assistant. I originally started this project to have a “virtual representation of me” on my website for visitors to discuss my research with, based on my scientific publications. For such a use case, an important direction is to improve and tweak the language model output. Perhaps that’ll become a part 3 of this series in the future.

Frontend. Instead of a quick FastAPI UI, I’d build a proper frontend using something like React, Angular or Vue to allow things like streaming LM responses and dynamic views rather than reloading the page every time. A more lightweight alternative that I’d like to experiment with is htmx, a library that provides modern browser features directly from HTML rather than JavaScript. It would be pretty straightforward to implement LM response streaming, for example.

Reliability. To make the system more mature, I’d add unit and integration tests and a better database setup, with Alembic for migrations.

Acknowledgements

Thanks to Dennis Ulmer, Bryan Eikema and David Stap for initial feedback or proofreading.

AI usage

I used Pycharm’s CoPilot plugin for code completion, and ChatGPT for a first version of the HTML and CSS template files. Towards the end, I started experimenting more with debugging and sparring too, which proved surprisingly useful. For example, I used it to learn about Nginx configurations and session cookies in FastAPI. I did not use AI to write this post, though I did use ChatGPT to paraphrase a few bad-running sentences.

Here are some additional resources that I found useful during this project.


Designing, Building & Deploying an AI Chat App from Scratch (Part 1) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


