Top 5 Geospatial Data APIs for Advanced Analysis

Explore Overpass, Geoapify, Distancematrix.ai, Amadeus, and Mapillary for Advanced Mapping and Location Data

Photo by Kyle Glenn on Unsplash (Source: https://unsplash.com/es/@kylejglenn)

Geographic data is important in many analyses, enabling us to make decisions based on location and spatial patterns. Examples of projects where geodata comes in handy include predicting house prices, optimizing transportation routes, or establishing a marketing strategy for a business.

However, as a data scientist, you will frequently face the challenge of where to obtain this data. Public sources of geographic data do exist, but in many cases the information they provide falls short of what the analyses we want to perform actually require.

This article will evaluate five of the most useful APIs for obtaining large-scale geographic data. We will assess their usage, advantages and disadvantages, and the main applications of the information they provide. Think of this article as a fundamental foundation for the use and applications of these APIs, so that you can later delve deeper into all the tools they offer.

1. Overpass

The Overpass API allows access to the information available on the OpenStreetMap website. OpenStreetMap is an open geographic database, containing a wide range of geospatial data, from information about underground routes to road, mountain, or river locations.

The data available in OpenStreetMap is open and maintained by users across the globe, so the level of completeness depends heavily on the region: areas with more active users simply have more data on the platform. In most cases, however, the degree of completeness is high, allowing us to gather a lot of information for our geographic analyses.

The Overpass API uses a language called Overpass QL for designing the queries for accessing the data available on OpenStreetMap. This highly customizable language allows us to create specific queries to access only the information of interest for our analysis from the platform.

Advantages

  • Completely free: because OpenStreetMap is an open database, the use of the API is also completely free.
  • Flexible queries: queries can be highly customized using the Overpass QL language to access only the information of interest. Other aspects, such as the data output format, can also be customized in the query. Also, through the query, you can easily filter the geographic data you want to obtain, as well as the search area for such data.
  • Global data: OpenStreetMap contains global data, as a consequence, the information accessible through the API is not limited to specific regions.

Disadvantages

  • Quality of the returned API data: the OpenStreetMap platform, as mentioned before, is an open website maintained by volunteers. Therefore, the data quality depends on the users, which as a result can lead to incomplete data in certain regions where user activity is low.
  • Necessary Learning for Query Construction: queries in the Overpass API are made using a language called Overpass QL, whose learning process can be particularly slow at first when one is not familiar with the language.
  • Post-processing requirement: the data returned by the API, whether in CSV or JSON format, contains the coordinates of the geographic elements but does not provide processed polygons or multi-polygons that we can use directly in our analysis. Therefore, we need to convert the raw data to obtain the different polygons.

License

OSM data is free to use for any purpose, including commercial use; its use is governed by OpenStreetMap's distribution license, the ODbL.

FAQ

Use Case — Retrieving Bus Stops in Cuxhaven

The following example shows how we can obtain all bus stops located in Cuxhaven using the Overpass API. Cuxhaven is a small town located in the northern part of Germany, on the shore of the North Sea.

The following code displays the query and the endpoint used to access the information via the API. The query specifies the search area (Cuxhaven), the type of element being searched (bus_stop), and the output format (json).

The get_overpass_data function is a generic function that can be used with any query to get a response from the Overpass API.
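A minimal sketch of what the query and helper function could look like is shown below. The endpoint is the public Overpass interpreter; the exact Overpass QL text is an assumption on my part and may differ from the original query.

import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# Overpass QL: select the Cuxhaven area, then all nodes tagged as bus stops inside it,
# and return the result as JSON
overpass_query = """
[out:json];
area[name="Cuxhaven"]->.searchArea;
node["highway"="bus_stop"](area.searchArea);
out body;
"""

def get_overpass_data(query: str) -> dict:
    """Send an Overpass QL query to the public Overpass endpoint and return the JSON response."""
    response = requests.get(OVERPASS_URL, params={"data": query})
    response.raise_for_status()
    return response.json()

bus_stops = get_overpass_data(overpass_query)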

Overpass API Output (Image Created by the Author)

The bus stops are found under the elements key of the API response. Each element specifies the latitude and longitude of a bus stop along with its name. Next, we visualize the results with Folium, but this information could of course also be used for many other analyses, such as studying how proximity to a bus stop affects housing prices in Cuxhaven.
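A short sketch of that visualization, assuming the JSON response obtained above (the map center coordinates are approximate):

import folium

# Center the map roughly on Cuxhaven (approximate coordinates)
m = folium.Map(location=[53.86, 8.70], zoom_start=13)

# Each element carries its latitude, longitude, and (optionally) a name tag
for element in bus_stops["elements"]:
    folium.Marker(
        location=[element["lat"], element["lon"]],
        popup=element.get("tags", {}).get("name", "Bus stop"),
    ).add_to(m)

m.save("cuxhaven_bus_stops.html")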

Bus Stops in Cuxhaven (Image Created by the Author)

Insightful Articles

If you want to learn more about the potential of the Overpass API for obtaining information from OpenStreetMap, read the following articles. The first article shows how to obtain the administrative regions (neighborhoods) of a city from the API. As seen in the article, post-processing of the data is necessary to obtain the polygons, as the API does not provide them directly.

Obtaining Geospatial Polygons for Administrative Areas of Munich via the Overpass API

The second article shows how to obtain subway routes in a city. This use case is more complex than the one shown earlier, as it not only involves defining the locations of the subway stations, but also how the stations are connected to form the various routes.

Subway Route Data Extraction with Overpass API: A Step-by-Step Guide

As you can see, with the Overpass API the range of data that can be retrieved is huge: everything available in OpenStreetMap can be retrieved with the proper query.

2. Distancematrix.ai

I recently discovered this API, and it is especially useful for geographic analyses where distances and travel times over the actual road network play a role.

Distancematrix.ai | Compute the distance and travel time between points

The distancematrix.ai API primarily offers two services: (1) calculations of the distance and travel time between points on a map, and (2) geocoding services. Therefore, its APIs are divided into the following groups:

  • The Distance Matrix APIs: this set of APIs allows calculating the duration of routes in terms of distance and time, taking into account traffic conditions. Distances can be obtained for various modes of transportation, including car, public transport, or walking.
  • The Geocoding APIs: this set of APIs allows for geocoding and reverse geocoding, meaning translating addresses into geographic coordinates (latitude and longitude) and vice versa.

Advantages

  • Availability of a free tier: although the API is not completely free, it offers a free plan with 1,000 monthly elements for the 4 available APIs. An element is considered the retrieval of a distance between a point of origin and a destination.
  • Scalability: the API allows for making calls with multiple origins and destinations simultaneously to obtain distances in a matrix format. This contributes to the execution of large-scale projects.
  • Modes of travel: the API allows you to obtain distances and times for 4 modes of transportation: driving, walking, bicycling, and transit (public transport).

Disadvantages

  • Direct competitors: the Google Distance Matrix API is a much more established direct competitor with data obtained from the Google platform.

License

On their support page, there is a list of industries that could benefit from the use of the API. Commercial use is allowed; however, no specific license is mentioned.

Support page Distancematrix.ai

Use Case — The Distance from the Hotels in Barcelona to the Airport

When conducting a proximity analysis, a straight-line distance gives an approximation, but not an exact measure of how long it will take to travel between two locations.

To obtain this, it is necessary to consider the road network or public transportation. This information is precisely what we can obtain with one of the endpoints available in the distancematrix.ai API.

The analysis we are going to conduct evaluates the distance from all the hotels in Barcelona to El Prat Airport, which lies just south of the city and very close to it. We want to assess how much time is needed on public transportation to travel from each hotel to the airport, given that we are traveling in the morning and do not want to spend too much time on the trip.

Firstly, we need to obtain a list of all the hotels in the city. This information is available in the city's open data portal, which includes a dataset with information about the hotels located in Barcelona, including their location. You can download the dataset used in this analysis from the following link.

Hotels in the city of Barcelona - Open Data Barcelona

The file is read, selecting only the relevant information for the analysis.
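As a rough sketch of this step (the file name and column names below are hypothetical; adapt them to the actual schema of the Open Data Barcelona dataset):

import pandas as pd

# Hypothetical file and column names; the real dataset uses its own schema
hotels = pd.read_csv("barcelona_hotels.csv")
hotels = hotels[["name", "latitude", "longitude"]].dropna()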

DataFrame with Hotels Located in Barcelona (Image Created by the Author)

The location of the airport will be retrieved from Google Maps. This can be done easily by navigating to the place of interest and right-clicking on it; the latitude and longitude are then displayed.

Latitude and longitude of the Barcelona Airport Subway Station (Image Created by the Author)

We now have the necessary information to build the pipeline. It is quite simple; its most relevant aspects are listed below, and a code sketch follows the list:

  • For each hotel in the DataFrame, we execute an API call, providing the origin and destination coordinates.
  • The selected transportation mode is transit. This mode of transport refers to public transportation.
  • We scheduled the departure time for 8 in the morning.
  • The travel distances and durations to the airport are stored in a DataFrame, along with the hotel names. Finally, this information is merged into the original DataFrame.
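Here is a minimal sketch of such a pipeline. The endpoint and parameter names follow the distancematrix.ai documentation as I understand it, the airport coordinates and departure date are approximate placeholders, and the hotel column names reuse the hypothetical ones from the snippet above, so treat the details as assumptions rather than the exact original code.

import requests
import pandas as pd
from datetime import datetime, timezone

API_KEY = "<YOUR_DISTANCEMATRIX_AI_KEY>"
ENDPOINT = "https://api.distancematrix.ai/maps/api/distancematrix/json"

# Airport subway station coordinates taken from Google Maps (approximate)
AIRPORT = "41.2974,2.0833"

# Departure at 8 in the morning, expressed as a Unix timestamp (arbitrary date, assumed format)
departure = int(datetime(2024, 6, 1, 8, 0, tzinfo=timezone.utc).timestamp())

results = []
for _, hotel in hotels.iterrows():
    params = {
        "origins": f"{hotel['latitude']},{hotel['longitude']}",
        "destinations": AIRPORT,
        "mode": "transit",            # public transportation
        "departure_time": departure,
        "key": API_KEY,
    }
    response = requests.get(ENDPOINT, params=params).json()
    element = response["rows"][0]["elements"][0]
    results.append({
        "name": hotel["name"],
        "distance": element["distance"]["text"],   # e.g. "14.2 km"
        "duration": element["duration"]["text"],   # e.g. "19 mins"
    })

# Merge the travel distances and durations back into the original DataFrame
distances = pd.DataFrame(results)
hotels = hotels.merge(distances, on="name")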
Hotel Dataset with Distance and Travel Time to the Airport (Image Created by the Author)

I would suggest looking at the API documentation. There it says, for example, that we can query multiple origin and destination combinations at a time by specifying any number of destinations and/or origins with the ‘or’ operator. However, in our pipeline, we configured only one destination and one origin per API call.

After obtaining the distances and travel times between the hotels and the airport, we transform these columns into numeric values by removing the 'km' and 'mins' indicators. Then, the dataset is ready for visualizing the results.
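For example (reusing the hypothetical column names from the sketch above, and assuming simple "X km" / "Y mins" strings):

# Strip the unit suffixes and convert the columns to numeric values
hotels["distance_km"] = hotels["distance"].str.replace(" km", "", regex=False).astype(float)
hotels["duration_min"] = hotels["duration"].str.replace(" mins", "", regex=False).astype(int)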

Hotel Dataset with Distance and Travel Time to the Airport (Image Created by the Author)

The following map shows the obtained results. Two city areas are particularly well connected to the airport by public transportation: the Sants station and its surroundings, and the Passeig de Gràcia area. In contrast, the northern part of the city near the beach is particularly poorly connected to the airport.

Travel Distance Time in Minutes from Barcelona Hotels to the Airport (Image Created by the Author)

As seen in the image below, it only takes 19 minutes to get to the airport from the Hotel Barcelo Sants.

Hotels Around the Sants Station (Image Created by the Author)

This is an elementary example of how we can use such valuable information as the time needed between an origin and a destination. However, the potential of this information is much greater and can be used in a wide range of studies, from the logistics sector to package delivery analysis. On the distancematrix.ai website, we can read about more detailed use cases.

3. Geoapify

Geoapify is a platform that offers a set of APIs for a wide range of geospatial services.

Geoapify Location Platform: Maps, Geocoding, Routing, and APIs

Its available services and APIs are divided into five main services:

  • Maps: this service contains APIs for obtaining high-resolution images of maps, which can be used for geospatial analysis reports or marketing presentations. The maps are highly customizable, allowing the addition of markers and geometries to highlight specific areas.
  • Address & Location: this service contains APIs for geocoding and reverse geocoding, allowing the conversion of addresses to latitude/longitude and vice versa. Additionally, Geoapify provides a service for address auto-completion, which returns a standardized address from a free-form address input.
  • Routes: this service provides exact travel time and distance between locations for multiple transportation modes. In addition, the API response includes the route geometry as MultiLineString and details such as the speed limit for each section.
  • Places: this API provides points of interest for more than 500 categories such as restaurants, tourist attractions, and supermarkets. The POI search can be conducted using a bounding box, radius, city, or isoline. The source for the POIs is OpenStreetMap, meaning we rely on the quality this platform provides, which is usually for some POIs lower than Google Maps. Additionally, as we have seen before, the same information can be obtained with the Overpass API, which is completely free.
  • Reachability & Analysis: this service enables the analysis of the reachability of locations using isochrones and isodistances for multiple traveling modes.

As we can observe, many of the services that Geoapify offers are available on other platforms or APIs. This should not surprise us, as nowadays many companies offer similar services.

Advantages

  • Extensive suite of APIs: the range of geospatial services offered by this platform is extensive, in contrast to other platforms that only specialize in a specific service.
  • Availability of a free tier: although the API is not completely free, it offers a fairly generous plan with 3,000 credits per day. Paid accounts are also quite economical compared to other APIs; the most expensive one provides 100,000 credits per day for 249 euros.

Disadvantages

  • Limited coverage: some services rely on information available from OpenStreetMap. However, as we mentioned earlier, this information is not always complete, depending on the geographic region.
  • Alternative tools: as mentioned earlier, multiple companies are offering similar geospatial services and analyses. Therefore, it is possible to find platforms that specialize exclusively in one of the services provided by Geoapify.

License

The Geoapify Free plan can be used in commercial projects. However, you must provide an appropriate Geoapify attribution or link to the website, for example a "Powered by Geoapify" link.

Pricing | Geoapify Location Platform

Use Case — Walking Distance Accessibility Analysis

As already stated, Geoapify offers a great variety of services. In this example, the Isochrone API, available in the Reachability & Analysis section, will be used. This specific service returns a polygon covering the area reachable by a certain transportation mode within a specified travel time from the chosen location. Here, we will calculate walking distances; however, besides walking, the API also calculates isochrones for other transportation modes such as driving, cycling, or transit.

The get_isochrone function performs an API call to obtain the isochrone around a location in the Moratalaz neighborhood of Madrid, but you can use any location you want for your own study or test. The desired walking time was set to 10 minutes; it is provided in seconds, since the API expects ranges in this unit.
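A sketch of such a function is shown below, assuming the Geoapify Isoline endpoint and its lat, lon, type, mode, and range parameters; the coordinates are an approximate point in Moratalaz.

import requests

API_KEY = "<YOUR_GEOAPIFY_KEY>"

def get_isochrone(lat: float, lon: float, travel_time_s: int = 600) -> dict:
    """Request a walking isochrone (reachable area) around a location from the Geoapify Isoline API."""
    params = {
        "lat": lat,
        "lon": lon,
        "type": "time",          # time-based isochrone rather than isodistance
        "mode": "walk",
        "range": travel_time_s,  # travel time in seconds (10 minutes = 600 s)
        "apiKey": API_KEY,
    }
    response = requests.get("https://api.geoapify.com/v1/isoline", params=params)
    response.raise_for_status()
    return response.json()

# Approximate coordinates of a point in the Moratalaz neighborhood of Madrid
isochrone = get_isochrone(40.4057, -3.6447)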

Part of the Isochrone API response (Image Created by the Author)

As shown above, the API response does not provide a MultiPolygon, but a list of its coordinates. Therefore, for further analysis using this data, we will often need to convert it into a Polygon.

The following code shows how to convert the API response into a MultiPolygon which we can easily visualize using folium.
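A minimal sketch of that conversion, assuming the response is a GeoJSON FeatureCollection whose first feature holds the isochrone geometry:

import folium
from shapely.geometry import shape

# The first feature of the response holds the isochrone geometry as GeoJSON
geometry = isochrone["features"][0]["geometry"]
multipolygon = shape(geometry)  # shapely MultiPolygon for further analysis

# Visualize the reachable area together with the input location
m = folium.Map(location=[40.4057, -3.6447], zoom_start=14)
folium.GeoJson(geometry).add_to(m)
folium.Marker([40.4057, -3.6447], popup="Input location").add_to(m)
m.save("moratalaz_isochrone.html")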

Isochrone Around an Input Location (Image Created by the Author). Powered by Geoapify.

However, you’re probably wondering what type of analysis these isodistance polygons are useful for. Here, we’ve limited ourselves to visualizing them, but this type of analysis could be useful for evaluating, for example, the services available around a home, such as restaurants, supermarkets, and stores. This could influence housing prices. Another possible analysis could be to evaluate the area of the city that can be reached by public transport in less than 30 minutes. These temporal analyses are far more accurate than simple distance radii.

4. Amadeus

The Amadeus API is not specifically an API devoted to general geospatial data, but rather to the data corresponding to the tourism sector. However, many of the services they offer can be considered as geographic data of interest to various analyses.

Next, we will mention the main services they offer related to geospatial data:

  • Hotels APIs: this API provides the location of more than 150,000 hotels worldwide. You can search for hotels inside a city or an area. You can use the HotelID provided in the response as input to the Hotel Search API to obtain details about room prices and services.
  • Points of Interest APIs: the API makes available information on points of interest regarding the tourism sector.

If you are interested in the tourism sector, you may want to explore what Amadeus offers at the following link.

Connect to Amadeus travel APIs | Amadeus for Developers

Advantages

  • Sector-specific API: the Amadeus API is one of the most powerful APIs for retrieving tourism-related data, outperforming other popular and generic APIs such as Google Maps or OpenStreetMap.

Disadvantages

  • Complicated documentation: the documentation offered is more complex compared to other API documentation. It takes quite a bit of time to become familiar with it before starting to use the API and building the pipeline for data extraction.

License

On their support page, there is a list of industries that could benefit from the use of the API. Commercial use is allowed; however, no specific license is mentioned.

Connect to Amadeus travel APIs | Amadeus for Developers

Use Case — Obtaining Hotels in Ingolstadt

The proposed example uses the Hotel List API to retrieve a list of hotels located in Ingolstadt. We will search for hotels using the geocode of the city rather than the city name. The API provides multiple filtering options, such as hotel amenities, hotel stars, hotel chain, or radius from the specified input location. In this case, we will only use the proximity filter for our search.

First, we need to log in to the platform and go to the My Self-Service Workspace section. There, in the My Apps section, we can create our app. Once the app is created, Amadeus provides us with two keys: an API Key and an API Secret.

Amadeus for Developers uses OAuth to authenticate access requests. OAuth generates an access token which grants the client permission to access a protected resource. Once you have created an app and received your API Key and API Secret, you can generate an access token by sending a POST request to the authorization server:

https://test.api.amadeus.com/v1/security/oauth2/token

The get_access function demonstrates how to get the access token using the OAuth authentication method.
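A sketch of such a function, using the standard OAuth2 client-credentials flow against the token endpoint above:

import requests

TOKEN_URL = "https://test.api.amadeus.com/v1/security/oauth2/token"

def get_access(api_key: str, api_secret: str) -> str:
    """Exchange the API Key and API Secret for a temporary OAuth access token."""
    payload = {
        "grant_type": "client_credentials",
        "client_id": api_key,
        "client_secret": api_secret,
    }
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    response = requests.post(TOKEN_URL, data=payload, headers=headers)
    response.raise_for_status()
    return response.json()["access_token"]

access_token = get_access("<API_KEY>", "<API_SECRET>")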

The OAuth authentication method provides higher security because tokens are temporary and expire after a certain period, in contrast to API keys, which always remain the same. This contrasts with the other APIs mentioned in this article: all of them, except the Overpass API (which requires no authentication), authenticate simply with an API Key.

Once the app is authenticated, we need to obtain the Ingolstadt coordinates (latitude and longitude) from Google Maps. Additionally, we evaluated different distance radii on the map to select an appropriate one. To measure a distance in Google Maps, right-click on the map, select Measure Distance, and then click on the destination to obtain the measurement. After evaluating different distances, we selected a radius of 5 kilometers for the hotel retrieval.

Analyzing Distance from the City Center with Google Maps (Image created by the Author)

Finally, we perform the API call to the Hotel List API endpoint, obtaining a list of hotels located within a 5-kilometer radius of Ingolstadt city center.
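A sketch of that call is shown below, reusing the token obtained above. The by-geocode endpoint and parameter names reflect the Amadeus Hotel List documentation as I recall it, and the Ingolstadt coordinates are approximate, so double-check both against the current docs.

import requests

HOTEL_LIST_URL = "https://test.api.amadeus.com/v1/reference-data/locations/hotels/by-geocode"

params = {
    "latitude": 48.7665,    # approximate Ingolstadt city center
    "longitude": 11.4258,
    "radius": 5,
    "radiusUnit": "KM",
}
headers = {"Authorization": f"Bearer {access_token}"}

response = requests.get(HOTEL_LIST_URL, params=params, headers=headers)
hotels_ingolstadt = response.json()["data"]  # list of hotels with hotelId, name, and geoCode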

Hotel List in Ingolstadt provided by the Amadeus API (Image Created by the Author)

For the Hotel List API, the free trial version includes a total of 2,000 requests per month.

One aspect I would like to point out is that the Amadeus API did not return all the available hotels in Ingolstadt for the evaluated radius; some hotels are missing from the response. The positive aspect is that, for the hotels it does return, price tracking can be done easily, which can be useful for more advanced analyses.

5. Mapillary

The Mapillary API provides access to all images available on the Mapillary platform. This platform is an open database of street-level images contributed by volunteers across the globe. The Mapillary API is especially useful for projects where a large number of street images need to be analyzed without collecting them manually. Such projects include urban planning, traffic condition analysis, or traffic sign recognition.

Advantages

  • Notable number of images: this is a free database that contains an extensive number of images, all of them with their corresponding metadata. This database is continuously being developed and updated.

Disadvantages

  • Complicated documentation: the documentation offered is more complex compared to other API documentation. It takes quite a bit of time to become familiar with it before starting to use the API and building the pipeline for data extraction.
  • Limited Coverage: the images available on the platform depend on user contributions, which is why there are regions with extensive coverage and others where it is minimal.

License

When you upload imagery to Mapillary, you give Mapillary the rights to use the images for commercial purposes. However, you still own the full rights to the images you contributed, and you always will.

Licenses

Use Case — Images of Valencia City Center

We use the Mapillary API to obtain images of the Valencia city center. Before actually retrieving images, we must define a bounding box that determines the search area for images.

A bounding box is defined by two pairs of coordinates: the bottom-left and the top-right corners. The Export section of OpenStreetMap allows you to manually select and export a bounding box.

Selection of a Bounding Box in OpenStreetMap (Image created by the Author)

This bounding box is used to select the tiles that define the image search area. The Mapillary API does not define the search area using bounding boxes but rather uses a zoom-level {z}, an x-tile coordinate {x}, and a y-tile coordinate {y}.

We will not go into detail about how to obtain both the x-tile and y-tile from a latitude, longitude, and zoom level, but if you are interested in understanding how the calculations are done, I invite you to read the following article.

How to calculate number of tiles in a bounding box for OpenStreetMaps

We will perform the transformation using the mercantile library. This library provides a function to fetch all the available tiles inside a bounding box. We will make an API call for each available tile in the bounding box to get the available images. The API response is in vector tile format; we will convert it to GeoJSON format using the vt_bytes_to_geojson function.

The function get_tiles_in_bbox provides as output a list of GeoJSON elements each of them containing information on the images available in the tiles within the bounding box.
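A sketch of what that function could look like is shown below, assuming Mapillary's public vector-tile endpoint and zoom level 14 for the image layer; the Valencia bounding box coordinates are approximate.

import mercantile
import requests
from vt2geojson.tools import vt_bytes_to_geojson

ACCESS_TOKEN = "<YOUR_MAPILLARY_TOKEN>"
TILE_URL = "https://tiles.mapillary.com/maps/vtp/mly1_public/2/{z}/{x}/{y}?access_token={token}"

def get_tiles_in_bbox(west: float, south: float, east: float, north: float, zoom: int = 14) -> list:
    """Return one GeoJSON object per tile intersecting the bounding box, describing the available images."""
    geojson_tiles = []
    for tile in mercantile.tiles(west, south, east, north, zoom):
        url = TILE_URL.format(z=tile.z, x=tile.x, y=tile.y, token=ACCESS_TOKEN)
        response = requests.get(url)
        response.raise_for_status()
        # The response is a Mapbox vector tile; convert it to GeoJSON
        geojson_tiles.append(vt_bytes_to_geojson(response.content, tile.x, tile.y, tile.z))
    return geojson_tiles

# Approximate bounding box around the Valencia city center (west, south, east, north)
tiles_geojson = get_tiles_in_bbox(-0.380, 39.465, -0.368, 39.476)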

Function response (Image created by the Author)

For each image, the following information is provided:

  • captured_at (int): timestamp in ms since epoch
  • compass_angle (int): the compass angle of the image
  • creator_id (int): unique user ID of the image owner (not username)
  • id (int): ID of the image
  • sequence_id (string): ID of the sequence this image belongs to
  • organization_id (int): ID of the organization this image belongs to. It can be absent
  • is_pano (bool): if it is a panoramic image

Now we have all the image IDs and can proceed with their download. Below is shown how to download an image and visualize it in the Jupyter Notebook. However, a pipeline could be created to download all available pictures. Additionally, using the sequence_id, the images corresponding to the same sequence could be grouped into a single folder.
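A sketch of that download step, assuming the Mapillary Graph API image endpoint with a thumbnail field and reusing the access token from the previous snippet; the image ID is a placeholder to replace with one taken from the GeoJSON features above.

import requests
from IPython.display import Image, display

image_id = "123456789012345"  # placeholder: use a real ID from the GeoJSON features

# Ask the Graph API for a thumbnail URL of the image
meta = requests.get(
    f"https://graph.mapillary.com/{image_id}",
    params={"fields": "thumb_2048_url", "access_token": ACCESS_TOKEN},
).json()

# Download the image bytes, save them, and display the file in the notebook
image_bytes = requests.get(meta["thumb_2048_url"]).content
with open(f"{image_id}.jpg", "wb") as f:
    f.write(image_bytes)
display(Image(filename=f"{image_id}.jpg"))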

Visualization of the image in the Jupyter Notebook (Image Created by the Author)

It is important to note that many of the images provided in response to the last API call are outside the bounding box we previously defined. This is because the tile is much larger in dimension than the bounding box. Therefore, if we only want to retrieve the images located within the bounding box, we should use the image coordinates and check if they are within the bounding box to filter them accordingly.

APIs allow for the large-scale acquisition of relevant information. Geographic data plays a crucial role in many projects, so obtaining it with the tools available on the market is an essential skill for data scientists. First things first: understand what is at our disposal and what it can offer. In this post, we evaluated five APIs for obtaining geographic data, but there are many more that did not make this top 5. I invite you to do some further research and try the tools mentioned in this post; you will find they are really useful.

Thank you very much for reading,

Amanda Iglesias




Getting Started with Powerful Data Tables in Your Python Web Apps

Using AG Grid to build a Finance app in pure Python with Reflex

These past few months, I’ve been exploring various data visualization and manipulation tools for web applications. As a Python developer, I often need to handle large datasets and display them in interactive, customizable tables. One question that consistently bothered me was: How can I build a powerful data grid UI that integrates seamlessly with my Python backend?

There are countless options out there to build sophisticated data grids, but as a Python engineer, I have limited experience with JavaScript or any front-end framework. I was looking for a way to create a feature-rich data grid using only the language I’m most comfortable with — Python!

I decided to use Reflex, an open-source framework that lets me build web apps entirely in Python. What's more, Reflex now offers integration with AG Grid, a feature-rich data grid library designed for displaying and manipulating tabular data in web applications, which offers a wide array of functionalities, including:

- In-place cell editing

- Real-time data updates

- Pagination and infinite scrolling

- Column filtering, reordering, resizing, and hiding

- Row grouping and aggregation

- Built-in theming

Disclaimer: I work as a Founding Engineer at Reflex where I contribute to the open-source framework.

In this tutorial we will cover how to build a full Finance app from scratch in pure Python to display stock data in an interactive grid and graph with advanced features like sorting, filtering, and pagination — Check out the full live app and code.

Setup

First we import the necessary libraries, including yfinance for fetching the stock data.

import reflex as rx
from reflex_ag_grid import ag_grid
import yfinance as yf
from datetime import datetime, timedelta
import pandas as pd

Fetching and transforming data

Next, we define the State class, which contains the application’s state and logic. The fetch_stock_data function fetches stock data for the specified companies and transforms it into a format suitable for display in AG Grid. We call this function when clicking on a button, by linking the on_click trigger of the button to this state function.

We define state variables, any fields in your app that may change over time (A State Var is directly rendered into the frontend of the app).

The data state variable stores the raw stock data fetched from Yahoo Finance. We transform this data to round the values and store it as a list of dictionaries, which is the format that AG Grid expects. The transformed data is sorted by date and ticker in descending order and stored in the dict_data state variable.

The datetime_now state variable stores the current datetime when the data was fetched.

# The list of companies to fetch data for
companies = ["AAPL", "MSFT", "GOOGL", "AMZN", "META"]

class State(rx.State):
    # The data fetched from Yahoo Finance
    data: pd.DataFrame
    # The data to be displayed in the AG Grid
    dict_data: list[dict] = [{}]
    # The datetime of the current fetched data
    datetime_now: datetime = datetime.now()

    def fetch_stock_data(self):
        self.datetime_now = datetime.now()
        start_date = self.datetime_now - timedelta(days=180)

        # Fetch data for all tickers in a single download
        self.data = yf.download(companies, start=start_date, end=self.datetime_now, group_by='ticker')
        rows = []
        for ticker in companies:
            # Check if the DataFrame has a multi-level column index (for multiple tickers)
            if isinstance(self.data.columns, pd.MultiIndex):
                ticker_data = self.data[ticker]  # Select the data for the current ticker
            else:
                ticker_data = self.data  # If only one ticker, no multi-level index exists

            for date, row in ticker_data.iterrows():
                rows.append({
                    "ticker": ticker,
                    "date": date.strftime("%Y-%m-%d"),
                    "open": round(row["Open"], 2),
                    "high": round(row["High"], 2),
                    "mid": round((row["High"] + row["Low"]) / 2, 2),
                    "low": round(row["Low"], 2),
                    "close": round(row["Close"], 2),
                    "volume": int(row["Volume"]),
                })

        self.dict_data = sorted(rows, key=lambda x: (x["date"], x["ticker"]), reverse=True)

rx.button(
    "Fetch Latest Data",
    on_click=State.fetch_stock_data,
)

Defining the AG Grid columns

Columns of AG Grid by Author

The column_defs list defines the columns to be displayed in the AG Grid. The header_name is used to set the header title for each column. The field key represents the id of each column. The filter key is used to insert the filter feature.

column_defs = [
    ag_grid.column_def(field="ticker", header_name="Ticker", filter=ag_grid.filters.text, checkbox_selection=True),
    ag_grid.column_def(field="date", header_name="Date", filter=ag_grid.filters.date),
    ag_grid.column_def(field="open", header_name="Open", filter=ag_grid.filters.number),
    ag_grid.column_def(field="high", header_name="High", filter=ag_grid.filters.number),
    ag_grid.column_def(field="low", header_name="Low", filter=ag_grid.filters.number),
    ag_grid.column_def(field="close", header_name="Close", filter=ag_grid.filters.number),
    ag_grid.column_def(field="volume", header_name="Volume", filter=ag_grid.filters.number),
]

Displaying AG Grid

AG Grid by Author

Now for the most important part of our app, AG Grid itself!

• id is required because it uniquely identifies the Ag-Grid instance on the page

• column_defs is the list of column definitions we defined earlier

• row_data is the data to be displayed in the grid, which is stored in the dict_data State var

• pagination, pagination_page_size and pagination_page_size_selector parameters enable pagination with specific variables in the grid

• theme enables you to set the theme of the grid

ag_grid(
    id="myAgGrid",
    column_defs=column_defs,
    row_data=State.dict_data,
    pagination=True,
    pagination_page_size=20,
    pagination_page_size_selector=[10, 20, 50, 100],
    theme=State.grid_theme,
    on_selection_changed=State.handle_selection,
    width="100%",
    height="60vh",
)

Changing AG Grid Theming

Changing AG Grid Theme by Author

We set the theme using the grid_theme State var in the rx.select component.

Every state var has a built-in convenience function to set its value, called set_VARNAME, in this case set_grid_theme.

class State(rx.State):
    ...
    # The theme of the AG Grid
    grid_theme: str = "quartz"
    # The list of themes for the AG Grid
    themes: list[str] = ["quartz", "balham", "alpine", "material"]

rx.select(
    State.themes,
    value=State.grid_theme,
    on_change=State.set_grid_theme,
    size="1",
)

Showing Company Data in a Graph

Showing 6 Months of Selected Company Data by Author

The on_selection_changed event trigger, shown in the AG Grid code above, is called when the user selects a row in the grid. This calls the handle_selection method in the State class, which sets the selected_rows state var to the newly selected rows and calls update_line_graph.

The update_line_graph function gets the relevant ticker and uses it to set the company state var. The Date, Mid, and DayDifference data for that company for the past 6 months is then set to the state var dff_ticker_hist.

Finally, it is rendered in an rx.recharts.line_chart, using rx.recharts.error_bar to show the DayDifference data, which covers the high and the low for each day.

class State(rx.State):
    ...
    # The selected rows in the AG Grid
    selected_rows: list[dict] = None
    # The currently selected company in AG Grid
    company: str
    # The data fetched from Yahoo Finance
    data: pd.DataFrame
    # The data to be displayed in the line graph
    dff_ticker_hist: list[dict] = None

    def handle_selection(self, selected_rows, _, __):
        self.selected_rows = selected_rows
        self.update_line_graph()

    def update_line_graph(self):
        if self.selected_rows:
            ticker = self.selected_rows[0]["ticker"]
        else:
            self.dff_ticker_hist = None
            return
        self.company = ticker

        dff_ticker_hist = self.data[ticker].reset_index()
        dff_ticker_hist["Date"] = pd.to_datetime(dff_ticker_hist["Date"]).dt.strftime("%Y-%m-%d")

        dff_ticker_hist["Mid"] = (dff_ticker_hist["Open"] + dff_ticker_hist["Close"]) / 2
        dff_ticker_hist["DayDifference"] = dff_ticker_hist.apply(
            lambda row: [row["High"] - row["Mid"], row["Mid"] - row["Low"]], axis=1
        )

        self.dff_ticker_hist = dff_ticker_hist.to_dict(orient="records")

rx.recharts.line_chart(
    rx.recharts.line(
        rx.recharts.error_bar(
            data_key="DayDifference",
            direction="y",
            width=4,
            stroke_width=2,
            stroke="red",
        ),
        data_key="Mid",
    ),
    rx.recharts.x_axis(data_key="Date"),
    rx.recharts.y_axis(domain=["auto", "auto"]),
    data=State.dff_ticker_hist,
    width="100%",
    height=300,
)

Conclusion

Using AG Grid inside the Reflex ecosystem empowered me as a Python developer to create sophisticated, data-rich web applications with ease. Whether you’re building complex dashboards, data analysis tools, or an application that demands powerful data grid capabilities, Reflex AG Grid has you covered.

I’m excited to see what you’ll build with Reflex AG Grid! Share your projects, ask questions, and join the discussion in our community forums. Together, let’s push the boundaries of what’s possible with Python web development!

If you have questions, please comment them below or message me on Twitter at @tgotsman12 or on LinkedIn. Share your app creations on social media and tag me, and I’ll be happy to provide feedback or help retweet!



How to succeed with AI: Combining Kafka and AI Guardrails

Why real-time data and governance are non-negotiable for AI

Photo by Sid Verma on Unsplash

Kafka is great. AI is great. What happens when we combine both? Continuity.

AI is changing many things about our efficiency and how we operate: sublime translations, customer interactions, code generation, driving our cars, etc. Even if we love cutting-edge things, we're all having a hard time keeping up with it.

There is a massive problem we tend to forget: AI can easily go off the rails without the right guardrails. And when it does, it’s not just a technical glitch, it can lead to disastrous consequences for the business.

From my own experience as a CTO, I’ve seen firsthand that real AI success doesn’t come from speed alone. It comes from control — control over the data your AI consumes, how it operates, and ensuring it doesn’t deliver the wrong outputs (more on this below).

The other part of the success is about maximizing the potential and impact of AI. That's where Kafka and data streaming enter the game.

Both AI Guardrails and Kafka are key to scaling a safe, compliant, and reliable AI.

AI without Guardrails is an open book

One of the biggest risks when dealing with AI is the absence of built-in governance. When you rely on AI/LLMs to automate processes, talk to customers, handle sensitive data, or make decisions, you’re opening the door to a range of risks:

  • data leaks (and prompt leaks, as we're used to seeing)
  • privacy breaches and compliance violations
  • data bias and discrimination
  • out-of-domain prompting
  • poor decision-making

Remember March 2023? OpenAI had an incident where a bug caused chat data to be exposed to other users. The bottom line is that LLMs don’t have built-in security, authentication, or authorization controls. An LLM is like a massive open book — anyone accessing it can potentially retrieve information they shouldn’t. That’s why you need a robust layer of control and context in between, to govern access, validate inputs, and ensure sensitive data remains protected.

There is where AI guardrails, like NeMo (by Nvidia) and LLM Guard, come into the picture. They provide essential checks on the inputs and outputs of the LLM:

  • prompt injections
  • filtering out biased or toxic content
  • ensuring personal data isn’t slipping through the cracks.
  • out-of-context prompts
  • jailbreaks
Image by the author

https://github.com/leondz/garak is an LLM vulnerability scanner. It checks if an LLM can be made to fail in a way we don’t want. It probes for hallucination, data leakage, prompt injection, misinformation, toxicity generation, jailbreaks, and many other weaknesses.

What’s the link with Kafka?

Kafka is an open-source platform designed for handling real-time data streaming and sharing within organizations. And AI thrives on real-time data to remain useful!

Feeding AI static, outdated datasets is a recipe for failure: it will only function up to a certain point, after which it won't have fresh information. Think about ChatGPT always having a 'cut-off' date in the past. AI becomes practically useless if, for example, during customer support, it doesn't have the latest invoice of the customer asking about it because the data isn't up to date.

Methods like RAG (Retrieval Augmented Generation) fix this issue by providing AI with relevant, real-time information during interactions. RAG works by ‘augmenting’ the prompt with additional context, which the LLM processes to generate more useful responses.

Guess what is frequently paired with RAG? Kafka. What better solution to fetch real-time information and seamlessly integrate it with an LLM? Kafka continuously streams fresh data, which can be combined with an LLM through a simple HTTP API in front. One critical aspect is to ensure the quality of the data being streamed into Kafka is under control: no bad data should enter the pipeline (data validations), or it will spread throughout your AI processes, leading to inaccurate outputs, biased decisions, and security vulnerabilities.
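As a rough illustration only (the topic name, message format, and prompt template are all invented, and the consumer uses the kafka-python client), this is roughly what feeding streamed context into a RAG prompt can look like:

from kafka import KafkaConsumer
import json

# Consume fresh invoice events from a (hypothetical) Kafka topic
consumer = KafkaConsumer(
    "customer-invoices",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def build_rag_prompt(question: str, context_events: list[dict]) -> str:
    """Augment the user question with the latest streamed context before sending it to the LLM."""
    context = "\n".join(json.dumps(event) for event in context_events)
    return f"Context (latest customer data):\n{context}\n\nQuestion: {question}"

# Keep a small buffer of the most recent events for the prompt
latest_events = []
for message in consumer:
    latest_events.append(message.value)
    if len(latest_events) >= 5:
        break

prompt = build_rag_prompt("What is the amount of my latest invoice?", latest_events)
# `prompt` would now be sent to the LLM sitting behind your guardrails layer / HTTP API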

A typical streaming architecture combining Kafka, AI Guardrails, and RAG:

Image by the author

Gartner predicts that by 2025, organizations leveraging AI and automation will cut operational costs by up to 30%. Faster, smarter.

Should we care about AI Sovereignty? Yes.

AI sovereignty is about ensuring that you fully control where your AI runs, how data is ingested, processed, and who has access to it. It’s not just about the software, it’s about the hardware as well, and the physical place things are happening.

Sovereignty is about the virtual, physical infrastructure and geopolitical boundaries where your data resides. We live in a physical world, and while AI might seem intangible, it’s bound by real-world regulations.

For instance, depending on where your AI infrastructure is hosted, different jurisdictions may demand access to your data (e.g. the States!), even if it’s processed by an AI model. That’s why ensuring sovereignty means controlling not just the code, but the physical hardware and the environment where the processing happens.

Technologies like Intel SGX (Software Guard Extensions) and AMD SEV (Secure Encrypted Virtualization) offer this kind of protection. They create isolated execution environments that protect sensitive data and code, even from potential threats inside the host system itself. And solutions like Mithril Security are also stepping up, providing Confidential AI where the AI provider cannot even access the data processed by their LLM.

Image by the author

Conclusion

It’s clear that AI guardrails and Kafka streaming are the foundation to make use-cases relying on AI successful. Without Kafka, AI models operate on stale data, making them unreliable and not very useful. And without AI guardrails, AI is at risk of making dangerous mistakes — compromising privacy, security, and decision quality.

This formula is what keeps AI on track and in control. The risks of operating without it are simply too high.



AI-Powered Corrosion Detection for Industrial Equipment: A Scalable Approach with AWS

A Complete AWS ML Solution with SageMaker, Lambda, and API Gateway

Photo by Monstera Production: https://www.pexels.com/photo/textured-background-of-metal-lattice-against-brick-wall-7794453/

Introduction

Industries like manufacturing, energy, and telecommunications require extensive quality control to ensure that their equipment remains operational. One persistent issue that most components are subject to is corrosion: the gradual degradation of metals caused by environmental factors. If left unchecked, corrosion can lead to health hazards, machinery downtime, and infrastructure failure.

This project demonstrates an approach for fully automating the corrosion detection process with the use of cloud computing. Specifically, it utilizes Amazon Sagemaker, Lambda, and API Gateway to build a scalable, efficient, and fault-tolerant quality control solution.

Data

The data for this project was procured from the Synthetic Corrosion Dataset (CC BY 4.0), which contains hundreds of synthetic images. Each image is classified as either Corrosion or Not Corrosion.

The data source provides the images in separate folders for training, testing, and validation datasets, so splitting is unnecessary. The training, validation, and testing sets have 270, 8, and 14 images, respectively.

Image of Corrosion (Left) and Image of No Corrosion (Right) (Created by Author)

All images are stored in an s3 bucket with the following directory structure:

/train
    /Corrosion
    /Not Corrosion
/test
    /Corrosion
    /Not Corrosion
/valid
    /Corrosion
    /Not Corrosion

The Workflow

Cloud Solution (Created by Author)

In the cloud solution, a user submits an image classification request to the API integrated with a Lambda function. The Lambda function fetches the image in the S3 bucket and then classifies it using the SageMaker endpoint. The result of the classification is returned to the user as an API response.

Preprocessing the Data

The ImageDataGenerator in the Keras library loads, preprocesses, and transforms images. All images are normalized, while only the training data is augmented with operations such as rotations and flipping.

Image augmentation is an essential step, given the small number of images available.

Keras automatically assigns labels to the images based on the folder they are in:
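A sketch of this preprocessing step, assuming 224x224 inputs and the dataset synced locally from S3 (the paths, batch sizes, and augmentation parameters are assumptions):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Normalize all images; augment only the training data with rotations and flips
train_datagen = ImageDataGenerator(rescale=1.0 / 255, rotation_range=20, horizontal_flip=True, vertical_flip=True)
valid_datagen = ImageDataGenerator(rescale=1.0 / 255)

# Labels (Corrosion / Not Corrosion) are inferred from the sub-folder names
train_generator = train_datagen.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=16, class_mode="binary"
)
valid_generator = valid_datagen.flow_from_directory(
    "data/valid", target_size=(224, 224), batch_size=8, class_mode="binary"
)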

Creating the Model

Sagemaker Model (Created by Author)

The next step is to define the neural network architecture of the model that is to be trained. Given the low volume of data accessible, there is merit in using a pre-trained model, which already has configured weights that can discern features in images.

The project leverages MobileNetV2, a high-performance model that is relatively memory-efficient.
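A sketch of a MobileNetV2-based binary classifier; the classification head and its layer sizes are assumptions:

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras import layers, models

# Pre-trained convolutional base with frozen ImageNet weights
base_model = MobileNetV2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False

# Small classification head for the binary Corrosion / Not Corrosion task
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])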

Training the Model

The model is trained for 20 epochs, with early stopping included to reduce run time.
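For instance (the patience value is an assumption):

from tensorflow.keras.callbacks import EarlyStopping

# Stop training early if the validation loss stops improving
early_stopping = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

history = model.fit(
    train_generator,
    validation_data=valid_generator,
    epochs=20,
    callbacks=[early_stopping],
)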

Deploying the Model

Sagemaker Endpoint (Created by Author)

This model must now be deployed to a Sagemaker endpoint.

To do so, it is first saved as a tar.gz file and exported to S3.

Given that the current model is custom-built, it needs to be converted into a TensorFlowModel object that is compatible with SageMaker's containers before deployment.

With the TensorFlowModel object created, the model can be deployed with a simple one-liner:
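A sketch of these steps with the SageMaker Python SDK; the bucket name, framework version, and instance type are assumptions:

import sagemaker
from sagemaker.tensorflow import TensorFlowModel

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# The trained model was saved in TensorFlow SavedModel format, packed into model.tar.gz,
# and uploaded to S3 beforehand
model_data = session.upload_data("model.tar.gz", bucket="corrosion-detection-data", key_prefix="model")

tf_model = TensorFlowModel(
    model_data=model_data,
    role=role,
    framework_version="2.12",
)

# The "one-liner" deployment to a real-time endpoint
predictor = tf_model.deploy(initial_instance_count=1, instance_type="ml.m5.large")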

For clarity on the syntax used for deploying the model, please check out the Sagemaker documentation.

Creating the Lambda Function

Lambda Function (Created by Author)

By calling the endpoint with a Lambda function, applications outside of Sagemaker will be able to utilize the model to classify images.

The Lambda function, sketched below, will do the following:

  1. Access the image in the given S3 directory
  2. Preprocess the image to be compatible with the model
  3. Generate and output the model’s prediction
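A minimal sketch of such a handler, matching the test event format shown below; the endpoint name is assumed, and in practice Pillow and NumPy would need to be packaged with the function (for example, as a Lambda layer):

import json
import boto3
import numpy as np
from PIL import Image
from io import BytesIO

s3 = boto3.client("s3")
runtime = boto3.client("sagemaker-runtime")

ENDPOINT_NAME = "corrosion-detection-endpoint"  # assumed endpoint name

def lambda_handler(event, context):
    # 1. Access the image in the given S3 directory
    obj = s3.get_object(Bucket=event["s3_bucket"], Key=event["s3_key"])
    image = Image.open(BytesIO(obj["Body"].read())).convert("RGB")

    # 2. Preprocess the image to match the model input (224x224, normalized)
    array = np.array(image.resize((224, 224))) / 255.0
    payload = json.dumps({"instances": [array.tolist()]})

    # 3. Generate and output the model's prediction
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME, ContentType="application/json", Body=payload
    )
    probability = json.loads(response["Body"].read())["predictions"][0][0]
    # Class indices from flow_from_directory are alphabetical: 0 = Corrosion, 1 = Not Corrosion
    label = "Corrosion" if probability < 0.5 else "Not Corrosion"
    return {"statusCode": 200, "body": json.dumps({"prediction": label})}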

A quick test with a test event using an image in S3 as input confirms that the function is operational. Here is the test image, named “pipe.jpg”.

Test Image (Created by Author)

The image is classified with the following test event:

{
    "s3_bucket": "corrosion-detection-data",
    "s3_key": "images-to-classify/pipe.jpg"
}

As shown below, the image is classified as “Corrosion”.

Test Result (Created by Author)

Building the API

API Gateway (Created by Author)

Creating an API that integrates the Lambda function increases both the usability and security of the Sagemaker model.

In AWS, this can be accomplished by creating a REST API in the API Gateway console:

REST API (Created by Author)

A task like image classification can only be done through a POST request, since users need to send information to the server. Thus, a POST method that integrates the Lambda function is created in the REST API:

Once the method is integrated with the Lambda function, the API can be deployed for use, thereby allowing other applications access to the SageMaker model.

For instance, a curl command can call the API to classify images. The following is the syntax:

curl -X POST <API Gateway Invoke URL> \
    -H "Content-Type: application/json" \
    -d '{
        "s3_bucket": <S3 Bucket Name>,
        "s3_key": <S3 Key Name>
    }'
Code Output (Created by Author)

The API is now fully operational!

Benefits of the Solution

Utilizing cloud computing services to handle everything from model training to API deployment brings many benefits.

1. Efficiency

SageMaker enables models to be trained quickly and deployed. Furthermore, API Gateway and Lambda would allow the users to classify images from a single interface in near real-time.

2. Scalability

Amazon Lambda and Sagemaker both offer the scalability needed to adjust to changes in workloads. This ensures that the solutions remain operational regardless of the amount of traffic.

3. Security

AWS allows users to create mechanisms such as API keys and rate limits to protect the API (and the underlying model) from malicious actors. This guarantees that only authorized users will be able to access the API.

4. Cost Efficiency

Both Amazon SageMaker and Lambda use pay-as-you-go models, meaning there is no risk of paying for overprovisioned resources. Both services scale according to the workload and only charge for the compute power used when processing a request.

Limitations (and Potential Fixes)

Despite the number of advantages of using this cloud solution, there are certain areas in which it is lacking that could be addressed with some minor changes to the workflow.

1. Minimal Training Data

The training data is lacking in both quantity and variety. Most pictures are of pipes and corrosion, so it is unclear how the model would classify other objects, such as boilers and turbine blades. To improve the model’s general performance across different use cases, a more extensive data collection effort is required.

2. No Support for Batching

The current approach allows users to classify images one at a time. However, this could become a tedious endeavor as the number of images needing classification rises. Batching would be an appropriate remedy for this issue, offering a simple way to classify multiple images at once.

3. No Real-Time Alerts

Corrosion found in equipment needs to be dealt with as soon as possible. However, the current cloud architecture does not trigger any notifications when corrosion is detected in any image. An SNS topic that pushes messages whenever the model identifies corrosion would help end users address these cases in real-time.

Conclusion

Photo by Alexas_Fotos on Unsplash

The combination of SageMaker, Lambda, and API Gateway allows for an efficient, automated, and scalable quality control solution. While the project focused on the classification of corroded equipment, the architecture can be applied to other computer vision use cases.

For access to the code, please check out the GitHub repository:

GitHub - anair123/Corrosion-Detection-With-AWS

Thank you for reading!

