AI-Powered Corrosion Detection for Industrial Equipment: A Scalable Approach with AWS

A Complete AWS ML Solution with SageMaker, Lambda, and API Gateway

Photo by Monstera Production: https://www.pexels.com/photo/textured-background-of-metal-lattice-against-brick-wall-7794453/

Introduction

Industries like manufacturing, energy, and telecommunications require extensive quality control to ensure that their equipment remains operational. One persistent issue that most components are subject to is corrosion: the gradual degradation of metals caused by environmental factors. If left unchecked, corrosion can lead to health hazards, machinery downtime, and infrastructure failure.

This project demonstrates an approach to fully automating the corrosion detection process using cloud computing. Specifically, it utilizes Amazon SageMaker, Lambda, and API Gateway to build a scalable, efficient, and fault-tolerant quality control solution.

Data

The data for this project was sourced from the Synthetic Corrosion Dataset (CC BY 4.0), which contains hundreds of synthetic images. Each image is classified as either Corrosion or Not Corrosion.

The data source provides the images in separate folders for training, testing, and validation datasets, so splitting is unnecessary. The training, validation, and testing sets have 270, 8, and 14 images, respectively.

Image of Corrosion (Left) and Image of No Corrosion (Right) (Created by Author)

All images are stored in an S3 bucket with the following directory structure:

/train
/Corrosion
/Not Corrosion
/test
/Corrosion
/Not Corrosion
/valid
/Corrosion
/Not Corrosion

The Workflow

Cloud Solution (Created by Author)

In the cloud solution, a user submits an image classification request to the API integrated with a Lambda function. The Lambda function fetches the image from the S3 bucket and then classifies it using the SageMaker endpoint. The result of the classification is returned to the user as an API response.

Preprocessing the Data

The ImageDataGenerator in the Keras library loads, preprocesses, and transforms images. All images are normalized, while only the training data is augmented with operations such as rotations and flipping.

Image augmentation is an essential step, given the small number of images available.

Keras automatically assigns labels to the images based on the folder they are in:
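
A minimal sketch of this preprocessing step might look like the following; the local directory paths, augmentation settings, and 224x224 target size are illustrative assumptions rather than the project's exact configuration:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training data; only normalize the validation data.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values to [0, 1]
    rotation_range=20,        # random rotations
    horizontal_flip=True,     # random flips
)
valid_datagen = ImageDataGenerator(rescale=1.0 / 255)

# flow_from_directory infers the label of each image from its folder name
# (Corrosion / Not Corrosion).
train_generator = train_datagen.flow_from_directory(
    "data/train",             # hypothetical local copy of the S3 /train prefix
    target_size=(224, 224),   # MobileNetV2's default input size
    batch_size=16,
    class_mode="binary",
)
valid_generator = valid_datagen.flow_from_directory(
    "data/valid",
    target_size=(224, 224),
    batch_size=16,
    class_mode="binary",
)

With class_mode="binary", Keras assigns class indices alphabetically, so Corrosion maps to 0 and Not Corrosion to 1, a detail that matters later when interpreting the endpoint's output.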

Creating the Model

Sagemaker Model (Created by Author)

The next step is to define the neural network architecture of the model that is to be trained. Given the low volume of data accessible, there is merit in using a pre-trained model, which already has configured weights that can discern features in images.

The project leverages MobileNetV2, a high-performance model that is relatively memory-efficient.
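
A sketch of what this transfer-learning setup could look like; the frozen base, pooling layer, and single sigmoid output are assumptions about the classification head rather than the project's exact architecture:

from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# Load MobileNetV2 pre-trained on ImageNet, without its classification head.
base_model = MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    weights="imagenet",
)
base_model.trainable = False  # keep the pre-trained weights frozen

# Add a small head for the binary Corrosion / Not Corrosion task.
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)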

Training the Model

The model is trained for 20 epochs, with early stopping included to reduce run time.
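
Reusing the model and data generators from the sketches above, the training call could look like this (the patience value is an assumption):

from tensorflow.keras.callbacks import EarlyStopping

# Stop training early if the validation loss stops improving.
early_stopping = EarlyStopping(
    monitor="val_loss",
    patience=3,                # assumed patience
    restore_best_weights=True,
)

history = model.fit(
    train_generator,
    validation_data=valid_generator,
    epochs=20,
    callbacks=[early_stopping],
)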

Deploying the Model

Sagemaker Endpoint (Created by Author)

This model must now be deployed to a SageMaker endpoint.

To do so, it is first saved as a tar.gz file and exported to S3.
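
A sketch of this packaging step, assuming a TensorFlow SavedModel export under a numbered version directory (the layout TensorFlow Serving expects) and the bucket name from the test event shown later:

import tarfile

import sagemaker

# Export the trained model in SavedModel format.
model.save("export/1")

# Package the SavedModel as model.tar.gz with a numbered version directory.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("export/1", arcname="1")

# Upload the archive to S3 so SageMaker can load it at deployment time.
sagemaker_session = sagemaker.Session()
model_data = sagemaker_session.upload_data(
    path="model.tar.gz",
    bucket="corrosion-detection-data",  # bucket from the test event shown later
    key_prefix="model",                 # assumed key prefix
)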

Given that the current model is custom-made, it will need to be wrapped in a TensorFlowModel object that is compatible with SageMaker's TensorFlow Serving containers before deployment.
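
A sketch of that wrapping step; the framework version is an assumption:

import sagemaker
from sagemaker.tensorflow import TensorFlowModel

# Wrap the model artifact so SageMaker's TensorFlow Serving container can host it.
tf_model = TensorFlowModel(
    model_data=model_data,                # S3 URI of model.tar.gz from the previous step
    role=sagemaker.get_execution_role(),  # IAM role with SageMaker permissions
    framework_version="2.12",             # assumed TensorFlow version
)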

With the TensorFlowModel object created, the model can be deployed with a simple one-liner:
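
For example (the instance type and endpoint name are illustrative):

predictor = tf_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",                   # assumed instance type
    endpoint_name="corrosion-detection-endpoint",  # hypothetical endpoint name
)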

For clarity on the syntax used for deploying the model, please check out the SageMaker documentation.

Creating the Lambda Function

Lambda Function (Created by Author)

By calling the endpoint with a Lambda function, applications outside of SageMaker will be able to utilize the model to classify images.

The Lambda function will do the following (a sketch of such a handler is shown after the list):

  1. Access the image in the given S3 directory
  2. Preprocess the image to be compatible with the model
  3. Generate and output the model’s prediction
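
Here is a sketch of what such a handler might look like. The endpoint name, the preprocessing details, and the class-index mapping (Corrosion = 0, based on Keras assigning labels alphabetically) are assumptions:

import json
from io import BytesIO

import boto3
import numpy as np
from PIL import Image

s3 = boto3.client("s3")
runtime = boto3.client("sagemaker-runtime")

ENDPOINT_NAME = "corrosion-detection-endpoint"  # hypothetical endpoint name


def lambda_handler(event, context):
    # 1. Access the image in the given S3 location.
    obj = s3.get_object(Bucket=event["s3_bucket"], Key=event["s3_key"])
    image = Image.open(BytesIO(obj["Body"].read())).convert("RGB")

    # 2. Preprocess the image to match the training pipeline.
    image = image.resize((224, 224))
    payload = (np.array(image) / 255.0)[np.newaxis, ...].tolist()

    # 3. Invoke the SageMaker endpoint and return the prediction.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"instances": payload}),
    )
    score = json.loads(response["Body"].read())["predictions"][0][0]
    label = "Corrosion" if score < 0.5 else "Not Corrosion"

    return {"statusCode": 200, "body": json.dumps({"classification": label})}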

A quick test with a test event using an image in S3 as input confirms that the function is operational. Here is the test image, named “pipe.jpg”.

Test Image (Created by Author)

The image is classified with the following test event:

{
  "s3_bucket": "corrosion-detection-data",
  "s3_key": "images-to-classify/pipe.jpg"
}

As shown below, the image is classified as “Corrosion”.

Test Result (Created by Author)

Building the API

API Gateway (Created by Author)

Creating an API that integrates with the Lambda function increases both the usability and security of the SageMaker model.

In AWS, this can be accomplished by creating a REST API in the API Gateway console:

REST API (Created by Author)

A task like image classification requires a POST request, since users need to send information to the server. Thus, a POST method that integrates the Lambda function is created in the REST API:

Once the method is integrated with the Lambda function, the API can be deployed for use, thereby allowing other applications access to the SageMaker model.

For instance, a cURL command run from the command line can use the API to classify images. The following is the syntax:

curl -X POST <API Gateway Invoke URL> \
  -H "Content-Type: application/json" \
  -d '{
    "s3_bucket": "<S3 Bucket Name>",
    "s3_key": "<S3 Key Name>"
  }'
Code Output (Created by Author)
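
Any application with HTTP access can make the same request. For example, a short Python sketch using a hypothetical invoke URL:

import requests

# Hypothetical invoke URL; replace with the URL shown when the API is deployed.
API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/classify"

response = requests.post(
    API_URL,
    json={
        "s3_bucket": "corrosion-detection-data",
        "s3_key": "images-to-classify/pipe.jpg",
    },
)
print(response.json())  # the "Corrosion" / "Not Corrosion" classification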

The API is now fully operational!

Benefits of the Solution

Utilizing cloud computing services to handle everything from model training to API deployment brings many benefits.

  1. Efficiency

SageMaker enables models to be trained and deployed quickly. Furthermore, API Gateway and Lambda allow users to classify images from a single interface in near real time.

2. Scalability

AWS Lambda and SageMaker both offer the scalability needed to adjust to changing workloads. This ensures that the solution remains operational regardless of the amount of traffic.

3. Security

AWS allows users to create mechanisms such as API keys and rate limits to protect the API (and the underlying model) from malicious actors. This guarantees that only authorized users will be able to access the API.

4. Cost Efficiency

Both Amazon SageMaker and Lambda use pay-as-you-go models, meaning there is no risk of paying for overprovisioned resources. Both services scale with the workload and only charge for the compute power used when processing a request.

Limitations (and Potential Fixes)

Despite its many advantages, this cloud solution has certain shortcomings that could be addressed with some minor changes to the workflow.

  1. Minimal Training Data

The training data is lacking in both quantity and variety. Most pictures are of pipes and corrosion, so it is unclear how the model would classify other objects, such as boilers and turbine blades. To improve the model’s general performance across different use cases, a more extensive data collection effort is required.

2. No Support for Batching

The current approach allows users to classify images one at a time. However, this could become a tedious endeavor as the number of images needing classification rises. Batching would be an appropriate remedy for this issue, offering a simple way to classify multiple images at once.

3. No Real-Time Alerts

Corrosion found in equipment needs to be dealt with as soon as possible. However, the current cloud architecture does not trigger any notifications when corrosion is detected in an image. An SNS topic that pushes messages whenever the model identifies corrosion would help end users address these cases in real time.

Conclusion

Photo by Alexas_Fotos on Unsplash

The combination of SageMaker, Lambda, and API Gateway allows for an efficient, automated, and scalable quality control solution. While the project focused on the classification of corroded equipment, the architecture can be applied to other computer vision use cases.

For access to the code, please check out the GitHub repository:

GitHub - anair123/Corrosion-Detection-With-AWS

Thank you for reading!



What to Expect in Your First 90 Days as A Data Scientist

Coming from a data analyst role, without a PhD or a technical background

Image from Pixabay by Pexels

“Welcome to the team! We’re so excited to have you here!”

I couldn’t believe that I had landed my first role as a Data Scientist.

I don’t have a PhD nor do I have a technical background. Instead, I come from a finance background and had worked in FinTech for several years as a risk analyst.

I felt an equal sense of excitement but also imposter syndrome, even after passing rounds of rigorous technical coding interviews.

Coming from an analyst background, I’ve observed that the biggest differences when working in a data science role include:

  • Managing timelines of data science projects,
  • Communicating with stakeholders on project timelines and implementation,
  • Learning the technical aspects, such as adopting a test-driven development approach to writing production-level code.

The first 3 months of onboarding as a Data Scientist can make or break your experience.

I’m here to walk you through what to expect in your first 90 days on the job to help ease the nerves around the transition and help you build confidence.

Month #1

In my experience, the first month is about:

  • Understanding the organizational structure and team norms,
  • Learning about business priorities and knowing where the biggest impact areas are,
  • Getting yourself up to speed with the tech stack and systems architecture, and setting up your environment.

In the first month, I focused on:

  1. Understanding the business: I strived to understand the biggest challenges the business was facing and how my role aligned with the company’s objectives.
  2. Understanding the data: I reviewed dashboards and familiarized myself with important business metrics and how they’re calculated.
  3. Familiarizing myself with the company’s data infrastructure: I sought to understand how schemas and tables are organized.
  4. Meeting the team: I set up introductory meetings with colleagues I would be working closely with to build rapport and understand their needs.
  5. Discussing expectations with my manager: I worked with my manager to understand and set expectations for my role and work in the first 90 days.

I personally like to do some lightweight work early on, to help navigate my way around the new codebase, build my confidence, and feel more comfortable with committing code.

Some examples of small projects to start with include:

  • Adding a chart to an existing dashboard,
  • Suggesting and calculating a new metric,
  • Adding a simple new feature calculation to the ML repo.

Month #2

The second month is typically where you start to dive into a project.

Scoping out your project

Data Science projects are usually larger in scope and take a longer time to complete than analysis projects. In my opinion, project management skills are often an overlooked yet important area to develop for Data Scientists.

Specifically, I would work with my manager to:

  • Define the project scope,
  • Estimate project milestones and timelines,
  • Gather project requirements and success metrics.

Ironing out the project scope has been crucial to the success of my projects. It’s important to understand:

  • Is this just a model refresh with the same features?
  • If we’re adding in new features, which feature groups should we explore?
  • What quantitative evidence do we have that suggests these feature groups will provide a strong orthogonal signal that isn’t already captured in the current model?

Managing stakeholders

Stakeholders from other teams may enthusiastically suggest feature groups to explore.

It’s important to listen and prioritize their suggestions, but also to understand that feature engineering is a meaty and time-consuming part of the model development process, so be sure to account for it in your project scope.

Since data science projects usually take a month or two to launch, it is important to keep stakeholders updated on the project status. To do so, I like to:

  • Update my JIRA tickets regularly with my progress at least once or twice a week,
  • Document my process and findings along the way, including what worked and didn’t work,
  • Check in early and often with my manager and key stakeholders to align on approach.

Stakeholder management is extremely important for getting buy-in and ensuring you’re on the right track. Bring them along on your journey to keep them engaged and excited about your output!

Month #3

Presenting your work

This is where it all comes together — finalizing and presenting model results to stakeholders.

The purpose of the presentation is to:

  • Show model results with clear visualizations,
  • Quantify the business impact of the model improvement,
  • Get stakeholder buy-in on your work and recommendations with deployment and model usage.

Make sure to address any stakeholder concerns before moving on.

If all goes well, you can move on to the implementation phase after this.

Deploying your model

Every company has their own process, but typically, you would work closely with engineering (data engineers, infrastructure engineers and/or software engineers) on model deployment.

More mature companies tend to have a well-defined process around model deployment, with good documentation that you can follow.

When putting the model in production, I would make sure to:

  • Understand the deployment tools,
  • Include integration tests to make sure everything runs end-to-end,
  • Put the model into shadow mode first to log scores and ensure score distribution is as expected.

After the model is in production, be sure to set up a monitoring dashboard and alerts for feature and/or model score drift. Trust me, this will unfortunately happen sooner or later.

Once you’ve verified the model results in shadow mode, report back to the business team and work with the analysts on recommendations of how to use your model. Your work only makes an impact when your model is being used in production!

Congratulations on coming this far! I always like to check in regularly with my manager and gather feedback both from them and peers to build trust.

Summary

The scope and responsibilities of data science roles can vary at different companies. The process I described is typical of a machine learning-focused Data Scientist or ML engineer role, but the general advice is still applicable around:

  • Project management,
  • Stakeholder management and communication,
  • Understanding and working with different tools.

Are you an analyst looking to transition into data science? Are you feeling trapped in your current role?

Are you overwhelmed by the amount of resources out there and don’t know where to start?

Are you feeling a sense of imposter syndrome because you don’t have a PhD or a technical background?

I’ve created a FREE Five-Day Email Course to jump-start your data science career. I transitioned to data science in 2020 without a technical background, and I want to help others do the same. 🚀



How to Use HyDE for Better LLM RAG Retrieval

Building an advanced local LLM RAG pipeline with hypothetical document embeddings

Implementing HyDE is very simple in Python. Image by the author

Large Language Models (LLMs) can be improved by giving them access to external knowledge through documents.

The basic Retrieval Augmented Generation (RAG) pipeline consists of a user query, an embedding model that converts text into embeddings (high-dimensional numerical vectors), a retrieval step that searches for documents similar to the user query in the embedding space, and a generator LLM that uses the retrieved documents to generate an answer [1].

In practice, the RAG retrieval part is crucial. If the retriever does not find the correct document in the document corpus, the LLM has no chance to generate a solid answer.

A problem in the retrieval step can be that the user query is a very short question — with imperfect grammar, spelling, and punctuation — and the corresponding document is a long passage of well-written text that contains the information we want.

A query and the corresponding passage from the MS MARCO dataset: the user query reads “was ronald reagan a democrat”, whereas the document is a long, well-written text from Wikipedia, yet both go into the embedding model to compute embeddings. This illustrates that query and document typically have different lengths and formats. Image by the author

HyDE is a proposed technique to improve the RAG retrieval step by converting the user question into a hypothetical document.

In this article, you will learn about the HyDE technique and how and when to use it to improve your own RAG pipeline.

Table Of Contents

· HyDE Retrieval
Contriever
When to Use HyDE
· Implementing HyDE
· Is Implementing HyDE Worth It?
· Conclusion
· References

HyDE Retrieval

Hypothetical Document Embeddings (HyDE) were first proposed in the paper “Precise Zero-Shot Dense Retrieval without Relevance Labels” in 2022 [2].

The goal of HyDE is to transform the user query into a “document” so that the retriever has an easier task.

An illustration of the HyDE model from [2], showing the instruction, query, generated document, contriever, and real documents: a pre-trained LLM transforms the user query into a hypothetical fake document, and the retriever then uses the fake document to search for similar real documents in the knowledge database.

HyDE uses an off-the-shelf LLM (e.g. ChatGPT, Llama, etc.) with a simple instruction — like ”write a document that answers the question” — to convert the user query into a generated fake document. This transformation of a short user question into a longer hypothetical text passage is the central idea of HyDE.

This generated fake document will most likely contain hallucinated numbers and false statements.

However, this does not matter because the fake document is encoded into an embedding vector by the encoder model and used for semantic similarity search.

According to the HyDE paper, the encoding model acts as a lossy compressor that filters out the hallucinated details of the generated fake document. This leaves us with a vector embedding that should be very similar to the embeddings of our corpus of real documents.

Finally, the contriever uses the generated fake documents to search for the closest real documents in the document embedding space. This is usually done via dot product or cosine similarity.

In summary, instead of performing a similarity search in the query — document embedding space, HyDE performs a similarity search in the (hypothetical) document — (real) document embedding space.

Contriever

What is a contriever and why does HyDE use one?

The HyDE paper is strongly motivated by the fact that there is not always a large enough dataset available to train a retriever for query-document similarity search.

A contriever is a retriever (embedding model) trained with contrastive learning. Contrastive learning is a form of self-supervised learning where no labels are required for the training dataset [3].

This is particularly useful when large amounts of labeled data are not available, such as when trying to train a retriever in a language other than English.

An embedding model trained with contrastive learning tries to distinguish between semantically similar text (high score) and semantically dissimilar text (low score).

During the contrastive training process, text pairs are selected either from the same document (positive pair) or from different documents (negative pair). The retriever is then trained to distinguish between positive and negative document pairs.

Training a retriever using contrastive learning: a high score is assigned to the positive document k+, whose text is taken from the same document as the query q, while low scores are assigned to negative documents sampled from different documents. Image by the author
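
To make the training objective concrete, here is a minimal sketch of an InfoNCE-style contrastive loss. The temperature and the way negatives are passed in are illustrative choices; the actual Contriever recipe differs in its sampling and batching details:

import torch
import torch.nn.functional as F


def contrastive_loss(query_emb, positive_emb, negative_embs, temperature=0.05):
    # L2-normalize so the dot product equals cosine similarity.
    query_emb = F.normalize(query_emb, dim=-1)          # shape (d,)
    positive_emb = F.normalize(positive_emb, dim=-1)    # shape (d,)
    negative_embs = F.normalize(negative_embs, dim=-1)  # shape (n_neg, d)

    pos_score = query_emb @ positive_emb    # similarity to the positive document
    neg_scores = negative_embs @ query_emb  # similarities to the negative documents

    logits = torch.cat([pos_score.unsqueeze(0), neg_scores]) / temperature
    # Cross-entropy with target index 0 pushes the positive score up
    # and the negative scores down.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))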

The trained contriever can then be used as is or it can be used as a pre-trained model for further fine-tuning with labeled data.

The contriever is trained in a self-supervised manner by searching for similarities between documents, i.e. no labeled data is required. And the HyDE instruction transforms user questions into this document space by creating fake documents.

When to Use HyDE

Whether HyDE can improve RAG retrieval depends critically on your choice of embedding model.

A popular general-purpose and free-to-use encoder model is the all-MiniLM-L12-v2 model from the sentence-transformers package, hosted on Hugging Face.

On the model’s Hugging Face model card, we can read the following about the model’s background:

The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised contrastive learning objective. We used the pretrained microsoft/MiniLM-L12-H384-uncased model and fine-tuned in on a 1B sentence pairs dataset.

This means that this encoder model is exactly what HyDE was made for: it was trained without labels using self-supervised contrastive learning on document-document data pairs.

So, HyDE should be able to improve retrieval performance for this embedding model!

On the other hand, you do not need to use HyDE if your encoder model has been specifically trained in a supervised manner for semantic search — especially asymmetric semantic search.

Asymmetric semantic search means that you have a short question and you are looking for a longer paragraph to answer that question — exactly what RAG is typically used for.

A popular training dataset for this type of encoder model is the MS MARCO dataset, which was originally a question-answering dataset containing real Bing questions and human-generated answers.

Encoder models from the Sentence Transformers library, such as the “msmarco-*” models and the “multi-qa-*” models, are already trained on labeled question-document data and therefore should (in theory) not benefit from using HyDE.

For most commercial embedding models, such as the text-embedding models from OpenAI, we don’t know how they were trained, so HyDE may or may not improve retrieval performance.

Implementing HyDE

Let’s implement a basic version of HyDE in Python.

We start with a simple LLM class that initializes a local Qwen2.5-0.5B-Instruct model. The model is small enough that it can also run on the CPU if no GPU is available on your local machine.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


class LLM:
    def __init__(
        self,
        model_name="Qwen/Qwen2.5-0.5B-Instruct",
    ):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype="auto",
        ).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def generate(self, prompt, temperature=0.7, max_new_tokens=256):
        messages = [{"role": "user", "content": prompt}]
        text = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.device)

        generated_ids = self.model.generate(
            **model_inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
        )
        generated_ids = [
            output_ids[len(input_ids):]
            for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        return self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

Next, we need an encoder model to compute sentence embeddings. To get a local all-MiniLM-L12-v2 contriever model, we only need a few lines of code using the sentence_transformers library:

from sentence_transformers import SentenceTransformer

encoder_model = SentenceTransformer("all-MiniLM-L12-v2", device="cpu")

With these two ingredients, we can already compute hypothetical document encodings:

qwen = LLM()
question = "was ronald reagon a democrat?"
hypothetical_document = qwen.generate(
    f"Write a paragraph that answers the question. Question: {question}"
)

>> print(hypothetical_document)

Printing our hypothetical document gives us the following paragraph:

Ronald Reagan, born on November 6, 1924, in Grand Prairie, Texas, was a renowned American politician who served as the 36th President of the United States from 1981 to 1989. As the first Republican candidate for president and the first sitting president since Richard Nixon’s resignation, Reagan faced significant challenges during his tenure.
Reagan’s political career began with a successful run for office against the Democratic nominee Hubert Humphrey in the 1968 presidential election. However, he quickly became disillusioned with the Democratic Party’s stance on civil rights and social justice

At first glance, this looks like it could be something from Wikipedia. However, it contains many hallucinated facts. But since this is not a real document, these errors are acceptable.

Next, we can get a real passage of text from Wikipedia and compute the embeddings of the question, the Wikipedia document, and the hypothetical document.

wikipedia = """Ronald Wilson Reagan[a] (February 6, 1911 – June 5, 2004) was an American politician and actor who served as the 40th president of the United States from 1981 to 1989. 
A member of the Republican Party, he became an important figure in the American conservative movement, and his presidency is known as the Reagan era. """

hypothetical_document_embedding = encoder_model.encode(hypothetical_document)
question_embedding = encoder_model.encode(question)
wikipedia_embedding = encoder_model.encode(wikipedia)

Now, we can check if the hypothetical document embedding is actually closer to the real document embedding than the question embedding.

We can use the similarity function from the encoder model, which uses the cosine similarity measure under the hood.

The cosine similarity measure goes from -1 to +1, where -1 means that the embedding vectors point in opposite directions, 0 means that they are exactly perpendicular, and +1 means that they point in the same direction.

>> print(encoder_model.similarity(hypothetical_document_embedding, wikipedia_embedding))
>> tensor([[0.8039]])

>> print(encoder_model.similarity(question_embedding, wikipedia_embedding))
>> tensor([[0.4566]])

As we can see, the hypothetical document embedding is much closer to the real document embedding in our embedding space. Thus, HyDE has successfully reduced the domain gap between question and document.

Visualization of HyDE in a 2D embedding space: the hypothetical document embedding lies much closer to the real document embedding than the question embedding does, so the similarity search happens in the (fake) document — (real) document embedding space. Image by the author

However, it also took some additional computation to generate the hypothetical document using our LLM. This is the disadvantage of using HyDE.

Is Implementing HyDE Worth It?

A recent study called “Searching for Best Practices in Retrieval-Augmented Generation” [4] looked at different retrieval methods for RAG. The study found that HyDE improved the retrieval performance compared to the baseline embedding model.

Furthermore, the combination of hybrid search with HyDE produced the best overall results.

Interestingly, they also found that concatenating the original query with the hypothetical document produced even better results.
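
Reusing the variables from the snippets above, one simple way to try this variant is to embed the query concatenated with the hypothetical document (the exact formatting used in the study may differ):

# Encode the original query together with the hypothetical document.
combined_text = question + " " + hypothetical_document
combined_embedding = encoder_model.encode(combined_text)

print(encoder_model.similarity(combined_embedding, wikipedia_embedding))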

On the other hand, HyDE increases latency and cost by requiring additional LLM calls to transform each query into a fake document.

Considering the best performance and tolerated latency, we recommend Hybrid Search with HyDE as the default retrieval method. Taking efficiency into consideration, Hybrid Search combines sparse retrieval (BM25) and dense retrieval (Original embedding) and achieves notable performance with relatively low latency [4]

Conclusion

HyDE is an advanced technique for improving the retrieval part of a RAG pipeline.

By creating hypothetical fake documents from a query, we can perform similarity search in the document-document embedding space, instead of the question-document embedding space.

HyDE has been proposed for a use case where the embedding model is not already fine-tuned for semantic search with labeled question-document data.

Since HyDE requires only a few additional LLM calls, it is very easy to implement.

So give it a try and see if HyDE can improve your RAG retrieval.

HyDE is a building block in your toolkit that can be combined with other advanced RAG techniques, such as hybrid search and using a reranker after the retrieval part.

References

[1] P. Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2021), arXiv:2005.11401

[2] L. Gao, X. Ma, J. Lin, J. Callan, Precise Zero-Shot Dense Retrieval without Relevance Labels (2022), arXiv:2212.10496

[3] G. Izacard et al., Unsupervised Dense Information Retrieval with Contrastive Learning (2022), Transactions on Machine Learning Research (08/2022)

[4] X. Wang et al., Searching for Best Practices in Retrieval-Augmented Generation (2024), arXiv:2407.01219

Read more articles in my series on how to improve RAG retrieval performance



Running Visibility Analysis in QGIS

Learn how to easily run visibility analysis with free GIS software and data, and use the results to create stunning visualizations

One of my all-time favourite types of spatial analysis is visibility analysis. It’s a really simple concept, allowing you to work out — theoretically — where something can be seen from. There are two main forms of visibility analysis — also known as viewshed analysis or ZTVs (Zones of Theoretical Visibility). These are:

  • The “standard” viewshed: where can I see from this location? E.g. if I’m stood on top of a mountain, what will my view be?
  • The “reverse” viewshed: from where can a location be seen? E.g. if I’m stood on top of a mountain, who can see me?

It works by taking a set of observer points and a relief model, then calculating the line of sight between every part of that relief model and the observer points. Sounds like it would be a slow process, right? It can be. But the results are amazing, and indispensable for some applications.

One of the most common applications is to help people to understand where a new development — such as a wind farm, solar panel site or highway — will be visible from. Analysts can then model different scenarios based on their findings, such as what happens to visibility if they plant a line of trees in this exact location? Visibility analysis can also be used to drive pricing strategies for billboards or help sell penthouse apartments.

In this article, I’ll be sharing how you can easily run visibility analysis with QGIS — an open-source, free GIS software. But first, here are some examples of the end result…

Source: created by the author
Source: created by the author

Want to try?

You will need…

Software:

  • QGIS downloaded and installed, which you can do from here
  • The QGIS plugin Visibility Analysis installed. To do this, open QGIS and go to the Plugins drop-down. Select Manage and Install plugins… and search for Visibility Analysis, then click Install.

Data:

  • A Digital Surface Model — an elevation model which includes surface features like buildings and trees. You can find some great free resources for this type of data in my guide to free GIS data sources. You should choose a model which is relevant for both the extent you’ll be looking at and the level of detail and accuracy you require.
  • A point layer which you want to assess the visibility of.

In this tutorial, we’ll be evaluating where you can see London’s Houses of Parliament from. In the first section, we’ll be sourcing the two above datasets for this analysis — so if you already have your data sourced, keep reading!

Section 1: Data sourcing

#1 Digital Surface Model

Our analysis will be a super localized look at views of the Houses of Parliament, so we want our data to be as detailed as possible — and that means LiDAR data! LiDAR — which stands for Light Detection And Ranging (love a really forced acronym!) — is hyper-local, hyper-accurate 3D data — which is often hyper-expensive as a result. However, there are some great free sources available, which I’ve detailed in this blog. In England, the Environment Agency publishes free LiDAR data for about 60% of the country. This is typically in areas which are flood-prone, which luckily (well, I suppose not) our study area is.

To source the data:

  1. Head to the Defra Survey Data Download platform. This data is made available via the Open Government Licence. This data can be freely used for any purpose (including commercial) with the license: © Environment Agency copyright and/or database right 2024. All rights reserved.
  2. Draw your area of interest — make sure you draw a bigger area than you think you need.
  3. Select Get available tiles. Choose the product to download — we want LiDAR Composite First Return DSM, which will include the canopy height of vegetation (as opposed to last return, which will have the underlying terrain height). You can also select the year and resolution — generally the coarser the resolution, the better the coverage; I’m using a 2m resolution.
  4. Download all tiles, and unzip them all in your downloads folder.
  5. Head to QGIS. Click Add Raster Layer and add all of your raster files to the project.
  6. If you have any small gaps in your data, you can estimate these by using the Fill nodata… tool in the Raster menu under Analysis.
  7. Now, we need to merge our tiles into one large file. First, we need to know the type of raster data we are using; double click on any of the layers in the Layer panel to bring up its properties. In the Information tab under Information from Provider, note the data type; mine is Float32.
A screenshot of QGIS
Source: created by author

8. Now, open the Merge… tool from the Raster menu under Miscellaneous. Select all of the raster tiles to merge, ensure the output data type is correct for your inputs (mine is Float32) and set the output path for your merged raster — and run!

Here’s our merged layer! I’ve changed the symbology to hillshade to understand more about the shape of the data.

A screenshot of QGIS
Source: created by author

#2 Visibility points

Now I need a point grid covering the Houses of Parliament, which I’ll run the visibility analysis for.

  1. First, I need a polygon to build my point grid from. I’ve taken the Houses of Parliament building outline from Ordnance Survey’s Zoomstack, but you can use anything — or build your own.
  2. Now head to Vector > Research > Create Grid, setting the grid extent to your polygon layer. The more detailed your grid resolution, the more accurate and detailed your visibility analysis will be — but there will be a trade-off with processing time. I’m going to use a resolution of 5m. Run!
  3. Now, delete all of the points which do not fall inside your polygon. You can do this manually, with a Select by Location or just by using the Clip tool.
A screenshot of QGIS
Source: created by author
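
If you prefer to script the grid and clip steps from the QGIS Python console, something like the sketch below should work. The algorithm IDs, parameter names, and the layer name are assumptions for a recent QGIS 3.x release, so run the tools once in the GUI and check Processing > History to confirm the exact calls on your install.

from qgis.core import QgsProject
import processing

# Hypothetical name of the building-outline polygon layer in your project.
outline = QgsProject.instance().mapLayersByName("parliament_outline")[0]

ext = outline.extent()
extent_str = f"{ext.xMinimum()},{ext.xMaximum()},{ext.yMinimum()},{ext.yMaximum()} [{outline.crs().authid()}]"

# Create a 5 m point grid over the polygon's bounding box.
grid = processing.run("native:creategrid", {
    "TYPE": 0,  # point grid
    "EXTENT": extent_str,
    "HSPACING": 5,
    "VSPACING": 5,
    "CRS": outline.crs(),
    "OUTPUT": "TEMPORARY_OUTPUT",
})["OUTPUT"]

# Keep only the points that fall inside the building outline.
clipped = processing.run("native:clip", {
    "INPUT": grid,
    "OVERLAY": outline,
    "OUTPUT": "TEMPORARY_OUTPUT",
})["OUTPUT"]

QgsProject.instance().addMapLayer(clipped)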

And that’s our two layers sorted — now on to the visibility analysis!

Section 2: Visibility analysis

  1. First, we need to create some parameters for our analysis. Open the Create Viewpoints tool. You can just search for this in the Processing Toolbox.
  2. Set the Observer locations as your clipped points and your digital elevation model as the merged raster layer.
  3. The default radius is set to 5000 metres which is fine for this use case, but you may wish to change it based on your requirements.
  4. What we’re running here is a reverse viewshed; we don’t want to know what can be seen from the points, but where can see the points. So, change the Observer height to 0, and the Target height to 1.6 (this is the standard “human eye” height used in visibility analysis). Run this — if you open the attribute table of the layer that has been created, you will see the parameters you have set have been added as attributes.
  5. From the Processing Toolbox, open the Viewshed tool which is where we’ll be running our visibility analysis from! Set the analysis type as binary, the Observer locations to those we created in step 4, the Digital elevation model to our merged raster, and make sure the “Combining multiple outputs” parameter is set to addition.
  6. Run! This can be quite a slow process depending on your computer, the number of points, and the DSM extent and resolution — so go and pop the kettle on.

And when it’s finished, it should look a little something like…

A screenshot of QGIS
Source: created by author

And that’s our visibility analysis done! You can see the lighter areas are where more of the Houses of Parliament are visible from, and the darker areas are where it can’t be seen at all. I love the way the source of the analysis is rendered like a light source, casting shadows where it can’t be seen.

One final step you may want to take here is to remove surface features from the viewshed results. As you can see from this analysis, some of the visible locations are actually on top of nearby buildings or trees — where it’s unlikely someone is going to be stood. I’m skipping this step because I know Central London consists of a lot of rooftop gardens and penthouse views, so this isn’t so much of a false positive in this area.

To do this you would need to:

  1. Download a Digital Terrain Model (DTM) for your study area — this is the equivalent of your DSM, but with all surface features (e.g. buildings, trees) removed.
  2. Use the expression below in the Raster Calculator. This calculates the difference between the DSM and DTM to obtain only the surface feature heights; then, wherever the surface feature height exceeds zero, the value of the viewshed is set to 0.
(@viewshed) * ((@dsm - @dtm) <= 0)

Finally, the output!

Section 3: The final output

Depending on your use case, there are lot of different ways of visualizing this output.

My use case is just “I want to make a damn sexy map” and — honestly — the results of the analysis have this kind of spooky, shadowy effect that for me does 90% of the job.

All I’m going to do is:

  1. Layer my original DSM on top of the viewshed, set the symbology type to hillshade and the blending mode to multiply. This gives quite a subtle contouring (the make-up style, not the GIS line type!) to the results which helps give the analysis some shape and context. You can see the difference below without hillshade (left) and with (right).
Two maps side by side comparing styling
Source: created by author

2. I’ve added a “firefly” effect to the footprint of the Houses of Parliament. This type of cartographic style really helps features stand out against a darker basemap. Is it wildly misleading to show the UK’s governmental buildings in a fun neon vibe? Yes. Does it look good so I’m going to do it anyway? Also yes. You can learn how to use firefly style cartography in QGIS here.

3. Finally, I’ve just added on a few contextual labels and a minimalist legend. I really want the analysis to be the star of the show here, so I’ve kept everything super subtle and light touch.

And here’s the result…

Source: created by the author

Conclusion

I hope you enjoyed this tutorial! Whether you’re using it as part of a visual impacts assessment, environmental study or just — like me — to make some really funky maps, hopefully you’ve seen how easy it is to run such a powerful form of analysis — with entirely free and open tools and software! The biggest challenge often isn’t the analysis itself — which is fairly straightforward — but learning to balance detail and extent with processing time, and that will come as you familiarise yourself with the process.

If you’ve followed along, please do find me on X and LinkedIn and share what you’ve produced — I’d love to see!

