Design and implementation

To address the issue of generating walkability-friendly and user-customizable pedestrian routes, our approach is divided into four parts: (1) data aggregation, conflation, and pre-processing, (2) the development of a specialized fine-tuning pipeline for sentence embedders, leveraging contrastive learning to learn representations of generally walkable (and unwalkable) place descriptions, (3) inference of point-wise scores based on “general walkability” and preference-specific criteria from generated comprehensive embedding sets, and (4) integration of the point-wise scores in an A*-based path-finding algorithm.

Data Preparation

As already discussed, we concluded that the Foursquare and Overture Maps datasets suffered from various insufficiencies. In the context of our work, both exhibited low temporal accuracy and focused on a relatively narrow selection of geospatial features with normalized but limited descriptions. Furthermore (in contrast to OSM), the feasibility of efficiently aggregating additional information from external sources was minimal for both datasets, as they only ever referenced private websites or social media profiles. Consequently, OSM was chosen to constitute the skeleton of our knowledge base.

OSM Pre-Processing

| Feature type | Quantity (in thousands) | With Wikidata reference |
|---|---|---|
| Ways | 19.1 | 362 |
| Segmented ways | 38.6 | 362 |
| Nodes | 34.6 | 1086 |
| Buildings | 35.9 | 133 |
| Outdoor areas | 2.3 | 35 |

Summary of extracted OSM feature counts for Cambridge, UK.

To construct a robust knowledge base from OSM and to minimize the risk of losing potentially useful information or data points, we chose to manually implement our own filters and process raw OSM data (instead of relying on existing third-party post-processed datasets or APIs).

The segment network used in our work was created from segmented OSM “ways”, where each segment is delimited at both ends either by a junction with another segment or by an isolated end. In the particular case of Cambridge, OSM holds all kinds of transportation segments, from highways to unofficial “desire paths”. Next, all nodes, as well as recorded buildings, were extracted and stored; however, for both of these feature types, only the entries with some informative descriptions were kept. Lastly, relevant outdoor areas, such as playgrounds, water bodies, or parks, were extracted. Where appropriate, these areas were conflated, since raw OSM data sometimes suffers from redundant or fragmented area entries. Furthermore, for all OSM buildings, ways, and nodes, a written English description from Wikidata was scraped and appended to the database whenever available. In the context of our model, and similarly to some user-uploaded text descriptions of nodes in OSM, Wikidata’s descriptions are irregular, varying in length and informative value. Therefore, the scraped descriptions were cleaned of, for example, unwanted geographical names (since those were expected to provide little benefit later on) and shortened where appropriate. The resulting quantities for each of these feature types are shown in the table above.
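
To make the pre-processing concrete, the sketch below shows how such a raw-OSM extraction pass could be implemented with the pyosmium library. The file name, tag filters, and the notion of an “informative” description are illustrative placeholders rather than the exact rules used in our pipeline, and the subsequent segmentation of ways at junctions is omitted.

```python
# Minimal sketch of a raw-OSM extraction pass (illustrative tag filters, not the exact ones used).
import osmium


class FeatureExtractor(osmium.SimpleHandler):
    def __init__(self):
        super().__init__()
        self.ways, self.nodes, self.buildings = [], [], []

    def way(self, w):
        tags = {t.k: t.v for t in w.tags}
        if "highway" in tags:
            # transportation segments of any kind, from highways to desire paths
            self.ways.append({"id": w.id, "tags": tags,
                              "node_refs": [n.ref for n in w.nodes]})
        elif "building" in tags and any(k in tags for k in ("name", "amenity", "shop", "wikidata")):
            # keep only buildings that carry some informative description
            self.buildings.append({"id": w.id, "tags": tags})

    def node(self, n):
        tags = {t.k: t.v for t in n.tags}
        if any(k in tags for k in ("name", "amenity", "shop", "tourism", "wikidata")):
            # keep only nodes that carry some informative description
            self.nodes.append({"id": n.id, "lat": n.location.lat,
                               "lon": n.location.lon, "tags": tags})


extractor = FeatureExtractor()
extractor.apply_file("cambridge.osm.pbf", locations=True)
```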

Tree Dataset

Since greenery can play a vital role in a data-driven inference of walkability, particularly in the geographical regions we were interested in (the UK), accurate estimates of the locations and quantities of trees are highly valuable. Although trees (and other greenery) are a common node type in OSM data, their representation falls well short of reality: within the boundaries of Cambridge, OSM tracks fewer than 3.5 thousand trees. In contrast, the specialized tree datasets (introduced earlier) offer a more comprehensive and reliable source of tree-related data. Therefore, the VOM data was leveraged. Specifically, this project relies on a processed version of the VOM raster, after a tree segmentation completed with the lidR package (Roussel, Goodbody, and Tompalski 2025). This version of the dataset was kindly provided by Andrés Camilo Zúñiga-González (an AI4ER Ph.D. student at the University of Cambridge) (Zúñiga González 2025) and served as the sole source of tree records for this project; tree entries from OSM were therefore ignored. Within the boundaries of Cambridge, the segmented VOM supplied over 102 thousand trees.

Open Greenspace Dataset

The final supplementary dataset used was the “Greenspace Dataset”. As it narrowly specializes in public green spaces (such as public parks or playgrounds) and, for Cambridge, includes only 398 entries, it was used merely to enhance the spatial accuracy of, and fill gaps in, the OSM data. To this end, the Greenspace Dataset and OSM areas were iteratively matched and merged on descriptions and spatial parameters, and stored in one database.
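
A simplified sketch of the spatial part of this conflation, using geopandas, is shown below; the file names are placeholders and the description matching is omitted for brevity.

```python
# Illustrative sketch of merging Greenspace entries into the OSM area table (spatial matching only).
import pandas as pd
import geopandas as gpd

osm_areas = gpd.read_file("osm_outdoor_areas.gpkg")
greenspace = gpd.read_file("greenspace_cambridge.gpkg").to_crs(osm_areas.crs)

# Pair every Greenspace polygon with any OSM area it intersects.
joined = gpd.sjoin(greenspace, osm_areas, how="left", predicate="intersects")

# Greenspace entries without an OSM counterpart fill gaps in the OSM data;
# the remaining entries merely refine areas that OSM already records.
unmatched_idx = joined.index[joined["index_right"].isna()].unique()
new_areas = greenspace.loc[unmatched_idx]
combined = gpd.GeoDataFrame(pd.concat([osm_areas, new_areas], ignore_index=True),
                            crs=osm_areas.crs)
```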

Point-Geospatial Context Dataset

This aggregated knowledge base was used to create the final point-to-geospatial-context mappings. First, a set of points was sampled from each of the segments at 10-meter intervals. For each of these points, all entities within a pre-defined buffer zone were recorded. These buffer zones were set to a 40-meter radius for buildings and a 30-meter radius for all other feature types. Furthermore, each of these segment points was also mapped to any outdoor areas it intersected.
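
The sketch below illustrates the sampling and buffer queries with geopandas; the layer and column names are placeholders, and a projected (metric) CRS is assumed so that the radii are in meters.

```python
# Illustrative sketch of the 10 m point sampling and buffer-zone queries (placeholder layer names).
import geopandas as gpd

segments = gpd.read_file("segments.gpkg")    # projected CRS in meters assumed
buildings = gpd.read_file("buildings.gpkg")
nodes = gpd.read_file("nodes.gpkg")
areas = gpd.read_file("outdoor_areas.gpkg")

records = []
for seg_id, geom in zip(segments["id"], segments.geometry):
    for d in range(0, int(geom.length) + 1, 10):      # a point every 10 meters
        p = geom.interpolate(d)
        records.append({
            "segment": seg_id,
            "point": p,
            "buildings": list(buildings.index[buildings.intersects(p.buffer(40))]),
            "nodes": list(nodes.index[nodes.intersects(p.buffer(30))]),
            "areas": list(areas.index[areas.intersects(p)]),    # areas the point lies in
        })
```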

Given a specific point on a segment, these mappings were then used to retrieve text descriptions of the features from the parsed datasets. For each data type (such as nodes or areas), a priority mechanism selected the most desirable attributes (such as building or business type, or Wikidata description). The entity descriptions were then compiled into sentence descriptions. While the exact structure of the sentence description was also subject to much experimentation (partly because some sentence encoders are better suited to specific structures), the eventual structure of the descriptions introduced the different feature types in order, transitioned between these types with consistent connector phrases, and represented missing entities of a given feature type with “nothing”. Specifically, the default descriptions followed this format:

[segment infrastructure description];
        IN AREAS: [list of areas];
        NEARBY: [list of nearby nodes and buildings].
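
As a minimal sketch (with invented example inputs), compiling such a description can be as simple as:

```python
def compile_description(segment_desc, areas, nearby):
    # Missing entities of a feature type are represented with "nothing".
    areas_part = ", ".join(areas) if areas else "nothing"
    nearby_part = ", ".join(nearby) if nearby else "nothing"
    return f"{segment_desc}; IN AREAS: {areas_part}; NEARBY: {nearby_part}."

# e.g. compile_description("paved footpath with street lighting",
#                          ["public park"], ["cafe", "playground"])
```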

Encoder Fine-Tuning

To produce representations from the assembled dataset of point-to-description mappings, we used pre-trained sentence encoders. While their ability to make semantic associations was the key reason for choosing them, these models first had to be lightly re-focused towards representing our specific descriptions. This was achieved through a contrastive fine-tuning process.

Fine-Tuning Dataset

To create a dataset for the encoder fine-tuning, a set of compiled place descriptions was encoded with an off-the-shelf encoder (specifically, with the “all-MiniLM-L6-v2” from the “sentence-transformers” library (Reimers and Gurevych 2019)). Afterwards, 12,500 unique data points were selected based on their respective embeddings with furthest-point sampling to maximize the degree of diversity within the dataset.
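
A sketch of this step is shown below; the greedy furthest-point selection and the starting index are illustrative choices, and `descriptions` stands for the list of compiled place descriptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = np.asarray(model.encode(descriptions, normalize_embeddings=True))

def furthest_point_sampling(X, k):
    selected = [0]                                  # arbitrary starting point
    dists = np.linalg.norm(X - X[0], axis=1)        # distance to the selected set
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                 # point furthest from all selected points
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return selected

subset_idx = furthest_point_sampling(embeddings, 12_500)
```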

These points were then scored and labeled on the basis of walkability with the Mistral 7B language model (Jiang et al. 2023). The language model was prompted to assign a numerical score on a scale of zero to ten, where zero stood for the least walkable descriptions (such as descriptions of points on highways) and ten for the most walkable descriptions (such as descriptions from park footpaths). The prompt used for this purpose drew on the concepts of walkability summarized earlier, particularly the work of Alfonzo (Alfonzo 2005).
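
The sketch below illustrates such a labelling loop with an instruction-tuned Mistral 7B checkpoint via the transformers library; the model identifier and, in particular, the prompt are simplified stand-ins for the ones actually used.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

# Simplified stand-in prompt; the real prompt summarized the walkability concepts
# discussed earlier (Alfonzo 2005).
PROMPT = ("On a scale from 0 (least walkable, e.g. a point on a highway) to 10 "
          "(most walkable, e.g. a park footpath), rate the walkability of the "
          "following place description. Answer with a single number.\n\n{desc}")

def label_description(desc):
    out = generator(PROMPT.format(desc=desc), max_new_tokens=4,
                    do_sample=False, return_full_text=False)
    digits = "".join(ch for ch in out[0]["generated_text"] if ch.isdigit())
    return int(digits) if digits else None
```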

Embedding Architecture

There is a plethora of pre-trained, publicly available sentence encoders, many of which advertise a similarly broad versatility across information retrieval, sentence similarity, and clustering tasks. Hence, the selection of the most suitable encoder models was a highly iterative process. Moreover, the strategy for employing these encoder models was also initially unclear, and two main options were considered.

The first option was to encompass all of the desired information for a given point in a single sentence and then use a single encoder to generate the point embeddings. This approach offered much simplicity, but posed the risk of relying too heavily on the encoder model’s ability to extract and represent all of the important features. Moreover, this approach was less flexible for potential future implementations where, for instance, not all features should be used to generate embeddings.

The second option was to encode each feature or section of the description individually, potentially with different encoder models, and later compose these embeddings into a single vector. A similar approach is developed in, for instance, the aforementioned work by Tempelmeier et al. (Tempelmeier, Gottschalk, and Demidova 2021). Several implementations of this approach were tested, none with satisfying results. In some of the attempts, the set of embeddings of individual features of a given point was composed by simply averaging those feature embeddings. Alternatively, the composed vector was generated via a fusion component, which was also trained during the fine-tuning phase.

Nonetheless, none of the attempts to compose embeddings of individual features into a single vector proved useful. The models were prone to over-clustering (pulling samples with the same label too close together) during the contrastive fine-tuning phase, and generally failed to retain the original off-the-shelf models’ ability to later make relevant semantic associations.

Hence, this work relies on a single encoder architecture, processing descriptions composed of singular sentences. Furthermore, the fine-tuning of the sentence encoders was done via LoRA adapters. The adapters were injected into each of the pre-trained models, and while the models’ weights remained frozen during the fine-tuning, the adapters’ weights adjusted to the contrastive objective.
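
A minimal sketch of this adapter injection, using the peft library on top of a sentence-transformers model, is shown below; the adapter rank, scaling, and target modules are illustrative assumptions.

```python
from peft import LoraConfig, get_peft_model
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                      target_modules=["query", "value"])

# Wrap the underlying transformer: the pre-trained weights stay frozen,
# and only the low-rank adapter matrices receive gradients during fine-tuning.
encoder[0].auto_model = get_peft_model(encoder[0].auto_model, lora_cfg)
```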

Contrastive Fine-Tuning

With the LLM-labeled dataset, sentence encoders were fine-tuned using a Triplet Loss-based strategy. This strategy was implemented by splitting the training examples into a positive and a negative bin. The threshold for the positive bin was an LLM-assigned score higher than or equal to seven; for the negative bin, the scores were lower than or equal to three. In order to create a clear contrast between the “walkable” and the “unwalkable”, data points that fell into neither of the two bins were discarded. After this binning, the positive bin contained 5390 examples and the negative bin 1060. This disparity between the sizes of the two bins was most likely caused by the fact that points with low walkability scores were frequently associated with fewer features (e.g., high-speed roads in urban outskirts), whereas highly walkable places were more commonly surrounded by heterogeneous elements (e.g., paths surrounded by amenities or places). Hence, there were fewer unique points with poor walkability than unique points with high walkability.

During training, and due to the contrasting cardinalities of the two bins, the dataloader sampled the positive and negative examples randomly for each iterated anchor. Furthermore, every time an example data point was used, its list of associated areas and of nearby nodes and buildings was first randomly shuffled to build a degree of permutation invariance into the encoder.
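
The sketch below outlines the triplet construction and training loop with the sentence-transformers fit API; `positive_bin` and `negative_bin` are the description bins introduced above, `encoder` is the LoRA-extended model, and the triplet count and hyper-parameters are illustrative. For simplicity, this sketch draws anchors from the positive bin only and omits the list shuffling described above.

```python
import random
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

def make_triplets(positive_bin, negative_bin, n_triplets):
    triplets = []
    for _ in range(n_triplets):
        anchor, positive = random.sample(positive_bin, 2)   # two distinct walkable examples
        negative = random.choice(negative_bin)              # one unwalkable example
        triplets.append(InputExample(texts=[anchor, positive, negative]))
    return triplets

train_loader = DataLoader(make_triplets(positive_bin, negative_bin, 5000),
                          shuffle=True, batch_size=32)
train_loss = losses.TripletLoss(model=encoder)

encoder.fit(train_objectives=[(train_loader, train_loss)], epochs=3, warmup_steps=100)
```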

Extended with the LoRA adapters, the models adjusted to the fine-tuning objective after only a few epochs and required only minimal training durations. Although no model was fine-tuned for more than fifteen epochs, generally only models trained for fewer than five epochs proved useful. Unsurprisingly, due to the contrastive objective and the crudeness of the data bins, preventing over-clustering was essential: while thoroughly fine-tuned encoders successfully classified examples as walkable or unwalkable in downstream tasks, their representations became dominated by this distinction and neglected other features present in the examples.

Urban Embeddings and Scoring

Leveraging the ability of sentence encoders to independently project individual examples into the embedding space, we developed an anchor-based method for the generation of absolute walkability scores. Furthermore, because of the use of anchors and the encoder’s ability to highlight semantic associations, we were able to further readjust the scoring pipeline and generate not only general walkability scores but also scores reflective of more specific pedestrian preferences.

Walkability Scoring

Although simple distance metrics, such as cosine similarity, are frequently used for tasks such as embedding-based retrieval, their outputs reflect relative relationships only within the considered set of examples. For instance, if plain cosine similarity were used to infer walkability indices in a specific area, the obtained “scores” would imply walkability only relative to the other points in the sample, and not relative to any general expectations regarding walkability.

Therefore, we used an anchor-based linear scaling approach to establish these expectations. The approach considers three anchor vectors: a negative anchor (representing highly unwalkable data points), a neutral anchor (representing data points of average walkability), and a positive anchor (representing data points with the highest possible walkability indices). These anchors were used to establish a set of thresholds, i.e., where specific ranges of walkability indices begin and end in the embedding space. Each threshold was defined as a cosine distance from the positive anchor: the negative anchor defined the threshold for a walkability score of zero, the neutral anchor the threshold for a score of five, and the positive anchor itself corresponded to the maximum score. Since distances in the embedding space may not be proportional to the actual walkability scores, the neutral anchor was added with the intention of adjusting for this inequality and improving the scoring system’s outputs. The embedding of a given example was then situated on this threshold scale based on its distance from the positive anchor, and its absolute score was calculated through linear scaling, with the two thresholds as points of reference.
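
As an illustration of this scaling (under the assumption that the positive anchor corresponds to the maximum score of ten and that outputs are clamped to the $[0, 10]$ range), let $d(x)$ denote the cosine distance of an embedding $x$ from the positive anchor, and let $d_{neu}$ and $d_{neg}$ denote the distances of the neutral and negative anchors from it. The piecewise-linear score can then be written as:

\[score(x) = \begin{cases} 5 + 5\,\dfrac{d_{neu} - d(x)}{d_{neu}}, & d(x) < d_{neu} \\[6pt] 5\,\dfrac{d_{neg} - d(x)}{d_{neg} - d_{neu}}, & d_{neu} \le d(x) \le d_{neg} \\[6pt] 0, & d(x) > d_{neg} \end{cases}\]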

To obtain each of the anchors, a set of manually selected example sentences was constructed. Each sentence was meant to provide a specific, strong example of the type of descriptions the given anchor represents. Each sentence was then embedded with the fine-tuned encoder, and the entire set was averaged to produce the final vectorized anchor. The curation of these sentences was, nevertheless, not guided by any exact notions; after a number of experimental iterations, all three sets consisted of twelve exemplary sentences following the description structure introduced above.
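
The sketch below puts the anchor construction and the linear scaling together; `encoder` is the fine-tuned model, and the three sentence lists are stand-ins for the curated twelve-sentence sets.

```python
import numpy as np

def build_anchor(sentences, encoder):
    # average the embeddings of the exemplary sentences into a single anchor vector
    return np.mean(encoder.encode(sentences, normalize_embeddings=True), axis=0)

positive_anchor = build_anchor(positive_sentences, encoder)
neutral_anchor = build_anchor(neutral_sentences, encoder)
negative_anchor = build_anchor(negative_sentences, encoder)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d_neu = cosine_distance(neutral_anchor, positive_anchor)    # threshold for score 5
d_neg = cosine_distance(negative_anchor, positive_anchor)   # threshold for score 0

def walkability_score(embedding):
    d = cosine_distance(embedding, positive_anchor)
    if d <= d_neu:
        score = 5 + 5 * (d_neu - d) / d_neu                  # between neutral and positive
    else:
        score = 5 * (d_neg - d) / (d_neg - d_neu)            # between negative and neutral
    return float(np.clip(score, 0.0, 10.0))
```

Swapping the sentence sets that define the anchors is also what makes the preference-specific scoring described later possible, without re-encoding any geospatial points.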

Embedding Sets

A significant advantage of using a similarity-based scoring system lies in its computational efficiency once the point-wise embeddings are generated. After obtaining a fine-tuned model, the preferences (such as the various reference points) are reflected only in the anchors, not in the representations of the geospatial points. Therefore, to generate scores, the system only needs to embed the few walkability anchors and perform the linear-scaling scoring. Since cosine similarity is computationally inexpensive, this process is very quick and allows the geospatial embeddings of the entire area of interest to be pre-computed. A dataset of mappings from points (defined by geographical coordinates) to embedded descriptions can therefore be stored and reused in various downstream tasks.
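
A short sketch of this pre-computation step is shown below; the variable names and the compressed-archive storage format are assumptions.

```python
import numpy as np

# point_descriptions: compiled descriptions; point_coords: matching (lon, lat) pairs
point_embeddings = encoder.encode(point_descriptions, batch_size=256,
                                  normalize_embeddings=True, show_progress_bar=True)

np.savez_compressed("cambridge_embeddings.npz",
                    coords=np.asarray(point_coords),
                    embeddings=np.asarray(point_embeddings))
```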

Custom Preference Scoring

Despite the specialized fine-tuning, the embeddings created from descriptions of geospatial points can be used for more than strictly general walkability-focused tasks, for instance to express preferences towards particular geospatial areas or elements. In fact, by adjusting the anchors used in our linear scoring method, more specific pedestrian preferences can be used to generate the walkability scores. If the fine-tuning performed is sufficiently light, the encoder retains the semantic associations needed to reflect these preferences in its embeddings. The scoring pipeline then rewards data points closer to the preference-adjusted anchors and generates scores that lean towards the stated preferences. Specific implementations of this feature are discussed in the Evaluation chapter of this series.

Path-Finding

With access to point-wise walkability indices generated by our scoring pipeline, capable of producing evaluations at unrestricted spatial granularity, we assembled a new routing algorithm. Unlike existing approaches, our algorithm did not have to rely on costs calculated with manually fine-tuned static profiles. Instead, it was supported by scores calculated from the embeddings generated by the custom sentence encoders, and thus reflected the variety of our aggregated geospatial data. We used our OSM segment database to construct an infrastructure network and combined aggregates of the walkability or preference-based scores with the segment lengths to calculate a total cost for each segment in the network. To generate paths in this network, we used an A*-based search algorithm. Our implementation of A* was relatively straightforward: it relied on a unidirectional search with no particular tweaks or optimizations (such as contraction hierarchies), because pedestrian routing in urban areas was our only focus in the scope of this work. Hence, such adjustments and optimizations, often implemented by existing path-finding frameworks, were deemed unnecessary.
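
A condensed sketch of the graph construction and search, using networkx as a stand-in for our implementation, is shown below; `segments_db`, the attribute names, and the `segment_cost` and `heuristic` helpers (which follow the formulas discussed next) are assumptions.

```python
import networkx as nx

G = nx.Graph()
for seg in segments_db:                          # one entry per segmented OSM way
    cost = segment_cost(seg["point_scores"], seg["length"])
    G.add_edge(seg["start_node"], seg["end_node"],
               length=seg["length"], cost=cost)

# Unidirectional A* over the walkability-weighted network.
route = nx.astar_path(G, source=start_node, target=goal_node,
                      heuristic=heuristic, weight="cost")
```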

Cost Estimation

Establishing an effective approach to calculating the overall cost-so-far $g(n)$ for the A* algorithm required more nuance. This was primarily because of the point-based approach, where highly desirable (or undesirable) features were often reflected by only a few points. Moreover, depending on the anchor configuration, considerable differences between points were reflected only by marginal differences in the scores. Therefore, effectively preventing the “average” points from outweighing the critically important ones was necessary. Similarly, finding a working balance between the walkability scores and the distance (which still had to be reflected in the cost calculation) was crucial for the generation of desirable routes.

\[\text{segment cost} = \frac{n}{\sum_{i=1}^{n} \frac{1}{\text{inv. score}_i + \delta}} \cdot \text{segment length}\]

Considering these factors, a harmonic mean-based approach was eventually adopted. To calculate the cost of a specific segment, the above formula was used, with the constant $\delta$ equal to $10^{-6}$ and the scores proportionately inverted so that lower scores were “better” and resulted in lower costs.
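
Written out in code, the segment cost is a direct transcription of the formula above:

```python
def segment_cost(inv_scores, segment_length, delta=1e-6):
    # Harmonic mean of the inverted point scores (lower is better), scaled by length;
    # delta guards against division by zero for perfect scores.
    n = len(inv_scores)
    harmonic_mean = n / sum(1.0 / (s + delta) for s in inv_scores)
    return harmonic_mean * segment_length
```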

Heuristic Function

Similarly to related path-finding frameworks and implementations, the heuristic function used in this work remained simple. Our A* simply used the Euclidean distance between the iterated node and the target node, scaled by the globally lowest calculated per-unit-length cost. Because this scaled distance never exceeds the true remaining path cost, the heuristic remained a guaranteed underestimate and was, therefore, admissible. In this way, A* received an informed estimate with minimal computational overhead and without the risk of sub-optimality.
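
A sketch of this heuristic is shown below; `node_coords` (projected, metric coordinates) and `min_cost_per_meter` (the smallest per-unit-length segment cost in the network) are assumed to be pre-computed.

```python
import math

def heuristic(node, goal):
    dx = node_coords[node][0] - node_coords[goal][0]
    dy = node_coords[node][1] - node_coords[goal][1]
    # Euclidean distance scaled by the cheapest observed per-meter cost keeps the
    # estimate an underestimate of the true remaining cost (admissible).
    return math.hypot(dx, dy) * min_cost_per_meter
```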

References

  • Alfonzo, M. A. (2005). To Walk or Not to Walk? The Hierarchy of Walking Needs. Environment and Behavior, 37(6), 808–836.
  • Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., et al. (2023). Mistral 7B. https://arxiv.org/abs/2310.06825
  • Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084
  • Roussel, J.-R., Goodbody, T. R. H., & Tompalski, P. (2025). The lidR Package. https://r-lidar.github.io/lidRbook/
  • Tempelmeier, N., Gottschalk, S., & Demidova, E. (2021). GeoVectors: A Linked Open Corpus of OpenStreetMap Embeddings on World Scale. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 4604–4612.
  • Zúñiga González, A. C. (2025). Post-Processed LiDAR Point-Cloud Dataset. Unpublished dataset, provided by the author.