Walkability Chapter 5: Evaluation (Part 2: Walkability Assessment)
Categories: Walkability, AI/ML
Walkability Assessment
To address the inability of the open-source frameworks to identify highly walkable (or otherwise interesting) urban spaces, we employ our walkability assessment tool (presented in the previous parts). Furthermore, we run the walkability assessment experiments with two distinct sentence transformers to build a more comprehensive overview and to underline the dangers of over-clustering under contrastive fine-tuning.
In this section, we also highlight the accessibility and comprehensiveness of our approach to defining specific preferences. In contrast to the complicated routing profiles of the open-source baseline frameworks, our method relies on preferences expressed in plain natural-language sentences. We thereby demonstrate how our framework answers our third research question: how can user inputs be simplified?
Experimental Encoder Models
Besides the design of the sentence embedding strategy, the selection of the specific pre-trained sentence encoder and the degree of fine-tuning proved equally critical. While most of the considered encoders were trained on similarly large text corpora without any particular thematic specialization, their responses to fine-tuning varied considerably. Consequently, the pool of considered encoders eventually narrowed to two models: “all-mpnet-base-v2” and “all-MiniLM-L12-v2”. Both of these encoders are part of HuggingFace’s “sentence-transformers” library (Reimers and Gurevych 2019).
all-mpnet-base-v2 projects text inputs into a 768-dimensional vector space and is a fine-tuned variant of MPNet, a transformer-based model that improves over BERT and RoBERTa by combining masked language modeling with permutation-based training, thereby capturing semantic dependencies more effectively (Song et al. 2020).
The second sentence encoder, all-MiniLM-L12-v2, is based on MiniLM, an approach developed with the goal of compressing large transformer-based models, such as BERT, while minimizing loss in performance (Wang et al. 2020). The approach relies on a deep self-attention distillation, where a smaller “student” model learns by mimicking the self-attention behavior of a larger “teacher” model. Similar to all-mpnet-base-v2, all-MiniLM-L12-v2 is fine-tuned under a contrastive objective, but outputs embeddings of only 384 dimensions.
General Walkability
The same settings were used for both encoder models. In the anchor-based scoring system, outputs from the models were weighted against embeddings of identical preference anchors, generated from the same twelve sentences.
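As an illustration of the anchor-based idea, the following sketch combines cosine similarities between a point embedding and score-labeled anchor embeddings into a single 0–10 score. This is a minimal reconstruction under our own assumptions: random vectors stand in for real sentence-transformer outputs, and the actual anchor sentences and weighting scheme may differ.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def anchor_score(point_emb, anchors):
    """anchors maps a target score (e.g. 0, 5, 10) to a list of anchor
    embeddings; returns the similarity-weighted average of those scores."""
    sims, scores = [], []
    for score, embs in anchors.items():
        for emb in embs:
            sims.append(max(cosine(point_emb, emb), 0.0))  # ignore opposing directions
            scores.append(score)
    sims = np.array(sims)
    if sims.sum() == 0.0:
        return 0.0  # no positive similarity to any anchor
    return float(np.dot(sims / sims.sum(), np.array(scores)))

# Illustrative usage with random stand-in embeddings.
rng = np.random.default_rng(0)
anchors = {10: [rng.normal(size=8)], 5: [rng.normal(size=8)], 0: [rng.normal(size=8)]}
print(anchor_score(rng.normal(size=8), anchors))
```

A point embedding identical to a positive anchor would receive that anchor's score, while a point equally similar to all anchors lands near the middle of the scale.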
Table: Average general walkability score across various fine-tuning epochs
| Encoder Model | Vanilla | 1 ep. | 2 eps. | 5 eps. | 10 eps. | 15 eps. |
|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | 0.03 | 3.94 | 3.94 | 3.63 | 3.94 | 3.80 |
| all-MiniLM-L12-v2 | 0.01 | 2.45 | 2.65 | 3.42 | 2.97 | 3.23 |
The scores generated with the two encoders exhibited several shared patterns. As illustrated, embeddings produced by the off-the-shelf encoders largely failed to relate to the anchor embeddings. As a result, most representations generated with these vanilla encoders yielded extremely negative scores.
Nevertheless, both encoder models also demonstrated an ability to align their projections and adjust to the specific settings extremely quickly. The average inferred scores jumped significantly after only a single fine-tuning epoch, highlighting the encoders’ ability to adjust to the specific format of the point-description sentences and the efficiency of the contrastive fine-tuning approach.
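For intuition, the contrastive objective can be sketched as a cosine-embedding-style loss: positive pairs are pulled together, while negative pairs are pushed apart until their similarity drops below a margin. This is only an illustrative numpy sketch, not the training code of our framework; the margin value and exact loss formulation are assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_cosine_loss(emb_a, emb_b, label, margin=0.5):
    """label=1: similar pair, loss shrinks as cosine rises towards 1.
    label=-1: dissimilar pair, loss is zero once cosine falls below the margin."""
    cos = cosine(emb_a, emb_b)
    if label == 1:
        return 1.0 - cos
    return max(0.0, cos - margin)
```

Repeatedly minimizing a loss of this shape over labeled sentence pairs is what progressively widens the gap between the positive and negative clusters, the mechanism we later connect to over-clustering.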
Furthermore, in terms of the mean walkability scores, both all-mpnet-base-v2 and all-MiniLM-L12-v2 achieved relative consistency after the initial alignment during the first fine-tuning phase.

However, the relative consistency of the scores generated by our encoders did not imply stalled training. Instead, as the number of fine-tuning epochs rose, the encoders began over-clustering under the contrastive objective. Due to the multi-anchored scoring system, however, this resulted not only in extremely positive or negative scores but also in extremely “average” ones. This is well apparent in the table, where the scores of a rather positive example converge towards 5 as fine-tuning proceeds. We hypothesize this is caused by the progressively expanding margin between the highly positive (walkability scores of 7 or more) and negative (walkability scores of 3 or less) examples, which places mid-range examples into relative proximity to the neutral anchor. This over-clustering not only distorts the final scores but also suppresses the models’ ability to derive associations between various semantic features.
Table: Variance of general walkability scores over fine-tuning epochs
| Encoder Model | Vanilla | 1 ep. | 2 eps. | 5 eps. | 10 eps. | 15 eps. |
|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | 0.04 | 10.18 | 9.77 | 7.04 | 5.11 | 5.58 |
| all-MiniLM-L12-v2 | 0.01 | 7.74 | 7.76 | 7.33 | 5.30 | 5.11 |
The negative effects of prolonged fine-tuning are also reflected in the variance of the final scores. For both models, the generated scores attain peak variance after one or two epochs of fine-tuning but begin to fall as training continues. As discussed earlier, this is presumably caused by the increasing distance between the projections of the positive and negative anchors. As the contrastive fine-tuning shifts the projections to maximize this distance, the models’ original ability to extract features also begins to vanish. High score variance is therefore desirable in this case, as it reflects the system’s attention to individual features.
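The diagnostic role of the variance can be illustrated numerically. With invented sample scores (not our measurements), a spread-out score distribution from an early epoch has a much higher variance than an over-clustered one where most points collapse towards the neutral score of 5:

```python
import numpy as np

# Invented example scores, for illustration only.
early_epoch_scores = np.array([1.0, 2.5, 4.0, 6.0, 8.0, 9.5])  # spread out
late_epoch_scores  = np.array([4.2, 4.6, 5.0, 5.1, 5.4, 5.8])  # clustered near 5

print(np.var(early_epoch_scores))  # high variance: scores still distinguish points
print(np.var(late_epoch_scores))   # low variance: over-clustering flattens the scores
```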
Reflecting upon these observations, fine-tuning the models for two epochs appears to be a generally reasonable compromise. During such short fine-tuning, the encoders adjust to the task and sentence description formatting while maintaining a high variability of outputs. The outputs generated by these models are also generally agreeable upon manual review. For instance, while footpath segments in parks or pedestrian zones receive high scores, segments associated with private infrastructure or service areas are generally rated very poorly.
Greenery-focused objective
In the first hypothetical preference set, the scoring pipeline was configured to evaluate points with a preference towards the presence of greenery and green spaces. However, as greenery already constitutes an important aspect of evaluation under the general walkability criterion (which is also embedded in the encoder fine-tuning), this configuration merely aimed to emphasize the greenery preference. A new set of positive anchors was therefore created to reflect this objective, consisting mainly of common relevant elements such as trees, public furniture, or parks and gardens.
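Mechanically, switching the objective only requires swapping the positive anchor sentences; the rest of the scoring pipeline stays untouched. The sentences below are invented stand-ins, not the anchors actually used:

```python
# Invented example anchor sentences; the real anchor sets differ.
GENERAL_POSITIVE = [
    "A pleasant pedestrian street with benches and shops.",
    "A calm footpath through a residential neighbourhood.",
]
GREENERY_POSITIVE = [
    "A footpath lined with mature trees.",
    "A walkway through a public park with gardens and benches.",
]

def build_anchors(positive, neutral, negative):
    """Map target scores to their anchor sentences; the embeddings of
    these texts feed the same scorer regardless of the objective."""
    return {10: positive, 5: neutral, 0: negative}

greenery_anchors = build_anchors(
    GREENERY_POSITIVE,
    ["An ordinary street with mixed traffic."],
    ["A service road beside an industrial yard."],
)
```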
Table: Percentage difference between the mean greenery-focused and mean general walkability scores, computed from embeddings of the same model
| Encoder Model | Vanilla | 1 ep. | 2 eps. | 5 eps. | 10 eps. | 15 eps. |
|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | 587.21% | -21.62% | -11.12% | -7.46% | -4.30% | -6.65% |
| all-MiniLM-L12-v2 | 1504.94% | -17.18% | -10.98% | -12.76% | -1.98% | -0.92% |
Table: Variance in greenery-focused scores across fine-tuning epochs
| Encoder Model | Vanilla | 1 ep. | 2 eps. | 5 eps. | 10 eps. | 15 eps. |
|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | 0.82 | 5.29 | 6.78 | 4.85 | 4.38 | 3.94 |
| all-MiniLM-L12-v2 | 0.48 | 5.25 | 5.36 | 4.80 | 5.16 | 4.96 |
Despite its partial redundancy with the general case, the re-emphasis on greenery was still reflected in the generated outputs. Greenery-focused scores were typically lower than general walkability ones but converged towards the general scores as fine-tuning continued. We hypothesize this is also a result of the over-clustering phenomenon. Furthermore, mirroring the findings in the previous example, the gradual suppression of feature variability is reflected in the score variance as well: as the number of epochs increases, the variance of the scores decreases.
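The reported metric itself is simple: the relative difference between the mean objective-specific score and the mean general walkability score. A sketch with invented sample values:

```python
import numpy as np

def pct_diff(objective_scores, general_scores):
    """Percentage difference of the objective mean relative to the general mean."""
    g = np.mean(general_scores)
    return 100.0 * (np.mean(objective_scores) - g) / g

# Invented sample scores, illustration only.
general  = [6.0, 4.0, 5.0, 7.0]   # mean 5.5
greenery = [5.0, 3.5, 4.5, 6.0]   # mean 4.75
# A negative result means the objective scores sit below the general baseline.
print(pct_diff(greenery, general))
```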

The overall scores, nonetheless, reflected most expectations. As illustrated here, points in parks and close to natural elements were generally rated highly, while points in dense urban areas received low scores.
Shopping-focused objective
In the next experiment, we conceived a hypothetical preference towards shopping-related areas (such as shopping malls and places near various kinds of stores) and embedded it into the scoring pipeline. Again, this was done simply by rewriting the set of positive anchor sentences. In this example, we further measured the ability of the scoring mechanism and, more importantly, of the generated embeddings to reflect individual elements stated directly in the anchors. Although a preference towards shopping areas does not require much of the encoder’s ability to form semantic associations (as the number of related features and terms is much more limited), this objective sits further from the general walkability objective than the greenery-focused case: while shopping areas can be expected to be, on average, relatively walkable, they do not correlate with walkability in any generalizable way.
Table: Percentage difference between the mean shopping-focused and mean general walkability scores, computed from embeddings of the same model
| Encoder Model | Vanilla | 1 ep. | 2 eps. | 5 eps. | 10 eps. | 15 eps. |
|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | -98.56% | -24.29% | -20.36% | -10.07% | -4.76% | -6.93% |
| all-MiniLM-L12-v2 | -94.22% | -19.46% | -10.84% | -10.49% | -0.90% | -2.12% |
Table: Variance in shopping-focused scores across fine-tuning epochs
| Encoder Model | Vanilla | 1 ep. | 2 eps. | 5 eps. | 10 eps. | 15 eps. |
|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | 0.00 | 4.78 | 4.63 | 4.60 | 4.19 | 3.85 |
| all-MiniLM-L12-v2 | 0.00 | 4.28 | 4.62 | 4.61 | 5.12 | 5.03 |
We hypothesize that this narrower intersection between the shopping-focused objective and general walkability is reflected in the performed measurements, which diverge from the trends observed under the greenery-focused objective. In the score analysis, an unforeseen widening of the margin between the general walkability and shopping-focused scores appears at the fifteenth fine-tuning epoch for both encoders. We hypothesize that some of the features that are, under the shopping objective, expected to lie close together are instead pulled apart by the contrastive training. Similar noise, likely rooted in the same conflict of representations, is observed in the score variance measurements.


Despite that, the embeddings generated by lightly fine-tuned encoders still produced relevant point-wise scores with high variance. For completeness, the visual comparison between the shopping-focused and general walkability scores is included, although, in this case, the visualization of the actual scores indicates the overall accuracy better.
Historically-focused objective
In the next experimental case, the scoring pipeline is repositioned to reward points associated with historical elements, such as old buildings, monuments, or museums. This case was meant to represent an objective even more distant from general walkability than the shopping-focused one. In terms of relatedness to the definition of walkability used in the contrastive task, historical elements are even more semantically distant than the factors defined by the shopping- or greenery-focused objectives. Furthermore, the notion of historicity was expected to be more challenging to capture in the textual anchors.
Table: Percentage difference between the mean historically-focused and mean general walkability scores, computed from embeddings of the same model
| Encoder Model | Vanilla | 1 ep. | 2 eps. | 5 eps. | 10 eps. | 15 eps. |
|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | 587.21% | -21.62% | -11.12% | -7.46% | -4.30% | -6.65% |
| all-MiniLM-L12-v2 | 1504.94% | -17.18% | -10.98% | -12.76% | -1.98% | -0.92% |
Table: Variance in historically-focused scores across fine-tuning epochs
| Encoder Model | Vanilla | 1 ep. | 2 eps. | 5 eps. | 10 eps. | 15 eps. |
|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | 0.78 | 8.50 | 5.58 | 4.73 | 4.34 | 3.89 |
| all-MiniLM-L12-v2 | 1.02 | 6.24 | 5.72 | 4.99 | 6.99 | 5.71 |
Mirroring these challenges, noise similar to that observed in the shopping-focused case is present in the score evaluation here. For instance, for the architecture based on all-mpnet-base-v2, the convergence towards the general walkability scores is not as consistent as in the greenery-focused case. Similarly, the variance of scores generated with the model based on all-MiniLM-L12-v2 exhibits comparable behavior, as shown.

Nonetheless, even in these challenging settings, the scores generated with lightly fine-tuned encoders appeared to satisfy our objective, as highlighted by the visualization in Figure 1.10.
Safety-focused objective
Finally, we apply our scoring system to a difficult-to-define yet highly practical safety-oriented objective. Relying on the richness of the data provided by OSM, elements that typically contribute to a feeling of public safety (such as street lighting, security cameras, or public-service-related facilities and infrastructure) are used in the anchor definitions. Nevertheless, due to the loose correlation between these particular elements and the general walkability evaluation, generating scores under this objective proved the most difficult.
Table: Percentage difference between the mean safety-focused and mean general walkability scores, computed from embeddings of the same model
| Encoder Model | Vanilla | 1 ep. | 2 eps. | 5 eps. | 10 eps. | 15 eps. |
|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | -99.73% | -27.07% | -20.21% | 16.50% | 11.43% | 20.77% |
| all-MiniLM-L12-v2 | -99.60% | -32.54% | -22.02% | -12.53% | 57.89% | 56.92% |
Table: Variance in safety-focused scores across fine-tuning epochs
| Encoder Model | Vanilla | 1 ep. | 2 eps. | 5 eps. | 10 eps. | 15 eps. |
|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | 0.00 | 4.46 | 4.87 | 10.16 | 6.93 | 9.03 |
| all-MiniLM-L12-v2 | 0.00 | 3.50 | 4.29 | 4.82 | 16.20 | 15.53 |
Unlike any of the previous preference-specific cases, the safety-focused objective caused the mean scores to rise above the mean general walkability scores, without ever converging. Furthermore, the variance of the safety-focused scores was somewhat inconsistent, alternately rising and falling.

The generated safety-focused map reflected these observations. As demonstrated, certain areas (such as parks) generally seemed to suffer under these specific preferences, whereas others did unexpectedly well. We conclude this is due to both the high diversity and the sparsity of geospatial records that could be used to reliably measure safety levels across entire urban areas. Furthermore, we argue it was also caused by the clear semantic divergence between the elements associated with the fine-tuning objective (general walkability) and the scoring objective (safety).
References
- Reimers, Nils, & Gurevych, Iryna. (2019). Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084
- Song, Kaitao, Tan, Xu, Qin, Tao, Lu, Jianfeng, & Liu, Tie-Yan. (2020). MPNet: Masked and Permuted Pre-Training for Language Understanding. Advances in Neural Information Processing Systems, 33, 16857–16867.
- Wang, Wenhui, Wei, Furu, Dong, Li, Bao, Hangbo, Yang, Nan, & Zhou, Ming. (2020). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. Advances in Neural Information Processing Systems, 33, 5776–5788.