Here To Stay

Between living and leaving: the right to inhabit the city

The Project

Here To Stay is a Linked Open Data project reflecting on the phenomenon of touristification in the city of Bologna. In this context, the limited availability of housing is a well-known issue for students and, more broadly, for families. The aim of the project is to explore the relationship between these two dynamics and to reflect on the social impacts of the increasing flow of visitors in the city, producing informed conclusions and graphical visualizations to showcase the data analysis.

Tourism in Italy was regulated through a comprehensive codification for the first time in 2011. Prior to this, the sector relied on fragmented and independent laws, lacking a unified legislation.
The 2011 Tourism Code consolidated national tourism regulations, defining tourism activities, accommodation types, market operators, and travelers’ rights within a unified legal framework. Among the types of accommodation are also short-term accommodation rentals, which were then categorized as 'residential leases for tourist purposes' governed by the Civil Code. By treating these units as private leases rather than professional tourist facilities, the law created a regulatory gap that favored the rapid conversion of residential housing into short-term tourist accommodation, directly impacting the dynamics of the local housing market.
The datasets we chose to make part of our project cover the time span between 2014 and 2024, a period which we considered to be long enough to evaluate evolutions in the number of accommodation facilities in the city of Bologna and sufficiently distant from the entry into force of the Tourism Code to be able to visualize its long-term effects.
The hypothesis at the core of our project is that the increase in the number of accommodation facilities in the city centre and outer zones has reduced the housing supply in a market characterized by constant demand, thereby driving the upward trend in rental prices. The datasets included in our study were selected to verify this initial hypothesis, then to evaluate the effects of subsequent legislation introduced within the analyzed time span, and finally to focus on the impact on the student population.

Datasets

The six original datasets we selected come from different sources and have been published under different licenses. Some of them came in a format which needed conversion to become usable to conduct the analysis, so we pointed out both the original and the converted format for them. The operations of data cleaning and minimization were carried out using the Knime platform.
The six mashup datasets were derived from the analysis and work conducted on the six source ones using Knime as well. Visualizations have been produced to provide graphical evidence for the answers to our research questions.

D1 - Accommodation Facilities in Bologna
  • List of establishments belonging to the category of accommodation activities
  • Source: Open Data Bologna
  • Format: .csv
  • URL: D1
  • License: CC-BY 4.0
D2 - Accommodation Capacity
  • Capacity of collective tourist accommodation by type of accommodation and municipality
  • Source: ISTAT
  • Format: .xlsx, (.csv)
  • URL: D2
  • License: CC-BY 4.0
D3 - Tourist Flow
  • Occupancy in collective accommodation establishments at municipality level by type of accommodation and residence of guests
  • Source: ISTAT
  • Format: .xlsx, (.csv)
  • URL: D3
  • License: CC-BY 4.0
D4 - Legislations
  • National legislations on tourist rentals
  • Source: Normattiva
  • Format: .csv
  • URL: D4
  • License: CC-BY 4.0
D5 - Rental Prices
  • Rental prices of residential properties offered for rent through the Idealista platform in Bologna
  • Source: Idealista
  • Format: .html, (.csv)
  • URL: D5
  • License: Proprietary
D6 - Student Enrollment by Residence
  • Students enrolled by residence and university of enrollment
  • Source: MUR
  • Format: .csv
  • URL: D6
  • License: Public Domain
MD1 - Istat VS Open Data Accommodations
  • Comparison between ISTAT and municipal Open Data on accommodation establishments in Bologna
  • Source: D1, D2
  • Format: .csv
  • URL: MD1
  • License: CC-BY 4.0
MD2 - Tourism Carrying Capacity
  • Estimate of tourism carrying capacity in Bologna
  • Source: D2, D3
  • Format: .csv
  • URL: MD2
  • License: CC-BY 4.0
MD3 - Legislative Impact
  • Impact of legislations on tourist rentals in Bologna
  • Source: D1, D2, D4
  • Format: .csv
  • URL: MD3
  • License: CC-BY 4.0
MD4 - Price VS Facilities Growth
  • Comparison between growth rate of rental prices and accommodation establishments in Bologna
  • Source: D1, D5
  • Format: .csv
  • URL: MD4
  • License: CC-BY 4.0
MD5 - Regional Enrollment VS Rent Prices
  • Comparison between university enrollments and rental prices at regional level
  • Source: D5, D6
  • Format: .csv
  • URL: MD5
  • License: CC-BY 4.0
MD6 - Regional Sensitivity Index
  • Regional sensitivity index to student enrollments
  • Source: D5, D6
  • Format: .csv
  • URL: MD6
  • License: CC-BY 4.0

Datasets Analyses

Quality Analysis

The team conducted the quality analysis of the original datasets following the Linee guida nazionali per la valorizzazione del patrimonio informativo pubblico (AgID). This framework evaluates data quality based on four key dimensions:

  • Accuracy measures the degree to which data correctly represents the real-world values it intends to model. It is calculated as the ratio of correct values to the total values requiring validation. The guidelines distinguish between syntactic accuracy (conformity to a defined format or structure) and semantic accuracy (factual correctness regarding the real-world entity).
  • Coherence ensures data is free from internal contradictions and adheres to defined semantic rules (e.g. verifying that aggregated totals match the sum of disaggregated components). The metric is defined as the ratio of attributes satisfying logic rules to the total attributes subject to those rules.
  • Completeness assesses the presence of values for all the expected attributes. This dimension is evaluated through three specific measures:
    • Schema completeness: the percentage of null values relative to the total number of expected values;
    • Record completeness: the ratio of non-null data elements within a single record compared to the total number of fields where completeness is applicable for that record;
    • Population Completeness: the percentage of missing values relative to a specific reference population.
  • Timeliness evaluates how up-to-date the dataset is relative to its usage context.
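
The ratios defined above can be illustrated with a minimal Python sketch. It is not the workflow used in the project (which relied on Knime); the function names, the field names and the sample records are assumptions introduced only for illustration, and the schema-level measure follows the wording above (share of null values among all expected values).

```python
from typing import Any, Mapping, Optional, Sequence

Row = Mapping[str, Optional[Any]]


def is_null(value: Optional[Any]) -> bool:
    """Treat None and empty or whitespace-only strings as missing values."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def accuracy(correct_values: int, validated_values: int) -> float:
    """Ratio of correct values to the total values requiring validation."""
    return correct_values / validated_values


def record_completeness(row: Row, applicable_fields: Sequence[str]) -> float:
    """Share of non-null fields among the fields where completeness applies."""
    filled = sum(1 for field in applicable_fields if not is_null(row.get(field)))
    return filled / len(applicable_fields)


def null_share(rows: Sequence[Row], expected_fields: Sequence[str]) -> float:
    """Percentage of null values relative to all expected values."""
    expected = len(rows) * len(expected_fields)
    nulls = sum(
        1 for row in rows for field in expected_fields if is_null(row.get(field))
    )
    return 100 * nulls / expected


# Hypothetical records: the field names are illustrative, not the real D1 schema.
rows = [
    {"name": "HOTEL A", "type": "ALBERGO", "phone": None},
    {"name": "B&B B", "type": "AFFITTACAMERE", "phone": ""},
]
fields = ["name", "type", "phone"]

print(record_completeness(rows[0], fields))  # two of three fields are filled
print(null_share(rows, fields))              # two of six expected values are null
```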

D1 - Accommodation Facilities in Bologna
Accuracy Syntactic accuracy: the dataset demonstrates strong structural consistency, with key text fields adhering to uniform formatting conventions (e.g., standardized uppercase text). Additionally, the addresses are validated against the official municipal toponymy (SUAP system), which ensures that these elements are syntactically valid and minimizes geocoding failures.
Semantic accuracy: certain classification fields may contain generic placeholder values (such as “non definite”). While syntactically correct, these values represent a semantic gap, as the records lack descriptive specificity regarding the type of facility. The overall internal accuracy is solid. The dataset reliably represents the administrative reality of the authorized facilities.
Coherence The dataset demonstrates high internal coherence. The classification of business types utilizes a controlled vocabulary with no free-text contradictions or logical conflict between columns.
Completeness Schema and Record completeness: the schema is fully defined, containing all core attributes for identification. While mandatory fields are densely populated, we observed a significant percentage of null values in secondary columns, reducing the effective completeness of the records.
Population completeness: the dataset provides a complete enumeration of the authorized facilities, covering the entire Municipality of Bologna. Notably, the population includes both Active and Ceased entities.
Timeliness We detected a discrepancy between the declared metadata and the system logs. While the metadata indicates an “Annual” update frequency, the “Last Processing” date suggests the dataset is re-processed automatically on a much more frequent basis (likely daily or weekly). Given the recent processing date, the dataset offers high currency, reflecting the real-time status of the administrative database despite the conservative “Annual” label.

D2 - Accommodation Capacity
Accuracy Syntactic accuracy: being an official statistical product, the data adheres to strict Istat validation standards. Categorical variables follow standardized codes, and numerical fields are strictly typed, ensuring seamless integration.
Semantic accuracy: the count of traditional facilities is highly accurate. However, regarding the extra-alberghiero sector, semantic accuracy relies on administrative declarations provided by local bodies. While the data correctly represents the registered reality, it may suffer from a semantic gap due to delayed registrations or omitted declarations by hosts.
Coherence The dataset respects all defined semantic and mathematical rules. We verified that aggregated totals strictly match the sum of their disaggregated components.
Completeness Schema and Record Completeness: all expected attributes required for the analysis are present and fully populated for the Municipality of Bologna. There are no missing values in the core metrics.
Population Completeness: while the dataset covers 100% of the legal/administrative population (registered businesses), it exhibits a significant population gap regarding the «real» tourism phenomenon. Istat datasets primarily capture registered facilities, suffering from under-reporting of the informal market – specifically short-term rentals that operate as private leases rather than registered businesses (Case Vacanze). Therefore, this dataset must be interpreted as representing legal capacity, which is a subset of real capacity.
Timeliness The dataset has a known Annual frequency, but is subject to a significant publication lag (typically 12-18 months) due to the validation process required for official statistics.
While precise, the timeliness is insufficient for real-time monitoring of the housing emergency. The analysis must acknowledge that the most recent wave of conversions from residential to tourist rentals (occurring in the last 12 months) may not yet be captured in this specific dataset.

D3 - Tourist Flow
Accuracy Syntactic accuracy: the dataset adheres to strict Istat validation standards. Categorical variables follow standardized statistical codes, ensuring zero formatting errors.
Semantic accuracy: for the traditional sector (Esercizi alberghieri), accuracy is high, as reporting is strictly enforced and structurally aligned with Public Security registration obligations (Alloggiati Web), ensuring near-census coverage of the official flows. For the Extra-alberghiero sector, semantic accuracy is vulnerable to under-reporting. Since flow data relies on hosts declaring the number of guests, there is a financial incentive for less professional operators to under-report the actual number of nights spent. Consequently, the data represents the declared flows rather than the absolute real ones.
Coherence We verified the mathematical consistency of the dataset, ensuring that aggregate figures strictly match the sum of their disaggregated components, and that the logical constraints between related variables were respected across all records.
The classification of accommodation types is consistent with the Capacity dataset (D2), allowing for the calculation of derived indicators without classification conflicts.
Completeness Schema and record completeness: the dataset provides a complete matrix of attributes, including client residence and type of accommodation. There are no missing values for the Municipality of Bologna in the aggregate totals.
Population completeness: while the dataset covers 100% of flows in registered facilities, it suffers from the same population gap identified in the Capacity dataset (D2). It completely excludes the «informal tourism» sector. Therefore, these figures should be interpreted as Official Tourist Demand, which is significantly lower than the total City Users pressure on the territory.
Timeliness The complexity of collecting and validating flow data (which changes daily) results in a «publication lag» of 12–18 months. Data is typically aggregated annually in the main extracts. If the project requires analysing seasonality, the user must rely on monthly disaggregations, which are often released with a further delay compared to annual totals.
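
The additive check described under Coherence for the two ISTAT datasets (aggregated totals matching the sum of their disaggregated components) can be sketched as follows. This is an illustrative sketch, not the project's Knime workflow: the field names and figures are hypothetical, and the ratio is computed per record as a simplifying assumption.

```python
from typing import List, Mapping, Sequence

Row = Mapping[str, int]


def incoherent_rows(
    rows: Sequence[Row], total_field: str, component_fields: Sequence[str]
) -> List[Row]:
    """Return rows whose aggregated total differs from the sum of its components."""
    return [
        row
        for row in rows
        if row[total_field] != sum(row[field] for field in component_fields)
    ]


def coherence_ratio(
    rows: Sequence[Row], total_field: str, component_fields: Sequence[str]
) -> float:
    """Share of rows that satisfy the additive rule."""
    failing = incoherent_rows(rows, total_field, component_fields)
    return (len(rows) - len(failing)) / len(rows)


# Hypothetical figures and field names, used only to exercise the check.
sample = [
    {"year": 2022, "hotel": 120, "extra_hotel": 80, "total": 200},
    {"year": 2023, "hotel": 125, "extra_hotel": 90, "total": 210},  # 215 expected
]
components = ["hotel", "extra_hotel"]

print(incoherent_rows(sample, "total", components))
print(coherence_ratio(sample, "total", components))
```

A ratio of 1.0 corresponds to the situation reported above, where every total matches its components.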

D4 - Legislations
Accuracy Syntactic accuracy: the text was retrieved directly from Normattiva, the only database legally guaranteeing the reliability of the Italian regulatory corpus. This ensures zero transcription errors compared to secondary sources.
Semantic accuracy: the semantic accuracy in a legal context is defined by Vigenza. We verified that the texts selected correspond to the version currently in force (vigente) at the time of extraction (January 2026). We recorded the URI for each record, ensuring a permanent and unambiguous reference to the official legal act.
Coherence The dataset follows a consistent, manually defined schema, with all records adhering to this structure without missing data. The dataset demonstrates high internal coherence, as the selected laws form a unified legislative framework specifically dedicated to the regulation of short-term rentals (affitti brevi). We verified that the selected acts are complementary, creating a logical continuum from general rules to specific administrative obligations, without contradictions in their application scope.
Completeness Schema and record completeness: every record in the dataset contains all necessary metadata fields required for the qualitative analysis.
Population completeness: unlike statistical datasets where 100% population coverage is the goal, this dataset relies on targeted selection. The population is intentionally partial, defined as the subset of regulations impacting the touristification and housing scenario in Bologna.
Timeliness The dataset is defined by its temporal validity relative to the project timeline. It represents a legislative snapshot aligned with the analysis window. While the dataset is static and will not be dynamically updated after the project’s conclusion, its currency is high for the scope of the research. It reflects the regulatory framework active at the moment of the analysis, serving as a baseline for interpreting the data.

D5 - Rental Prices
Accuracy Syntactic accuracy: the data is generated automatically by the portal’s algorithms without manual data entry or transcription phases. This ensures a syntactic error rate close to zero.
Semantic accuracy: the dataset exhibits a known semantic bias, since it represents the «Prezzo Richiesto» (the asking price in the listing) rather than the «Prezzo di Transazione» (the actual amount agreed). In a high-demand market like Bologna, the «Prezzo Richiesto» serves as a strong proxy for market trends, but it implies a systematic exclusion of the negotiation margin. As stated in the methodology, the indices are based on active listings (offerta attiva) reflecting seller expectations rather than finalized economic exchanges.
Coherence The time series is internally consistent, calculating averages based on a standardized definition of Euro per square meter (€/m²). We identified a discontinuity point in the time series of March 2019, when Idealista updated its calculation methodology to improve sample reliability. Consequently, comparisons crossing this date should be interpreted with caution.
Completeness Population completeness: the dataset covers exclusively the supply listed on the Idealista marketplace. This introduces a single-source bias, as it excludes competing supply present only on other portals and the private leases concluded via informal channels. We extracted data for both the Centro Storico and external zones to perform a comparative center-periphery analysis. This granularity is essential for measuring how price pressures in the tourist-heavy core propagate to outer zones.
Timeliness The dataset offers high currency (monthly updates). While official statistics (Istat) often lag by 12-18 months, Idealista reflects market changes in real-time. This reactivity allows for the detection of immediate market shocks that structural data would miss.
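
The caution about comparisons crossing the March 2019 methodology change can be made operational with a small helper. The sketch below is only illustrative: the function name is hypothetical, and placing the cut-off on 1 March 2019 (so that March 2019 observations count as post-change) is an assumption, since the exact day of the change is not stated here.

```python
from datetime import date

# Assumption: observations dated from 1 March 2019 onwards use the new methodology.
METHODOLOGY_BREAK = date(2019, 3, 1)


def crosses_break(start: date, end: date, break_date: date = METHODOLOGY_BREAK) -> bool:
    """True when a comparison mixes observations from before and after the break."""
    first, last = sorted((start, end))
    return first < break_date <= last


print(crosses_break(date(2018, 6, 1), date(2020, 6, 1)))  # spans the change
print(crosses_break(date(2020, 1, 1), date(2024, 1, 1)))  # entirely after it
```

Comparisons for which the helper returns True are the ones to interpret with caution.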

D6 - Student Enrollment by Residence
Accuracy Syntactic accuracy: data originates directly from the Anagrafe nazionale degli studenti e dei laureati (ANS), the administrative server used to track academic careers. University codes (Codici Ateneo) and degree course codes adhere to strict ministerial standards, ensuring zero syntactic errors.
Semantic accuracy: unlike survey-based estimations, these figures correspond to actual tuition fees paid and legally registered academic careers.
Coherence The dataset respects strict additive rules. We verified that the total number of enrolled students equals the sum of disaggregated components. The use of standardized Ministerial Codes allows for a seamless mash-up with other institutional databases, facilitating multidimensional analysis without join errors.
Completeness Schema and record completeness: the dataset provides a complete matrix of attributes with no missing values for the University of Bologna, with all critical segmentation variables fully populated.
Population completeness: the dataset covers 100% of the recognized university population in Italy. The dataset includes the Regione di residenza attribute. This is the vital component for our analysis, as it allows us to calculate the Student-Housing Mismatch. We can filter out local residents to isolate the specific subset of fuorisede students who actually generate demand on the rental market.
Timeliness The update frequency is annual (Academic Year cycle). The dataset is current up to the 2024/2025 academic year (provisional), providing an up-to-date baseline for statistical analysis.
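
The filtering step mentioned under Population completeness (excluding local residents to isolate fuorisede students) might look like the following sketch. The field names and figures are hypothetical, and treating every student resident in Emilia-Romagna as "local" is a simplifying assumption of this example rather than a rule stated in the dataset.

```python
from typing import Mapping, Sequence, Union

Row = Mapping[str, Union[str, int]]

# Assumption: residence in the university's own region is used as a proxy for "local".
HOME_REGION = "EMILIA-ROMAGNA"


def off_site_enrollment(rows: Sequence[Row], home_region: str = HOME_REGION) -> int:
    """Sum enrolled students whose region of residence differs from the home region."""
    return sum(
        int(row["students"])
        for row in rows
        if str(row["region_of_residence"]).strip().upper() != home_region
    )


# Hypothetical counts, used only to exercise the filter.
enrollment = [
    {"region_of_residence": "Emilia-Romagna", "students": 100},
    {"region_of_residence": "Puglia", "students": 40},
    {"region_of_residence": "Sicilia", "students": 25},
]

print(off_site_enrollment(enrollment))
```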

Ethical Analysis

The ethical framework of this project is structured in accordance with the European Commission’s Guidelines on Ethics and Data Protection. To facilitate a rigorous assessment, we integrated the Principles of Data Ethics developed by DataEthics.eu alongside the Open Data Institute’s (ODI) Data Ethics Canvas.

Principles of Data Ethics

This project investigates the phenomenon of "touristification" (tourism-led gentrification) in Bologna and its impact on housing availability and university enrollment. At the core of this analysis lie the Right to the City and the Right to Education. The primary objective is to generate public value for residents and the student community rather than serving commercial or tourism-industry interests. By providing a data-driven interpretative framework, the study empowers citizens to visualize and understand the socio-economic forces reshaping their urban environment.

The study relies on the secondary use of administrative and statistical datasets, previously collected for other purposes. In accordance with GDPR principles, specifically those concerning processing for scientific and statistical purposes, we have implemented data minimisation protocols. Although direct involvement of data subjects was not possible due to the nature of the sources, we ensured that only the necessary information was processed. All findings are presented in an aggregated format, preventing any direct impact on the self-determination of the individuals represented in the data.

All data sources (Bologna Open Data, ISTAT, MUR-USTAT and Idealista) are publicly accessible and explicitly cited. All transformations, filtering methods and indicators (e.g., growth rates and correlations) are documented in a reproducible workflow. This documentation discloses the methodological limitations and potential biases inherent in both the original and integrated datasets.
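
For readers who want to reproduce the kind of indicators mentioned above, the sketch below shows one common way to compute year-on-year growth rates and a Pearson correlation coefficient with the Python standard library. It is an illustration, not the documented Knime workflow; the function names and the sample figures are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def growth_rates(series: Sequence[float]) -> List[float]:
    """Year-on-year percentage change between consecutive annual values."""
    return [100 * (cur - prev) / prev for prev, cur in zip(series, series[1:])]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical annual values (not project data).
facilities = [100, 110, 121, 140]
prices = [10.0, 10.5, 11.5, 13.0]

print(growth_rates(facilities))
print(pearson(growth_rates(facilities), growth_rates(prices)))
```

A high coefficient obtained this way indicates that the two growth series move together; on its own it says nothing about which one drives the other.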

Following the principle of Data Protection by Design, we selected datasets that were pre-anonymised at the source and limited the amount and granularity of data processed to what was strictly necessary for our research questions. For private-sector data (Idealista), we ensured that the extraction and usage remained within a clearly documented academic‑research purpose, avoiding any infringement of proprietary rights or individual privacy. Internal responsibility was maintained for every stage of data cleaning and aggregation to prevent the introduction of errors or misinterpretations. These choices are documented so that the research team can be held accountable for how data were selected, processed and interpreted.

Our research investigates "touristification" as a socio-economic phenomenon affecting the real estate market and the fundamental rights of citizens, with a specific focus on university students and vulnerable groups. In alignment with the principle of Equality, this analysis explicitly seeks to identify potential discrimination or stigmatization based on financial and social conditions. By crossing regional enrollment trends with rental price indices, the analysis seeks to identify patterns of "educational exclusion." This approach highlights how specific demographic or geographic groups may be disproportionately affected by the rising cost of living, thereby fulfilling the ethical mandate to protect vulnerable populations from systemic discrimination.

Analysis of Data Sources and Dataset-Specific Ethical Risks

To ensure a comprehensive ethical assessment, and given that the project integrates datasets from diverse origins, we analyzed each source's provenance and potential ethical risks.

Open Data - Comune di Bologna

The dataset regarding Tourist Accommodation Businesses (D1) was sourced from the Municipality of Bologna’s Open Data portal. This platform operates under European and national open data directives, ensuring that data are pre-anonymised and compliant with legislative principles.
The primary ethical risk identified was the presence of granular geospatial data (addresses and house numbers). While geospatial information does not fall under the category of personal data, it poses a risk of indirect de-anonymisation when cross-referenced with other datasets. Since many extra-hotel facilities are operated by private individuals at their place of residence, publishing such details could compromise their privacy. Furthermore, data concerning "ceased" activities could inadvertently disclose an individual’s economic failure, potentially impacting their human dignity. To mitigate these risks, we applied the principle of data minimisation, removing all specific geographic identifiers as they were not essential for the macro-level purposes of our analysis.
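
A minimal sketch of this minimisation step is given below. The column names are hypothetical (the real D1 schema may differ), and the project itself performed the cleaning in Knime; the point is only that the geographic identifiers are dropped before any further processing.

```python
from typing import Any, Dict, FrozenSet, List, Mapping, Sequence

# Hypothetical column names for the geographic identifiers to be removed.
GEOGRAPHIC_FIELDS: FrozenSet[str] = frozenset(
    {"address", "house_number", "latitude", "longitude"}
)


def minimise(
    rows: Sequence[Mapping[str, Any]], drop: FrozenSet[str] = GEOGRAPHIC_FIELDS
) -> List[Dict[str, Any]]:
    """Return copies of the records without the fields listed in `drop`."""
    return [{key: value for key, value in row.items() if key not in drop} for row in rows]


# Hypothetical record, used only to exercise the function.
raw = [
    {"type": "B&B", "status": "ACTIVE", "address": "VIA EXAMPLE", "house_number": "1"},
]

print(minimise(raw))
```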

ISTAT

Tourism statistics (D2, D3) were derived from ISTAT, which operates under a rigorous ethical framework of independence and confidentiality aligned with the European Statistics Code of Practice. The team identified representation bias as the most significant ethical concern. ISTAT datasets on tourist accommodations may suffer from "under-reporting," as they primarily capture registered facilities, potentially overlooking the informal or "grey" short-term rental market (the sommerso). This discrepancy is evident when comparing ISTAT data with those collected by the Municipality of Bologna in Open Data, which provide more granular tracking of facility life cycles. To mitigate this bias of omission, we cross-referenced both sources to provide a more accurate and ethically responsible assessment of urban tourist pressure, ensuring transparency in our reporting.

Idealista

The inclusion of market data from Idealista (D5) represents the most significant ethical challenge of this project. Unlike institutional sources, Idealista is a commercial entity whose data is generated for profit-making purposes. Our secondary use is justified by the educational nature of this project. To address the lack of explicit consent from data subjects, we process the data exclusively in its most aggregated form to prevent any direct or indirect identification.
Furthermore, we commit to deleting the raw data from our public platforms following the academic assessment, retaining only the visualizations derived from its integration with other datasets. Throughout the project, the source has been transparently cited to avoid any misattribution.
Methodologically, we disclose a lack of transparency regarding the platform's proprietary algorithms; thus, the dataset is defined as a proxy for “asking prices” rather than finalized contracts. Ethically, it is used here only to represent the "market reality" faced by students and families.
Finally, we acknowledge a structural selection bias: the dataset reflects a "digital-only" marketplace. It does not account for the entire housing market, which also includes informal rental networks and historical long-term contracts that do not pass through digital platforms.

MUR-USTAT

Student enrollment data (D6) were collected from the MUR-USTAT Open Data platform. This dataset contains sensitive information concerning the residence and citizenship of university students; the ethical aspects of its handling therefore have to be carefully considered. Our dataset contained information about the number of students by province of residence, province of study, university and disciplinary group. While the data arrived anonymised, we identified a risk of indirect re-identification through the combination of variables. For instance, crossing a specific province of residence with a niche "disciplinary group" could isolate individual student trajectories. In accordance with EU Ethics and Data Protection guidance, we mitigated this by clustering students at the regional level and deleting non-essential granular information, such as specific disciplinary codes, thereby upholding the principle of privacy by design.
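
The mitigation described above (clustering at regional level and discarding disciplinary codes) can be sketched as a simple aggregation. The field names, the partial province-to-region lookup and the counts are assumptions made for this example; the project performed the equivalent step in Knime.

```python
from collections import Counter
from typing import Dict, Mapping, Sequence, Union

Row = Mapping[str, Union[str, int]]

# Partial lookup for illustration only; a real run would need all Italian provinces.
PROVINCE_TO_REGION = {
    "BOLOGNA": "EMILIA-ROMAGNA",
    "MODENA": "EMILIA-ROMAGNA",
    "NAPOLI": "CAMPANIA",
    "BARI": "PUGLIA",
}


def aggregate_by_region(rows: Sequence[Row]) -> Dict[str, int]:
    """Sum students per region; province and disciplinary group are not retained."""
    totals: Counter = Counter()
    for row in rows:
        region = PROVINCE_TO_REGION[str(row["province_of_residence"]).upper()]
        totals[region] += int(row["students"])
    return dict(totals)


# Hypothetical records, used only to exercise the aggregation.
students = [
    {"province_of_residence": "Bologna", "disciplinary_group": "A", "students": 10},
    {"province_of_residence": "Modena", "disciplinary_group": "B", "students": 5},
    {"province_of_residence": "Napoli", "disciplinary_group": "A", "students": 3},
    {"province_of_residence": "Napoli", "disciplinary_group": "C", "students": 2},
]

print(aggregate_by_region(students))
```

Because only the region and the summed count survive, small cells defined by province and disciplinary group can no longer be singled out in the output.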

Ethical risks and limitations

We acknowledge several limitations and biases that inform our ethical approach:

  • Correlation vs. Causation: We recognize that "correlation does not imply causation." The increase in rental prices is a significant factor, but not the sole driver of enrollment fluctuations. Other variables, such as regional demographic shifts, the improvement of academic offerings in students' home regions, or personal preferences, are not fully captured in this quantitative model.
  • Risk of Regional Stigmatization: We are aware that highlighting specific regions as "vulnerable" carries a risk of stigmatization. Our goal is to use this data to advocate for systemic policy changes rather than to label specific communities.

Legal Framework dataset (self-produced)

The final dataset (D4) consists of a self-produced record of national regulations governing short-term rentals in Italy. While this dataset does not contain personal data, its construction carries specific ethical responsibilities regarding objectivity and transparency.
The dataset was compiled by aggregating official legislative texts from Normattiva, the official Italian government portal providing access to legislation. The primary ethical risk in self-producing a regulatory dataset is the potential for arbitrary selection. Deciding which laws are "relevant" can inadvertently bias the analysis toward a specific narrative. To minimize this risk and ensure methodological integrity, we performed a systematic search on Normattiva using specific keywords (“strutture ricettive”, “affitti brevi”, “locazioni brevi”). The results were manually reviewed and scraped to ensure a comprehensive overview of the regulatory evolution.
This dataset serves to investigate how legislative choices have attempted to regulate the short term rental market. It provides a necessary "human-centric" background to our analysis, shifting the focus from abstract statistics to the concrete impact of policy-making on the Right to Education and Housing.

Technical Analysis

This project is the result of several analyses produced on a curated catalog of datasets assembled for our study.
The catalog is composed of six original datasets, sourced from multiple providers, with some datasets coming from the same source and others from distinct institutions, and characterized by differences in access modality, metadata completeness, formats, and ease of retrieval. In some cases, the original data required format conversion or the construction of a coherent series, for example by annualizing values across multiple years or by reshaping tables to obtain comparable structures. One of the original datasets was created in-house by extracting and structuring information from legislative documents available on Normattiva.
Alongside these, we created six mashup datasets, obtained by combining and cross-referencing the information contained in the original sources into integrated tables; these derived datasets power all graphs, indicators, and data-driven inferences produced in the project.
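
As an illustration of the annualization mentioned above, the sketch below collapses a monthly series into annual values. Using the arithmetic mean as the aggregation rule and "YYYY-MM" strings as period keys are both assumptions of this example; the aggregation actually applied in the Knime workflow is not specified here.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Mapping


def annualise(monthly: Mapping[str, float]) -> Dict[int, float]:
    """Average the values of each calendar year; keys are 'YYYY-MM' strings."""
    by_year: Dict[int, List[float]] = defaultdict(list)
    for period, value in monthly.items():
        by_year[int(period[:4])].append(value)
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Hypothetical monthly values (not project data).
monthly_prices = {"2019-01": 10.0, "2019-02": 12.0, "2020-01": 14.0}

print(annualise(monthly_prices))
```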

Metadata model

All datasets have been assessed using the metadata model defined by Agenzia per l'Italia Digitale (AgID), which frames metadata quality as a four-level classification intended to describe how “attached” metadata are to the data they describe, and how informative they are. The model focuses on two dimensions: the strength of the data-metadata relationship (whether metadata are embedded, tightly linked, or merely external) and the level of detail provided (whether metadata describe the whole dataset only, or go deeper into its internal structure).
Most of the catalog falls under Level 2 (“weak”), because metadata are external to the data and describe the dataset as a whole rather than individual records or variables:

  • Open Data Bologna: metadata are primarily provided on the dataset’s descriptive web page, while the downloadable files act as distributions with limited embedded description.
  • Ustat MUR: metadata are similarly exposed through the dataset’s descriptive page, with dataset-level information kept separate from the data files.
  • ISTAT: metadata discovery depends on the SDMX access layer; it is necessary to retrieve the dataset identifier and then access the associated metadata through SDMX services rather than through a single descriptive landing page.
  • Our original dataset: metadata are captured in the project’s global Turtle metadata file, which serves as the external dataset-level description for the produced distributions.

A separate case is the original dataset constructed from Idealista data, which is closer to Level 1: the source is effectively restricted to being consulted only as rendered content on an HTML page and does not publish an appropriate, explicit metadata layer that can be linked to the dataset, so metadata must be reconstructed and documented ex-post by inference during collection and description.

DCAT-AP_IT framework

DCAT-AP_IT is the national metadata application profile adopted in Italy to describe catalogs, datasets, and their distributions in a uniform, machine-readable way, so that data published by different administrations can be discovered, compared, and harvested consistently across portals. It is the Italian profile of the European DCAT-AP, which is the common specification promoted at EU level to exchange public-sector dataset descriptions among data portals.
This work sits on top of the broader DCAT, the W3C Data Catalog Vocabulary: an RDF vocabulary designed to represent data catalogs on the Web by modeling core entities such as Catalog, Dataset, and Distribution, and by linking descriptions to access points, formats, licenses, themes, and responsible agents in a way that software can process automatically. Starting from this shared semantic backbone (currently standardized as DCAT v3), DCAT-AP defines a European interoperability layer by selecting and constraining how DCAT should be used for public-sector data portals, while DCAT-AP_IT further specializes that layer for the Italian context by prescribing the required classes, properties, controlled vocabularies, and modeling patterns expected by national catalogs and publication workflows.

Original datasets

We re-described all original datasets in a strictly DCAT-AP_IT-compliant form, so that they can be handled uniformly within a single corpus.
Concretely, we extracted (or, when necessary, inferred) the values for the mandatory elements (dcterms:identifier, dcterms:title, dcterms:description, dcterms:modified, dcat:theme, dcterms:rightsHolder, dcterms:accrualPeriodicity, and dcat:distribution) using the dcterms and dcat vocabularies, and we populated them consistently across the catalog.
We also implemented the required separation between dataset and distribution by modeling each distribution with its required fields (notably dcterms:format, dcterms:license, and dcat:accessURL), and we explicitly represented our own file conversions and segmentations when producing new distributions; where relevant, we captured temporal and spatial framing (e.g., dcterms:temporal and dcterms:spatial) to reflect annualized series or geographically scoped extractions.
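As an illustration of the dataset/distribution separation described above, the following minimal Python sketch checks a metadata record for the properties named in the text. It is not the project's actual tooling: the dictionary layout, the function names and the example values are all hypothetical, and a real check would operate on the RDF graph rather than on plain dictionaries.

```python
# Mandatory dataset-level properties, as listed in the paragraph above.
MANDATORY_DATASET = (
    "dcterms:identifier",
    "dcterms:title",
    "dcterms:description",
    "dcterms:modified",
    "dcat:theme",
    "dcterms:rightsHolder",
    "dcterms:accrualPeriodicity",
    "dcat:distribution",
)

# Distribution-level properties named in the text.
MANDATORY_DISTRIBUTION = (
    "dcterms:format",
    "dcterms:license",
    "dcat:accessURL",
)


def missing_properties(record, required):
    """Return the required properties that are absent or empty in a record."""
    return [prop for prop in required if not record.get(prop)]


def check_dataset(dataset):
    """Map each incomplete resource (dataset or distribution) to its gaps."""
    problems = {}
    gaps = missing_properties(dataset, MANDATORY_DATASET)
    if gaps:
        problems["dataset"] = gaps
    for index, dist in enumerate(dataset.get("dcat:distribution", [])):
        gaps = missing_properties(dist, MANDATORY_DISTRIBUTION)
        if gaps:
            problems[f"distribution[{index}]"] = gaps
    return problems


# Hypothetical record: every value below is a placeholder, not project data.
example = {
    "dcterms:identifier": "example-dataset",
    "dcterms:title": "Example dataset",
    "dcterms:description": "Placeholder description.",
    "dcterms:modified": "2024-01-01",
    "dcat:theme": "SOCI",
    "dcterms:rightsHolder": "Example holder",
    "dcterms:accrualPeriodicity": "ANNUAL",
    "dcat:distribution": [
        # The license is left out on purpose to show a reported gap.
        {"dcterms:format": "CSV", "dcat:accessURL": "https://example.org/data.csv"},
    ],
}

print(check_dataset(example))  # {'distribution[0]': ['dcterms:license']}
```

Keeping the two property lists separate mirrors the profile's distinction between the description of a dataset and the description of each file through which it is distributed.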

Mashup datasets

All mashup datasets were described consistently under DCAT-AP_IT from the outset, so that the datasets created inside the project use the same up-to-date national standard adopted for public-sector catalogs in Italy and can be interpreted in the same way as the original sources in terms of Dataset/Distribution separation, access points, formats, licenses, themes and agents. In addition, we made provenance explicit by linking each mashup dataset to its originating dataset(s) using the W3C PROV ontology, a standard vocabulary designed to represent and interchange provenance information (who or what generated a resource, from which inputs, through which transformations), so that readers can reconstruct the genealogy of each derived table.
In particular, we used prov:wasDerivedFrom to state derivation relationships, and we set ourselves as dcterms:publisher for the newly produced outputs. This cross-referencing within the metadata matters because it makes data-driven conclusions auditable, supports reuse without guesswork, and lets downstream users assess trust, scope, and limitations by immediately seeing which source(s) each mashup depends on.
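The provenance pattern described above can be sketched as a small Python helper that writes the corresponding Turtle statements. This is only an illustration under stated assumptions: the IRIs under example.org are hypothetical placeholders, the function name is ours, and the resource is typed as a plain dcat:Dataset for brevity rather than with the full set of DCAT-AP_IT properties.

```python
# Standard namespace IRIs for the three vocabularies used below.
PREFIXES = (
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
)


def derivation_turtle(mashup_iri, source_iris, publisher_iri):
    """Serialize one mashup dataset and its provenance links as Turtle."""
    if not source_iris:
        raise ValueError("a mashup needs at least one source dataset")
    # Several objects of the same predicate are separated by commas in Turtle.
    sources = ", ".join(f"<{iri}>" for iri in source_iris)
    return (
        f"{PREFIXES}\n"
        f"<{mashup_iri}> a dcat:Dataset ;\n"
        f"    dcterms:publisher <{publisher_iri}> ;\n"
        f"    prov:wasDerivedFrom {sources} .\n"
    )


# Hypothetical IRIs, used only to show the shape of the output.
print(
    derivation_turtle(
        "https://example.org/dataset/mashup-1",
        ["https://example.org/dataset/d1", "https://example.org/dataset/d2"],
        "https://example.org/agent/team",
    )
)
```

A reader following the prov:wasDerivedFrom links from a mashup reaches every source it depends on, which is what makes the derived tables auditable.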

Technical hardening

To reduce avoidable metadata drift, we added automated checks to our GitHub repository: on every push, a GitHub Action runs rapper, the command-line RDF parser and serializer from the Raptor RDF toolkit, to ensure that our Turtle files parse correctly and remain syntactically valid. RDF syntax errors are thus treated like failing tests.
This continuous syntax check is complemented by a semantic compliance check: we also submitted our metadata Turtle file to the official DCAT-AP_IT validator published by AgID, which verifies structural congruence against the profile rules. At the time of publication of the project, the metadata file returned 0 errors, giving us an external confirmation that the file is not only well-formed RDF, but also aligned with the DCAT-AP_IT constraints expected by the national guidelines.
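The syntax check described above can also be reproduced locally. The sketch below is a minimal Python wrapper around rapper and not the project's actual GitHub Action; it assumes that rapper is installed, that it exits with a non-zero status when parsing fails, and that the file name metadata.ttl is a placeholder for the real metadata file.

```python
import shutil
import subprocess


def rapper_command(path):
    """Build the rapper call: read the file as Turtle and only count triples."""
    return ["rapper", "-i", "turtle", "-c", path]


def turtle_is_valid(path):
    """Return True or False for the parse result, or None if rapper is missing."""
    if shutil.which("rapper") is None:
        return None
    result = subprocess.run(rapper_command(path), capture_output=True, text=True)
    # Assumption: a non-zero exit status signals a parse failure.
    return result.returncode == 0


if __name__ == "__main__":
    # "metadata.ttl" is a hypothetical file name.
    outcome = turtle_is_valid("metadata.ttl")
    print({None: "rapper not installed", True: "valid", False: "invalid"}[outcome])
```

Returning None when the tool is absent keeps a missing installation distinct from a genuine syntax error.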


FAIR principles

The table below uses the “FAIR Principles overview” as a checklist to evaluate our outputs against a widely used reference framework for data stewardship. The FAIR Principles, promoted by the GO FAIR Initiative, define what it means for digital assets to be Findable, Accessible, Interoperable, and Reusable, and they stress machine-actionability: the goal is not only that humans can read a dataset description, but that software agents can discover it via identifiers and registries, understand enough metadata to interpret it, and follow explicit access and reuse conditions with minimal manual mediation.
By mapping each principle to concrete cataloging choices (identifiers, access URLs, formats, licenses, provenance links, and vocabulary usage), the checklist helps make our level of compliance explicit and auditable, and highlights where the catalog behaves like a robust, standards-aligned digital object rather than a one-off collection of files.

To check (Findable):
  • F1. (Meta)data are assigned a globally unique and persistent identifier
  • F2. Data are described with rich metadata (defined by R1 below)
  • F3. Metadata clearly and explicitly include the identifier of the data they describe
  • F4. (Meta)data are registered or indexed in a searchable resource

To check (Accessible):
  • A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
  • A1.1. The communication protocol is open, free, and universally implementable
  • A1.2. The communication protocol allows for an authentication and authorisation procedure, where necessary
  • A2. Metadata are accessible, even when the data are no longer available

To check (Interoperable):
  • I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
  • I2. (Meta)data use vocabularies that follow FAIR principles
  • I3. (Meta)data include qualified references to other (meta)data

To check (Reusable):
  • R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
  • R1.1. (Meta)data are released with a clear and accessible data usage license
  • R1.2. (Meta)data are associated with detailed provenance
  • R1.3. (Meta)data meet domain-relevant community standards

Workflow

The original datasets were processed using the KNIME platform. The process included cleaning the data, handling missing values, and removing fields not needed for the analysis, as well as combining multiple data sources to create the final mashup datasets.
The image below allows navigation through the workflow, while the full KNIME workflow can be downloaded here.

KNIME workflow
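For readers who do not use KNIME, the steps listed above (dropping unneeded fields, handling missing values, combining sources) can be approximated in plain Python. The sketch below is not a translation of the actual workflow: the column names, the helper functions and the values are hypothetical placeholders chosen only to show the shape of each step, and missing values are handled here by simply discarding incomplete rows.

```python
import csv
import io


def clean_rows(rows, keep, required):
    """Keep only the needed columns and drop rows with missing required values."""
    cleaned = []
    for row in rows:
        slim = {col: (row.get(col) or "").strip() for col in keep}
        if all(slim[col] for col in required):
            cleaned.append(slim)
    return cleaned


def join_on(left, right, key):
    """Inner-join two lists of rows on a shared key column."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


# Placeholder inputs: invented values, not taken from the project's datasets.
facilities_csv = "year,facilities,notes\n2014,10,a\n2015,,b\n2016,30,c\n"
rents_csv = "year,avg_rent\n2014,1.0\n2016,3.0\n"

facilities = clean_rows(
    csv.DictReader(io.StringIO(facilities_csv)),
    keep=["year", "facilities"],
    required=["year", "facilities"],
)
rents = clean_rows(
    csv.DictReader(io.StringIO(rents_csv)),
    keep=["year", "avg_rent"],
    required=["year", "avg_rent"],
)

# The mashup keeps only the years present in both cleaned sources.
mashup = join_on(facilities, rents, "year")
print(mashup)
```

In this toy input the 2015 row is dropped because its facilities value is empty, and the notes column is discarded because it is not listed in keep.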

RDF Serialization

Below are the metadata produced for describing the datasets used and created within the project, following the DCAT-AP_IT standard and serialized in RDF/Turtle format. Please refer to the Technical Analysis section of the website for more details on the reasoning behind the chosen RDF metadata assertions and structure.

Sustainability

The sustainability of the data used in this project is evaluated based on source reliability.
The datasets (D1, D2, D3, D6) sourced from the Municipality of Bologna’s Open Data portal, ISTAT and MUR-USTAT are considered high-sustainability sources, since they are maintained under legal mandates for public statistics and administrative transparency. While these datasets do not currently use persistent identifiers (PIDs), their structured naming conventions and institutional metadata ensure they remain retrievable through official archives even in the event of URL reconfiguration.
The Idealista dataset (D5) represents a high-volatility source, since commercial newsroom URLs are highly likely to become obsolete or change. In compliance with intellectual property constraints, raw commercial datasets will be destroyed once the primary research objectives are achieved, as we do not hold the rights for long-term redistribution.
To ensure the long-term accessibility of our unique contributions, the legislation dataset (D4) and the integrated "mashup" datasets produced by the team have been assigned permanent identifiers through GitHub.
Here To Stay is the final project developed for the Open Access And Digital Ethics course (a.y. 2025/2026) within the Digital Humanities and Digital Knowledge Master's Degree (University of Bologna). As such, the project is not subject to active maintenance or future updates beyond the scope of its initial submission.

Conclusions

Touristification has been at the center of mainstream debate for more than a decade, affecting not only Italy but several other European countries. The growth of short-term rental platforms has intensified pressure on housing markets, raising living costs and straining urban coexistence. The slogan “tourists go home”, which emerged from citizen protests in Barcelona, has become a symbol of the urban malaise felt by those who inhabit the city. To date, unlike cities such as Barcelona, which announced a ban on short-term rentals, Italy has relied on local regulations.
According to a recent analysis by Demoskopika on overtourism, Bologna currently faces a “moderate risk”, yet the consequences are already strikingly visible. In this context, the Municipality of Bologna attempted in December 2025 to introduce stricter rules for tourist rentals in the historic center, but these were overturned by the Council of State due to their impact on established economic interests. While datasets, as stated throughout our analysis, only partially reflect the human cost of the crisis, student protests and the growing difficulty in finding housing highlight a deeper issue: the housing question has become a factor of social exclusion and a violation of the residents’ Right to the City.
Although Airbnb is not the sole driver of this crisis, Bologna’s housing emergency is an undeniable reality, confirmed by the city's own strategic responses. In 2023, the Municipality launched the Piano per l’Abitare (“Plan for Housing”), a €200 million investment to create 3,000 housing units for marginalized groups, students and those seeking below-market rents.
We’re all aware that without structural intervention and a rethink of urban space as a public good, cities like Bologna face the loss of their most vital asset: the social and cultural diversity brought by the student population. By investigating this phenomenon through data, our analysis aimed to sharpen the focus on this crisis. We firmly believe that, beyond public debate, rigorous data analysis and visualization can stimulate new perspectives and ensure this issue remains at the forefront of the political agenda.

The Team

Tommaso Barbato

Nicol D'Amelio

Miriana Pinto

Sara Roggiani

Licenses and Credits

Images and icons
Software
Web template
  • This website is built on the HTML5 template Scaffold