Between living and leaving: the right to inhabit the city
The Project
Here To Stay is a Linked Open Data project reflecting on the phenomenon of touristification in the city of Bologna. In this context, the limited availability of housing is a well-known issue for students and, more broadly, for families. The aim of the project is to explore the relationship between these two dynamics and to reflect on the social impacts of the increasing flow of visitors in the city, producing informed conclusions and graphical visualizations to showcase the data analysis.
Tourism in Italy was regulated through a comprehensive codification for the first time in 2011. Before then, the sector relied on fragmented, independent laws and lacked unified legislation.
The 2011 Tourism Code consolidated national tourism regulations, defining tourism activities, accommodation types, market operators, and travelers’ rights within a unified legal framework.
Among the types of accommodation are also short-term accommodation rentals, which were then categorized as 'residential leases for tourist purposes' governed by the Civil Code. By treating these units as private leases rather than professional tourist facilities, the law created a regulatory gap that favored the rapid conversion of residential housing into short-term tourist accommodation, directly impacting the dynamics of the local housing market.
The datasets we chose to make part of our project cover the time span between 2014 and 2024, a period which we considered to be long enough to evaluate evolutions in the number of accommodation facilities in the city of Bologna and sufficiently distant from the entry into force of the Tourism Code to be able to visualize its long-term effects.
The hypothesis at the core of our project is that the increase in the number of accommodation facilities in the city centre and outer zones has reduced the housing supply in a market characterized by constant demand, thereby driving the upward trend in rental prices. The datasets included in our study were selected to verify this hypothesis, to evaluate the effects of subsequent legislation introduced within the analyzed time span, and then to focus on the impact on the student population.
Datasets
The six original datasets we selected come from different sources and have been published under different licenses. Some of them came in a format that needed conversion before the analysis could be conducted, so for these we indicate both the original and the converted format. Data cleaning and minimization were carried out using the Knime platform.
The six mashup datasets were derived from the analysis and work conducted on the six source ones using Knime as well. Visualizations have been produced to provide graphical evidence for the answers to our research questions.
D1 - Accommodation Facilities in Bologna
List of establishments belonging to the category of accommodation activities
The team conducted the quality analysis of the original datasets adhering to the Linee guida nazionali per la valorizzazione del patrimonio informativo pubblico (AgID). This framework evaluates data quality based on four key dimensions:
Accuracy measures the degree to which data correctly represents the real-world values it intends to model. It is calculated as the ratio of correct values to the total values requiring validation. The guidelines distinguish between syntactic accuracy (conformity to a defined format or structure) and semantic accuracy (factual correctness regarding the real-world entity).
Coherence ensures data is free from internal contradictions and adheres to defined semantic rules (e.g. verifying that aggregated totals match the sum of disaggregated components). The metric is defined as the ratio of attributes satisfying logic rules to the total attributes subject to those rules.
Completeness assesses the presence of values for all the expected attributes. This dimension is evaluated through three specific measures:
Schema completeness: the percentage of null values relative to the total number of expected values;
Record completeness: the ratio of non-null data elements within a single record compared to the total number of fields where completeness is applicable for that record;
Population Completeness: the percentage of missing values relative to a specific reference population.
Timeliness evaluates how up-to-date the dataset is relative to its usage context.
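The ratio-based measures above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the Knime workflow used by the team: the sample records, the field names and the `is_valid_year` rule are hypothetical.

```python
from typing import Callable, Optional

Record = dict[str, Optional[str]]


def record_completeness(record: Record) -> float:
    """Share of non-null fields within a single record."""
    if not record:
        return 0.0
    filled = sum(1 for value in record.values() if value not in (None, ""))
    return filled / len(record)


def null_share(records: list[Record], fields: list[str]) -> float:
    """Share of null values over all expected values (records x fields)."""
    expected = len(records) * len(fields)
    if expected == 0:
        return 0.0
    nulls = sum(1 for r in records for f in fields if r.get(f) in (None, ""))
    return nulls / expected


def syntactic_accuracy(values: list[str], is_valid: Callable[[str], bool]) -> float:
    """Ratio of values that satisfy a format rule to all values checked."""
    if not values:
        return 0.0
    return sum(1 for v in values if is_valid(v)) / len(values)


def is_valid_year(value: str) -> bool:
    # Hypothetical format rule: a year is written as four digits.
    return value.isdigit() and len(value) == 4


# Hypothetical sample records, not taken from the project datasets.
records = [
    {"name": "HOTEL A", "type": "ALBERGO", "year": "2019"},
    {"name": "B&B B", "type": None, "year": "20x1"},
]

per_record = [record_completeness(r) for r in records]
nulls = null_share(records, ["name", "type", "year"])
year_accuracy = syntactic_accuracy([r["year"] for r in records], is_valid_year)
```

Semantic accuracy and population completeness cannot be computed this way, since both require an external reference (the real-world entity or a reference population) to compare against.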
Accuracy
Syntactic accuracy: the dataset demonstrates strong structural consistency, with key text fields adhering to uniform formatting conventions (e.g., standardized uppercase text). Additionally, the addresses are validated against the official municipal toponymy (SUAP system), which ensures that these elements are syntactically valid and minimizes geocoding failures.
Semantic accuracy: certain classification fields may contain generic placeholder values (such as “non definite”). While syntactically correct, these values represent a semantic gap, as the records lack descriptive specificity regarding the type of facility. The overall internal accuracy is solid. The dataset reliably represents the administrative reality of the authorized facilities.
Coherence
The dataset demonstrates high internal coherence. The classification of business types utilizes a controlled vocabulary with no free-text contradictions or logical conflict between columns.
Completeness
Schema and Record completeness: the schema is fully defined, containing all core attributes for identification. While mandatory fields are densely populated, we observed a significant percentage of null values in secondary columns, reducing the effective completeness of the records.
Population completeness: the dataset provides a complete enumeration of the authorized facilities, covering the entire Municipality of Bologna. Notably, the population includes both Active and Ceased entities.
Timeliness
We detected a discrepancy between the declared metadata and the system logs. While the metadata indicates an “Annual” update frequency, the “Last Processing” date suggests the dataset is re-processed automatically on a much more frequent basis (likely daily or weekly). Given the recent processing date, the dataset offers high currency, reflecting the real-time status of the administrative database despite the conservative “Annual” label.
Accuracy
Syntactic accuracy: being an official statistical product, the data adheres to strict Istat validation standards. Categorical variables follow standardized codes, and numerical fields are strictly typed, ensuring seamless integration.
Semantic accuracy: the count of traditional facilities is highly accurate. However, regarding the extra-alberghiero sector, semantic accuracy relies on administrative declarations provided by local bodies. While the data correctly represents the registered reality, it may suffer from a semantic gap due to delayed registrations or omitted declarations by hosts.
Coherence
The dataset respects all defined semantic and mathematical rules. We verified that aggregated totals strictly match the sum of their disaggregated components.
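A check of this kind can be sketched as follows. The field names and figures are hypothetical and only illustrate the rule "total equals the sum of its components"; here the ratio is computed over rows for simplicity.

```python
def find_incoherent_rows(rows, component_fields, total_field):
    """Return the rows whose declared total differs from the sum of its components."""
    return [
        row for row in rows
        if sum(row[field] for field in component_fields) != row[total_field]
    ]


# Hypothetical yearly counts; the second row is deliberately inconsistent.
capacity = [
    {"year": 2022, "hotel": 95, "non_hotel": 1200, "total": 1295},
    {"year": 2023, "hotel": 97, "non_hotel": 1350, "total": 1450},
]

incoherent = find_incoherent_rows(capacity, ["hotel", "non_hotel"], "total")
# Share of rows that satisfy the additive rule.
coherence_ratio = 1 - len(incoherent) / len(capacity)
```

An empty `incoherent` list corresponds to the situation described above, where every aggregate matches its components.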
Completeness
Schema and Record Completeness: all expected attributes required for the analysis are present and fully populated for the Municipality of Bologna. There are no missing values in the core metrics.
Population Completeness: while the dataset covers 100% of the legal/administrative population (registered businesses), it exhibits a significant population gap regarding the «real» tourism phenomenon. Istat datasets primarily capture registered facilities, suffering from under-reporting of the informal market – specifically short-term rentals that operate as private leases rather than registered businesses (Case Vacanze). Therefore, this dataset must be interpreted as representing legal capacity, which is a subset of real capacity.
Timeliness
The dataset has a known Annual frequency, but is subject to a significant publication lag (typically 12-18 months) due to the validation process required for official statistics.
While precise, the timeliness is insufficient for real-time monitoring of the housing emergency. The analysis must acknowledge that the most recent wave of conversions from residential to tourist rentals (occurring in the last 12 months) may not yet be captured in this specific dataset.
Accuracy
Syntactic accuracy: the dataset adheres to strict Istat validation standards. Categorical variables follow standardized statistical codes, ensuring zero formatting errors.
Semantic accuracy: for the traditional sector (Esercizi alberghieri), accuracy is high, as reporting is strictly enforced and structurally aligned with Public Security registration obligations (Alloggiati Web), ensuring near-census coverage of the official flows. For the extra-alberghiero sector, semantic accuracy is vulnerable to under-reporting. Since flow data relies on hosts declaring the number of guests, there is a financial incentive for less professional operators to under-report the actual number of nights spent. Consequently, the data represents the declared flows rather than the absolute real ones.
Coherence
We verified the mathematical consistency of the dataset, ensuring that aggregate figures strictly match the sum of their disaggregated components, and that the logical constraints between related variables were respected across all records.
The classification of accommodation types is consistent with the Capacity dataset (D2), allowing for the calculation of derived indicators without classification conflicts.
Completeness
Schema and record completeness: the dataset provides a complete matrix of attributes, including client residence and type of accommodation. There are no missing values for the Municipality of Bologna in the aggregate totals.
Population completeness: while the dataset covers 100% of flows in registered facilities, it suffers from the same population gap identified in the Capacity dataset (D2). It completely excludes the «informal tourism» sector. Therefore, these figures should be interpreted as Official Tourist Demand, which is significantly lower than the total City Users pressure on the territory.
Timeliness
The complexity of collecting and validating flow data (which changes daily) results in a «publication lag» of 12–18 months. Data is typically aggregated annually in the main extracts. If the project requires analysing seasonality, the user must rely on monthly disaggregations, which are often released with a further delay compared to annual totals.
Accuracy
Syntactic accuracy: the text was retrieved directly from Normattiva, the only database legally guaranteeing the reliability of the Italian regulatory corpus. This ensures zero transcription errors compared to secondary sources.
Semantic accuracy: the semantic accuracy in a legal context is defined by Vigenza. We verified that the texts selected correspond to the version currently in force (vigente) at the time of extraction (January 2026). We recorded the URI for each record, ensuring a permanent and unambiguous reference to the official legal act.
Coherence
The dataset follows a consistent, manually defined schema, with all records adhering to this structure without missing data. The dataset demonstrates high internal coherence, as the selected laws form a unified legislative framework specifically dedicated to the regulation of short-term rentals (affitti brevi). We verified that the selected acts are complementary, creating a logical continuum from general rules to specific administrative obligations, without contradictions in their application scope.
Completeness
Schema and record completeness: every record in the dataset contains all necessary metadata fields required for the qualitative analysis.
Population completeness: unlike statistical datasets where 100% population coverage is the goal, this dataset relies on targeted selection. The population is intentionally partial, defined as the subset of regulations impacting the touristification and housing scenario in Bologna.
Timeliness
The dataset is defined by its temporal validity relative to the project timeline. It represents a legislative snapshot aligned with the analysis window. While the dataset is static and will not be dynamically updated after the project’s conclusion, its currency is high for the scope of the research. It reflects the regulatory framework active at the moment of the analysis, serving as a baseline for interpreting the data.
Accuracy
Syntactic accuracy: the data is generated automatically by the portal’s algorithms without manual data entry or transcription phases. This ensures a syntactic error rate close to zero.
Semantic accuracy: the dataset exhibits a known semantic bias, since it represents the «Prezzo Richiesto» (the asking price in the listing) rather than the «Prezzo di Transazione» (the actual amount agreed). In a high-demand market like Bologna, the «Prezzo Richiesto» serves as a strong proxy for market trends, but it implies a systematic exclusion of the negotiation margin. As stated in the methodology, the indices are based on active listings (offerta attiva), reflecting seller expectations rather than finalized economic exchanges.
Coherence
The time series is internally consistent, calculating averages based on a standardized definition of Euro per square meter (€/m²). We identified a discontinuity point in the time series in March 2019, when Idealista updated its calculation methodology to improve sample reliability. Consequently, comparisons crossing this date should be interpreted with caution.
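One way to respect this caution is to compute changes only between observations that fall on the same side of the break. The sketch below is a hypothetical illustration: the prices are invented, and treating the March 2019 value as already belonging to the new methodology is an assumption.

```python
# (year, month) of the methodology change; March 2019 is assumed to be
# the first observation computed with the new method.
BREAK = (2019, 3)


def segment(period):
    """Label a (year, month) period as before or after the methodology change."""
    return "pre" if period < BREAK else "post"


def within_segment_changes(series):
    """Month-over-month percentage changes, skipping pairs that straddle the break."""
    changes = {}
    ordered = sorted(series.items())
    for (prev_period, prev_value), (cur_period, cur_value) in zip(ordered, ordered[1:]):
        if segment(prev_period) != segment(cur_period):
            continue  # not comparable: values come from different methodologies
        changes[cur_period] = (cur_value - prev_value) / prev_value * 100
    return changes


# Hypothetical asking prices in euro per square meter.
prices = {(2019, 1): 10.0, (2019, 2): 10.5, (2019, 3): 12.0, (2019, 4): 12.6}
changes = within_segment_changes(prices)
```

In this example the February-to-March change is dropped, so an apparent jump caused by the new methodology is not read as a market movement.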
Completeness
Population completeness: the dataset covers exclusively the supply listed on the Idealista marketplace. This introduces a single-source bias, as it excludes competing supply present only on other portals and the private leases concluded via informal channels.
We extracted data for both the Centro Storico and external zones to perform a comparative center-periphery analysis. This granularity is essential for measuring how price pressures in the tourist-heavy core propagate to outer zones.
Timeliness
The dataset offers high currency (monthly updates). While official statistics (Istat) often lag by 12-18 months, Idealista reflects market changes in real-time. This reactivity allows for the detection of immediate market shocks that structural data would miss.
Accuracy
Syntactic accuracy: data originates directly from the Anagrafe nazionale degli studenti e dei laureati (ANS), the administrative server used to track academic careers. University codes (Codici Ateneo) and degree course codes adhere to strict ministerial standards, ensuring zero syntactic errors.
Semantic accuracy: unlike survey-based estimations, these figures correspond to actual tuition fees paid and legally registered academic careers.
Coherence
The dataset respects strict additive rules. We verified that the total number of enrolled students equals the sum of disaggregated components. The use of standardized Ministerial Codes allows for a seamless mash-up with other institutional databases, facilitating multidimensional analysis without join errors.
Completeness
Schema and record completeness: the dataset provides a complete matrix of attributes with no missing values for the University of Bologna, with all critical segmentation variables fully populated.
Population completeness: the dataset covers 100% of the recognized university population in Italy. The dataset includes the Regione di residenza attribute. This is the vital component for our analysis, as it allows us to calculate the Student-Housing Mismatch. We can filter out local residents to isolate the specific subset of fuorisede students who actually generate demand on the rental market.
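The filtering step can be sketched as below. This is a simplified illustration: the field names and figures are hypothetical, and treating every student residing outside Emilia-Romagna as fuorisede is an assumption made for the example.

```python
# Assumption: "local" means resident in the region where Bologna is located.
HOME_REGION = "EMILIA-ROMAGNA"


def off_site_rows(rows):
    """Keep only the rows for students residing outside the home region."""
    return [row for row in rows if row["region"].strip().upper() != HOME_REGION]


# Hypothetical enrollment counts by region of residence.
enrollments = [
    {"region": "Emilia-Romagna", "enrolled": 40000},
    {"region": "Puglia", "enrolled": 6000},
    {"region": "Sicilia", "enrolled": 4000},
]

off_site = off_site_rows(enrollments)
off_site_total = sum(row["enrolled"] for row in off_site)
off_site_share = off_site_total / sum(row["enrolled"] for row in enrollments)
```

A regional filter of this kind is coarse: it cannot separate students who commute from elsewhere in the home region from those who live in the city.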
Timeliness
The update frequency is annual (Academic Year cycle). The dataset is current up to the 2024/2025 academic year (provisional), providing an up-to-date baseline for statistical analysis.
Legal Analysis
The legal analysis of the source datasets used is necessary to ensure the long-term sustainability of the data production and publication lifecycle and to be able to create a service balanced between the public interest of data and the rights of the data subjects. Hence, the focus of this assessment is identifying specific usage limitations, jurisdictional purposes, intellectual property rights and licensing terms.
The analysis of the datasets has been conducted on the basis of a checklist provided to Public Administrations for the Open Data release, organized in the following sections: privacy issues, Intellectual Property Rights (IPR), licenses, limitations on public access, economic conditions and temporal aspects.
However, since the current project does not operate as a formal Public Administration, the text of the original questions was adapted to reflect an observer’s perspective. Questions exclusively pertaining to internal PA administrative procedures were marked as 'not applicable'.
Regarding Dataset D4, compiled by the team for this project from the legislation available on the Normattiva Open Data portal, the analysis was conducted with a dual focus: the dataset itself was assessed as a proprietary derivative, while questions regarding the platform infrastructure were answered in reference to the Normattiva portal, since the current webpage is only a report of the analysis conducted on the basis of open data and cannot be considered an open data portal. The ‘not applicable’ answers in this regard are justified by the fact that the dataset is not published on an open data portal and is therefore neither updated nor indexed like typical open datasets.
To check: | D1 | D2 | D3 | D4 | D5 | D6
Is the dataset free of any personal data as defined in the Regulation (EU) 2016/679?
Is the dataset free of any indirect personal data that could be used for identifying the natural person?
Is the dataset free of any particular personal data (art. 9 GDPR)?
Is the dataset free of any information that combined with common data available in the web, could identify the person?
Is the dataset free of any information related to human rights (e.g. refugees, witness protection, etc.)?
Was the risk of de-anonymization of the dataset taken in consideration before publishing?
If geolocalization capabilities are being used, is it sure that the geolocalization process can’t identify single individuals in some circumstances?
Does the open data platform respect all the privacy regulations (registration of the end-user, profiling, cookies, analytics, etc.)?
Is it clear who, in the open data platform, are the Controller and Processor of the privacy data of the system?
Have you checked the privacy regulation of the country where the dataset is physically stored?
Does the dataset have non-personal data?
Are you sure they are not “mixed data”?
To check: | D1 | D2 | D3 | D4 | D5 | D6
Have you created and generated the dataset?
Are you the owner of the dataset?
Are you sure not to use third party data without the proper authorization and license?
Have you checked if there are any limitations in your national legal system for releasing some kind of datasets with open license?
To check: | D1 | D2 | D3 | D4 | D5 | D6
Has the dataset been released with an open data license?
Does it include the clause: "In any case the dataset can’t be used for re-identifying the person"?
Has the API (in case it exists) been released with an open source license?
Is the open data/API platform license regime compliant with the IPR policy of the dataset?
To check: | D1 | D2 | D3 | D4 | D5 | D6
Do you check that the dataset concerns your institutional competences, scope and finality? Do you check if the dataset concerns other public administration competences? | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable
Are you sure the dataset respects the limitations for the publication stated by your national legislation or by the EU directives?
Are you sure there aren’t some limitations connected to the international relations, public security or national defense?
Are you sure there aren’t some limitations concerning the public interest?
Are the international law limitations respected?
Are the INSPIRE law limitations for the spatial data respected?
Not applicable
Not applicable
Not applicable
Not applicable
Not applicable
To check: | D1 | D2 | D3 | D4 | D5 | D6
Do you check that the dataset could be released for free?
Do you check if there are some agreements with some other partners in order to release the dataset with a reasonable price? | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable
Do the open data platform terms of service include a clause of “non liability agreement” regarding the dataset and API provided?
In case you decide to release the dataset to a reasonable price do you check if the limitations imposed by the 2019/1024/EU directive are respected? | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable
In case you decide to release the dataset to a reasonable price did you check the e-Commerce directive and regulation? | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable
To check: | D1 | D2 | D3 | D4 | D5 | D6
Is there a temporary policy for updating the dataset?
Not applicable
Is there some mechanism for informing the end-user that the dataset is updated at a given time to avoid mis-usage and so potential risk of damage?
Not applicable
Did you check if the dataset for some reason can’t be indexed by the search engines (e.g. Google, Yahoo, etc.)?
Not applicable
In case of personal data, is there a reasonable technical mechanism for collecting requests of deletion (e.g. right to be forgotten)?
Publication license
After the preliminary legal compliance check, we assessed the individual licenses of the source datasets to ensure legal interoperability and determine the most suitable license for the resulting mashups. Using the Odalische portal to verify license compatibility, we selected the Creative Commons Attribution 4.0 (CC BY 4.0) license for our newly created datasets. This choice aligns with AgID recommendations, as it promotes an open ecosystem that encourages data reuse while mandating proper attribution. The following table illustrates the compatibility between the original source licenses and the CC BY 4.0 license chosen for this project.
ID | Mashup Dataset | Original Datasets | Original licenses | Final license
MD1 | Istat VS Open Data Accommodations | D1, D2 | CC-BY 4.0, CC-BY 4.0 | CC-BY 4.0
MD2 | Tourism Carrying Capacity | D2, D3 | CC-BY 4.0, CC-BY 4.0 | CC-BY 4.0
MD3 | Legislative Impact | D1, D2, D4 | CC-BY 4.0, CC-BY 4.0, CC-BY 4.0 | CC-BY 4.0
MD4 | Price VS Facilities Growth Rate | D1, D5 | CC-BY 4.0, Proprietary* | CC-BY 4.0
MD5 | Regional Enrollments vs. Rent Prices | D5, D6 | Proprietary*, Public Domain | CC-BY 4.0
MD6 | Regional Sensitivity Index | D5, D6 | Proprietary*, Public Domain | CC-BY 4.0
* Our study integrates a dataset sourced from the real estate portal Idealista with various open data sources. While Idealista's data is protected by intellectual property and 'sui generis' database rights, its use here aligns with the Text and Data Mining (TDM) exception for scientific and academic research (Art. 3, EU Directive 2019/790 and Art. 70-ter, Italian Law 633/1941).
The raw data was accessed through public channels and processed exclusively for non-commercial academic purposes to generate aggregated statistical derivatives, such as rental price averages. Because these outputs are transformative and do not reproduce a substantial part of the original dataset, they do not interfere with the provider's commercial interests. Consequently, the final mashup datasets derived from this one are released under a CC BY 4.0 license; this choice ensures the legal interoperability of the new research product, maintaining transparency through full attribution of all primary sources. Finally, all raw proprietary data will be permanently removed from the project repository after the project presentation.
Ethical Analysis
The ethical framework of this project is structured in accordance with the European Commission’s Guidelines on Ethics and Data Protection. To facilitate a rigorous assessment, we integrated the Principles of Data Ethics developed by DataEthics.eu alongside the Open Data Institute’s (ODI) Data Ethics Canvas.
Principles of Data Ethics
This project investigates the phenomenon of "touristification" (tourism-led gentrification) in Bologna and its impact on housing availability and university enrollment. At the core of this analysis lie the Right to the City and the Right to Education. The primary objective is to generate public value for residents and the student community rather than serving commercial or tourism-industry interests. By providing a data-driven interpretative framework, the study empowers citizens to visualize and understand the socio-economic forces reshaping their urban environment.
The study relies on the secondary use of administrative and statistical datasets, previously collected for other purposes. In accordance with GDPR principles, specifically those concerning processing for scientific and statistical purposes, we have implemented data minimisation protocols. Although direct involvement of data subjects was not possible due to the nature of the sources, we ensured that only the necessary information was processed. All findings are presented in an aggregated format, preventing any direct impact on the self-determination of the individuals represented in the data.
All data sources (Bologna Open Data, ISTAT, MUR-USTAT and Idealista) are publicly accessible and explicitly cited. All transformations, filtering methods and indicators (e.g., growth rates and correlations) are documented in a reproducible workflow. This documentation discloses the methodological limitations and potential biases inherent in both the original and integrated datasets.
Following the principle of Data Protection by Design, we selected datasets that were pre-anonymised at the source and limited the amount and granularity of data processed to what was strictly necessary for our research questions. For private-sector data (Idealista), we ensured that the extraction and usage remained within a clearly documented academic‑research purpose, avoiding any infringement of proprietary rights or individual privacy. Internal responsibility was maintained for every stage of data cleaning and aggregation to prevent the introduction of errors or misinterpretations. These choices are documented so that the research team can be held accountable for how data were selected, processed and interpreted.
Our research investigates "touristification" as a socio-economic phenomenon affecting the real estate market and the fundamental rights of citizens, with a specific focus on university students and vulnerable groups. In alignment with the principle of Equality, this analysis explicitly seeks to identify potential discrimination or stigmatization based on financial and social conditions. By crossing regional enrollment trends with rental price indices, the analysis seeks to identify patterns of "educational exclusion." This approach highlights how specific demographic or geographic groups may be disproportionately affected by the rising cost of living, thereby fulfilling the ethical mandate to protect vulnerable populations from systemic discrimination.
Analysis of Data Sources and Dataset-Specific Ethical Risks
To ensure a comprehensive ethical assessment, and given that the project integrates datasets from diverse origins, we analyzed each source's provenance and potential ethical risks.
Open Data - Comune di Bologna
The dataset regarding Tourist Accommodation Businesses (D1) was sourced from the Municipality of Bologna’s Open Data portal. This platform operates under European and national open data directives, ensuring that data are pre-anonymised and compliant with legislative principles.
The primary ethical risk identified was the presence of granular geospatial data (addresses and house numbers). While geospatial information does not fall under the category of personal data, it poses a risk of indirect de-anonymisation when cross-referenced with other datasets. Since many extra-hotel facilities are operated by private individuals at their place of residence, publishing such details could compromise their privacy. Furthermore, data concerning "ceased" activities could inadvertently disclose an individual’s economic failure, potentially impacting their human dignity. To mitigate these risks, we applied the principle of data minimisation, removing all specific geographic identifiers as they were not essential for the macro-level purposes of our analysis.
ISTAT
Tourism statistics (D2, D3) were derived from ISTAT, which operates under a rigorous ethical framework of independence and confidentiality aligned with the European Statistics Code of Practice. The team identified representation bias as the most significant ethical concern. ISTAT datasets on tourist accommodations may suffer from "under-reporting," as they primarily capture registered facilities, potentially overlooking the informal or "grey" short-term rental market (the sommerso). This discrepancy is evident when comparing ISTAT data with those published by the Municipality of Bologna as Open Data, which provide more granular tracking of facility life cycles. To mitigate this bias of omission, we cross-referenced both sources to provide a more accurate and ethically responsible assessment of urban tourist pressure, ensuring transparency in our reporting.
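A cross-reference of this kind can be sketched as a per-year comparison of facility counts from the two sources. The figures below are invented for illustration and do not come from either dataset; the function name is hypothetical.

```python
def coverage_gap(municipal, istat):
    """Compare yearly facility counts for the years present in both sources."""
    gaps = {}
    for year in sorted(municipal.keys() & istat.keys()):
        difference = municipal[year] - istat[year]
        # The ratio is undefined when the reference count is zero.
        ratio = municipal[year] / istat[year] if istat[year] else None
        gaps[year] = {"difference": difference, "ratio": ratio}
    return gaps


# Hypothetical counts of accommodation facilities per year.
municipal_counts = {2022: 1500, 2023: 1800}
istat_counts = {2022: 1200, 2023: 1500, 2024: 1600}

gaps = coverage_gap(municipal_counts, istat_counts)
```

Years covered by only one source are left out rather than imputed, so a publication lag in one dataset does not appear as a spurious gap.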
Idealista
The inclusion of market data from Idealista (D5) represents the most significant ethical challenge of this project. Unlike institutional sources, Idealista is a commercial entity whose data is generated for profit-making purposes. Our secondary use is justified by the educational nature of this project. To address the lack of explicit consent from data subjects, we process the data exclusively in its most aggregated form to prevent any direct or indirect identification.
Furthermore, we commit to deleting the raw data from our public platforms following the academic assessment, retaining only the visualizations derived from its integration with other datasets. Throughout the project, the source has been transparently cited to avoid any misattribution.
Methodologically, we disclose a lack of transparency regarding the platform's proprietary algorithms; thus, the dataset is defined as a proxy for “asking prices” rather than finalized contracts. Ethically, it is used here only to represent the "market reality" faced by students and families.
Finally, we acknowledge a structural selection bias: the dataset reflects a "digital-only" marketplace. It does not account for the entire housing market, leaving out segments such as informal rental networks or historical long-term contracts that do not pass through digital platforms.
MUR-USTAT
Student enrollment data (D6) were collected from the MUR-USTAT Open Data platform. This dataset contains sensitive information concerning the residence and citizenship of university students, so the ethical aspects of its handling have to be carefully considered. Our dataset contained information about the number of students by province of residence, province of study, university and disciplinary group. While the data arrived anonymised, we identified a risk of indirect re-identification through the combination of variables. For instance, crossing a specific province of residence with a niche "disciplinary group" could isolate individual student trajectories. In accordance with EU Ethics and Data Protection guidance, we mitigated this by clustering students at the regional level and deleting non-essential granular information, such as specific disciplinary codes, thereby upholding the principle of privacy by design.
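The mitigation described above amounts to aggregating counts to a coarser level and dropping the fine-grained attributes. A minimal Python sketch, assuming hypothetical field names and invented counts (only the province-to-region assignments are real):

```python
from collections import defaultdict

# Partial lookup for illustration; a full mapping would cover every province.
PROVINCE_TO_REGION = {"LECCE": "PUGLIA", "BARI": "PUGLIA", "PALERMO": "SICILIA"}


def aggregate_by_region(rows):
    """Sum student counts per (academic year, region), discarding province and group."""
    totals = defaultdict(int)
    for row in rows:
        region = PROVINCE_TO_REGION[row["province"]]
        totals[(row["year"], region)] += row["students"]
    return dict(totals)


# Hypothetical rows: small cells such as a single student in a niche group
# are the ones that carry the re-identification risk.
rows = [
    {"year": "2023/2024", "province": "LECCE", "group": "A", "students": 1},
    {"year": "2023/2024", "province": "BARI", "group": "B", "students": 14},
    {"year": "2023/2024", "province": "PALERMO", "group": "A", "students": 7},
]

regional = aggregate_by_region(rows)
```

After aggregation the single-student cell for Lecce is absorbed into the regional total, and the disciplinary group no longer appears in the output.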
Ethical risks and limitations
We acknowledge several limitations and biases that inform our ethical approach:
Correlation vs. Causation: We recognize that "correlation does not imply causation." The increase in rental prices is a significant factor, but not the sole driver of enrollment fluctuations. Other variables, such as regional demographic shifts, the improvement of academic offerings in students' home regions, or personal preferences, are not fully captured in this quantitative model.
Risk of Regional Stigmatization: We are aware that highlighting specific regions as "vulnerable" carries a risk of stigmatization. Our goal is to use this data to advocate for systemic policy changes rather than to label specific communities.
Legal Framework dataset (self-produced)
The final dataset (D4) consists of a self-produced record of national regulations governing short-term rentals in Italy. While this dataset does not contain personal data, its construction carries specific ethical responsibilities regarding objectivity and transparency.
The dataset was compiled by aggregating official legislative texts from Normattiva, the official Italian government portal providing access to legislation. The primary ethical risk in self-producing a regulatory dataset is the potential for arbitrary selection. Deciding which laws are "relevant" can inadvertently bias the analysis toward a specific narrative. To minimize this risk and ensure methodological integrity, we performed a systematic search on Normattiva using specific keywords (“strutture ricettive”, “affitti brevi”, “locazioni brevi”). The results were manually reviewed and scraped to ensure a comprehensive overview of the regulatory evolution.
This dataset serves to investigate how legislative choices have attempted to regulate the short-term rental market. It provides a necessary "human-centric" background to our analysis, shifting the focus from abstract statistics to the concrete impact of policy-making on the Right to Education and Housing.
Technical Analysis
This project is the result of several analyses produced on a curated catalog of datasets assembled for our study.
The catalog is composed of six original datasets, sourced from multiple providers, with some datasets coming from the same source and others from distinct institutions, and characterized by differences in access modality, metadata completeness, formats, and ease of retrieval. In some cases, the original data required format conversion or the construction of a coherent series, for example by annualizing values across multiple years or by reshaping tables to obtain comparable structures. One of the original datasets was created in-house by extracting and structuring information from legislative documents available on Normattiva.
Alongside these, we created six mashup datasets, obtained by combining and cross-referencing the information contained in the original sources into integrated tables; these derived datasets power all graphs, indicators, and data-driven inferences produced in the project.
Title
Here To Stay - OADE Datasets Catalog
Description
A curated collection of metadata about the datasets for the project Here To Stay.
All datasets have been assessed using the metadata model defined by Agenzia per l'Italia Digitale (AgID), which frames metadata quality as a four-level classification intended to describe how “attached” metadata are to the data they describe, and how informative they are. The model focuses on two dimensions: the strength of the data-metadata relationship (whether metadata are embedded, tightly linked, or merely external) and the level of detail provided (whether metadata describe the whole dataset only, or go deeper into its internal structure).
Most of the catalog falls under Level 2 (“weak”), because metadata are external to the data and describe the dataset as a whole rather than individual records or variables:
Open Data Bologna: metadata are primarily provided on the dataset’s descriptive web page, while the downloadable files act as distributions with limited embedded description.
Ustat MUR: metadata are similarly exposed through the dataset’s descriptive page, with dataset-level information kept separate from the data files.
ISTAT: metadata discovery depends on the SDMX access layer; it is necessary to retrieve the dataset identifier and then access the associated metadata through SDMX services rather than through a single descriptive landing page.
Our original dataset: metadata are captured in the project’s global Turtle metadata file, which serves as the external dataset-level description for the produced distributions.
A separate case is the original dataset constructed from Idealista data, which is closer to Level 1: the source is effectively restricted to being consulted only as rendered content on an HTML page and does not publish an appropriate, explicit metadata layer that can be linked to the dataset, so metadata must be reconstructed and documented ex-post by inference during collection and description.
DCAT-AP_IT framework
DCAT-AP_IT is the national metadata application profile adopted in Italy to describe catalogs, datasets, and their distributions in a uniform, machine-readable way, so that data published by different administrations can be discovered, compared, and harvested consistently across portals. It is the Italian profile of the European DCAT-AP, which is the common specification promoted at EU level to exchange public-sector dataset descriptions among data portals.
This work sits on top of the broader DCAT, the W3C Data Catalog Vocabulary: an RDF vocabulary designed to represent data catalogs on the Web by modeling core entities such as Catalog, Dataset, and Distribution, and by linking descriptions to access points, formats, licenses, themes, and responsible agents in a way that software can process automatically. Starting from this shared semantic backbone (currently standardized as DCAT v3), DCAT-AP defines a European interoperability layer by selecting and constraining how DCAT should be used for public-sector data portals, while DCAT-AP_IT further specializes that layer for the Italian context by prescribing the required classes, properties, controlled vocabularies, and modeling patterns expected by national catalogs and publication workflows.
Original datasets
All original datasets were re-described by us in a strictly DCAT-AP_IT-compliant form, so that they can be handled as a single, uniformly described corpus.
Concretely, we extracted (or, when necessary, inferred) the values for the mandatory elements (dcterms:identifier, dcterms:title, dcterms:description, dcterms:modified, dcat:theme, dcterms:rightsHolder, dcterms:accrualPeriodicity, and dcat:distribution) using the dcterms and dcat vocabularies, and we populated them consistently across the catalog.
We also implemented the required separation between dataset and distribution by modeling each distribution with its required fields (notably dcterms:format, dcterms:license, and dcat:accessURL), and we explicitly represented our own file conversions and segmentations when producing new distributions; where relevant, we captured temporal and spatial framing (e.g., dcterms:temporal and dcterms:spatial) to reflect annualized series or geographically scoped extractions.
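As an illustration of the kind of completeness check this implies, the following minimal Python sketch flags missing properties in a dataset description and in its distributions. The property lists are taken from the two paragraphs above; the record itself, all of its values, and the helper `missing_properties` are hypothetical placeholders, not the project's actual metadata or tooling.

```python
# Dataset-level properties listed as mandatory in the text above.
MANDATORY_DATASET = [
    "dcterms:identifier", "dcterms:title", "dcterms:description",
    "dcterms:modified", "dcat:theme", "dcterms:rightsHolder",
    "dcterms:accrualPeriodicity", "dcat:distribution",
]
# Distribution-level properties named in the text above.
MANDATORY_DISTRIBUTION = ["dcterms:format", "dcterms:license", "dcat:accessURL"]


def missing_properties(record, required):
    """Return the required properties that are absent or empty in a record."""
    return [prop for prop in required if not record.get(prop)]


# Hypothetical record: every value is a placeholder, not real metadata.
example_dataset = {
    "dcterms:identifier": "example-dataset",
    "dcterms:title": "Example dataset",
    "dcterms:description": "Placeholder description.",
    "dcterms:modified": "2025-01-01",
    "dcat:theme": "SOCI",
    "dcterms:rightsHolder": "Example holder",
    "dcat:distribution": [
        {"dcterms:format": "CSV", "dcat:accessURL": "https://example.org/data.csv"}
    ],
}

dataset_gaps = missing_properties(example_dataset, MANDATORY_DATASET)
distribution_gaps = [
    missing_properties(dist, MANDATORY_DISTRIBUTION)
    for dist in example_dataset["dcat:distribution"]
]
print(dataset_gaps)       # properties missing at dataset level
print(distribution_gaps)  # one list of missing properties per distribution
```

A check like this only covers the presence of properties; conformance of values to controlled vocabularies still requires a profile-aware validator.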
Mashup datasets
All mashup datasets were described consistently under DCAT-AP_IT from the beginning, so that even the datasets created inside the project are described with the same up-to-date national standard used for public-sector catalogs in Italy, and can therefore be interpreted in the same way as the original sources in terms of Dataset/Distribution separation, access points, formats, licenses, themes and agents. In addition, we made provenance explicit by linking each mashup dataset to its originating dataset(s) using the W3C PROV ontology, a standard vocabulary specifically designed to represent and interchange provenance information (who or what generated a resource, from which inputs, through which transformations), so that readers can reconstruct the genealogy of each derived table.
We used in particular prov:wasDerivedFrom to state derivation relationships, and we set ourselves as dcterms:publisher for the newly produced outputs: this cross-referencing within metadata matters because it makes data-driven conclusions auditable, supports reuse without guesswork, and lets downstream users assess trust, scope, and limitations by immediately seeing which source(s) each mashup depends on.
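To make the derivation links concrete, here is a minimal Python sketch that writes `prov:wasDerivedFrom` statements as Turtle and traces a mashup back to its original sources. The input lists follow the Workflow section of this page, except the link from MD6 to MD5, which is an assumption made for illustration; the `ex:` namespace and both helper functions are hypothetical and do not reproduce the project's actual identifiers.

```python
PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix ex: <https://example.org/dataset/> .\n"
)

# Inputs of each mashup as described in the Workflow section.
# MD6 -> MD5 is an assumption: the text does not name the inputs of MD6.
DERIVED_FROM = {
    "MD1": ["D1", "D2"],
    "MD2": ["D2", "D3"],
    "MD3": ["D1", "D2", "D4"],
    "MD4": ["D1", "D5"],
    "MD5": ["D5", "D6"],
    "MD6": ["MD5"],
}


def to_turtle(derived_from):
    """Serialize the derivation map as one Turtle statement per mashup."""
    lines = [PREFIXES]
    for mashup, sources in derived_from.items():
        objects = ", ".join(f"ex:{source}" for source in sources)
        lines.append(f"ex:{mashup} prov:wasDerivedFrom {objects} .")
    return "\n".join(lines)


def original_sources(dataset, derived_from):
    """Follow derivation links until only non-derived datasets remain."""
    if dataset not in derived_from:
        return {dataset}
    found = set()
    for source in derived_from[dataset]:
        found |= original_sources(source, derived_from)
    return found


print(to_turtle(DERIVED_FROM))
print(sorted(original_sources("MD6", DERIVED_FROM)))
```

Keeping the links in a single map means the same information can drive both the serialized metadata and any lineage report shown to readers.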
Technical hardening
To reduce avoidable metadata drift, as an extra layer of technical care we added automated checks in our GitHub repository: a GitHub Action runs rapper, a command-line RDF parser and serializer from the Raptor RDF toolkit, on every push to ensure our Turtle files parse correctly and remain syntactically valid, treating RDF syntax errors like failing tests.
This continuous syntax check is complemented by a semantic compliance check: we also submitted our metadata Turtle file to the official DCAT-AP_IT validator published by AgID, which verifies structural congruence against the profile rules. At the time of publication of the project, the metadata file returned 0 errors, giving us an external confirmation that the file is not only well-formed RDF, but also aligned with the DCAT-AP_IT constraints expected by the national guidelines.
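The syntax check with rapper described above can also be reproduced locally. The sketch below is a hypothetical Python wrapper, not the project's GitHub Action: it assumes that rapper is installed, that `--input turtle` and `--count` select Turtle parsing and count-only mode, and that rapper exits with a non-zero status when parsing fails. The function names and the injectable `run` parameter are illustrative choices.

```python
import subprocess


def rapper_command(path):
    """Build a rapper call that parses a Turtle file and only counts triples."""
    return ["rapper", "--input", "turtle", "--count", path]


def invalid_files(paths, run=subprocess.run):
    """Return the files for which rapper reports a failure.

    Assumption: a non-zero exit status means the file did not parse.
    The `run` parameter exists only so the function can be tried
    without rapper installed, by passing a stand-in.
    """
    failed = []
    for path in paths:
        result = run(rapper_command(path), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Calling `invalid_files(["metadata.ttl"])` (a placeholder file name) and failing the build when the returned list is non-empty would mirror the "syntax errors as failing tests" policy.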
FAIR principles
The table below uses the “FAIR Principles overview” as a checklist to evaluate our outputs against a widely used reference framework for data stewardship. The FAIR Principles, promoted by the GO FAIR Initiative, define what it means for digital assets to be Findable, Accessible, Interoperable, and Reusable, and they stress machine-actionability: the goal is not only that humans can read a dataset description, but that software agents can discover it via identifiers and registries, understand enough metadata to interpret it, and follow explicit access and reuse conditions with minimal manual mediation.
By mapping each principle to concrete cataloging choices (identifiers, access URLs, formats, licenses, provenance links, and vocabulary usage), the checklist helps make our level of compliance explicit and auditable, and highlights where the catalog behaves like a robust, standards-aligned digital object rather than a one-off collection of files.
Findable (to check):
F1. (Meta)data are assigned a globally unique and persistent identifier
F2. Data are described with rich metadata (defined by R1 below)
F3. Metadata clearly and explicitly include the identifier of the data they describe
F4. (Meta)data are registered or indexed in a searchable resource
Accessible (to check):
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
A1.1. The communication protocol is open, free, and universally implementable
A1.2. The communication protocol allows for an authentication and authorisation procedure, where necessary
A2. Metadata are accessible, even when the data are no longer available
Interoperable (to check):
I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2. (Meta)data use vocabularies that follow FAIR principles
I3. (Meta)data include qualified references to other (meta)data
Reusable (to check):
R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (Meta)data are released with a clear and accessible data usage license
R1.2. (Meta)data are associated with detailed provenance
R1.3. (Meta)data meet domain-relevant community standards
Workflow
The original datasets were processed using the KNIME platform. The process included data cleansing, handling missing values, and eliminating fields unnecessary for the analysis, as well as combining multiple data sources to create the final mashup datasets.
The image below allows navigation through the workflow, while the full KNIME workflow can be downloaded here.
The first step of the data study consisted in identifying the number of accommodation facilities in the city of Bologna throughout the years, so that its growth could be visualized and later compared with rental prices.
The team started the KNIME workflow by analysing D1, published by Open Data Bologna: a list of all tourist accommodations opened (and possibly closed) between 1982 and 2025, with the address and type of each facility. Data minimization in this case consisted in removing all geospatial information (address and coordinates); the dataset was then filtered to keep only the facilities that were open during the time span we considered. By counting the facilities annually, we observed that their number in 2024 was over five times higher than in 2014.
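The yearly count described above can be sketched in a few lines of Python. This is an illustration only, since the project performed the step in KNIME: the record layout (opening year, closing year), the rule that a facility counts for every year between opening and closing inclusive, and the sample records are all assumptions.

```python
def open_facilities_per_year(records, first_year, last_year):
    """Count the facilities open at some point in each year of the range.

    Each record is (opening_year, closing_year); closing_year is None
    when the facility is still open.
    """
    counts = {}
    for year in range(first_year, last_year + 1):
        counts[year] = sum(
            1
            for opened, closed in records
            if opened <= year and (closed is None or closed >= year)
        )
    return counts


# Invented records, for illustration only.
records = [(1982, None), (2010, 2015), (2016, None), (2016, 2016), (2020, None)]
counts = open_facilities_per_year(records, 2014, 2017)
print(counts)
```

How a facility that closes mid-year is counted affects the totals, so the inclusion rule should be stated alongside any figure derived this way.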
We then compared these results with D2, from Istat: this dataset has the same focus as D1, as it counts the different types of accommodation for each city in Italy. It is organized as a folder of 11 files, one for each year in our time span, since on the Istat website the files are divided by year and bundled in a zip archive available for download. Since the Istat dataset covers all Italian cities, we filtered it to focus only on Bologna. As the dataset already provided the total number of facilities per city per year, it was easy to visualize the rise across the years. Nevertheless, the Istat figures were far below those from Open Data Bologna. This discrepancy likely arises because the Bolognese Open Data portal logs all locally registered entities, whereas Istat tends to exclude facilities that are active but report zero occupancy. Moreover, the Istat dataset focuses on hotel establishments, hence it may not include short-term rentals such as those offered through Airbnb, Booking and similar platforms.
While the absolute numbers differed substantially, the upward trend remained evident in both data sources. We therefore decided to check the correlation between the two datasets and to understand whether their growth rates were similar over the years. This resulted in our first mashup dataset, MD1, with which we visualized the comparison between the yearly percentage growth of the two datasets. The figures did not line up perfectly, but they were qualitatively aligned, showing a shared trajectory despite the numerical gap; this reassured us that both sources could be treated as authoritative, even though they rely on different inclusion criteria for their counts.
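A comparison of growth rates like the one behind MD1 can be sketched as follows. The code is a hypothetical Python illustration of year-over-year percentage change, not the KNIME nodes actually used, and the two series contain invented values rather than the D1 and D2 counts.

```python
def yearly_growth(series):
    """Percentage change from each year to the next; series maps year -> count."""
    years = sorted(series)
    return {
        year: (series[year] - series[prev]) * 100 / series[prev]
        for prev, year in zip(years, years[1:])
    }


# Invented counts, for illustration only.
municipal = {2014: 100, 2015: 120, 2016: 150}
national = {2014: 50, 2015: 55, 2016: 66}

municipal_growth = yearly_growth(municipal)
national_growth = yearly_growth(national)

# True for a year when both sources move in the same direction.
same_direction = {
    year: (municipal_growth[year] > 0) == (national_growth[year] > 0)
    for year in municipal_growth
}
print(municipal_growth, national_growth, same_direction)
```

Comparing percentage changes rather than raw counts is what allows two sources with very different absolute levels to be placed on the same chart.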
The second focus in our workflow aimed to assess whether the expansion of the accommodation sector was commensurate with the surge in tourist arrivals in Bologna. For this purpose, we analyzed D3, another dataset from Istat, which records tourist flows (specifically arrivals and overnight stays) within accommodation establishments for every Italian municipality. In this instance as well, we extracted only the data pertaining to the municipality of Bologna and reduced the dataset to the total arrivals per year in the city. We chose to consider only arrivals, rather than total nights spent, to quantify the actual number of visitors. The results show that the tourist flow grew steadily until 2020, when a sharp decline was registered due to the pandemic; figures then rebounded significantly, reaching new peaks by 2024.
At this point in the analysis we compared the growth in the number of facilities with the increase in tourist arrivals, questioning whether the proliferation of these units was justified by a proportional influx of tourists, or if other market dynamics were at play. For this purpose we selected D2 and D3 as source datasets, because both come from Istat and draw information from the same group of registered facilities. We calculated the annual tourism carrying capacity in Bologna (densità di carico: total arrivals / total facilities) and visualized its decrease from 2014 to 2024, which suggests that the supply of accommodation has grown faster than actual demand.
The second mashup dataset (MD2) we created includes exactly the total arrivals and total facilities per year according to Istat and the calculated ratio between the two.
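The ratio stored in MD2 is a simple division per year. The Python sketch below shows one way to assemble such rows; the function name, the row layout and the numbers are assumptions made for illustration and are not the Istat figures.

```python
def arrivals_per_facility(arrivals, facilities):
    """Build (year, arrivals, facilities, ratio) rows for years present in both inputs."""
    rows = []
    for year in sorted(arrivals):
        if facilities.get(year):  # skip missing or zero facility counts
            ratio = arrivals[year] / facilities[year]
            rows.append((year, arrivals[year], facilities[year], ratio))
    return rows


# Invented totals, for illustration only.
arrivals = {2014: 1000, 2015: 1200}
facilities = {2014: 10, 2015: 16}

md2_rows = arrivals_per_facility(arrivals, facilities)
print(md2_rows)
```

In this toy example arrivals rise while the ratio falls, which is the pattern the text reads as supply growing faster than demand.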
A legal assessment of the regulatory framework associated with tourist short-term rentals was also conducted as part of this study. Due to the lack of a dataset tailored to the team’s needs, we compiled one (D4) by selecting the most relevant legislation on the matter, introduced quite recently to regulate a short-term rental market characterized by significant regulatory gaps. For our analysis, we selected national laws from the Normattiva portal, identifying relevant legislation through a set of keywords closely aligned with the subject matter of the research. The laws included in the dataset are the following:
D.L. 50/2017: Introduced the tax regime for short-term rentals, defining the 21% flat-rate tax (cedolare secca).
D.L. 34/2019: Established the first national database for accommodation facilities and short-term rental properties, laying the groundwork for a standardized Identification Code system.
D.L. 145/2023: Introduced the National Identification Code (CIN), mandating security standards and establishing a rigorous penalty system to combat unauthorized hospitality practices.
L. 213/2023: Amended the taxation of short-term rentals by increasing the flat-rate tax from 21% to 26% for owners managing multiple rental units (from the second unit onwards).
The Tourism Code (D.Lgs. 79/2011), already mentioned, was also included for completeness.
The legislation was introduced into the data analysis to visualize possible long-term effects of the laws on the number of new facilities opening in the city of Bologna. The mashup dataset resulting from this step (MD3) combines the same yearly totals for the facilities sourced from D1 and D2 with the titles and descriptions of the laws. The timeframe under analysis was divided into phases, each beginning with the entry into force of a specific law, in order to assess whether the introduction of the legislation produced any observable effects. Yet the visualization shows that, despite the laws being introduced, the number of facilities keeps rising steeply throughout the years, as already noted for the first visualization. This may indicate a regulatory lag or a market where capital flow is resilient to legislative constraints, signaling that the measures implemented have not yet been strong enough to limit expansion.
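The division into phases can be pictured as a lookup from a year to the most recent measure in force. The Python sketch below is a simplification under stated assumptions: it uses the year in each measure's citation as the start of its phase (the actual entry-into-force dates are not reproduced here), merges the two 2023 measures into one phase, and uses invented label strings and function names.

```python
from bisect import bisect_right

# (phase start year, label). Start years are taken from the citations
# as a simplifying assumption, not from the entry-into-force dates.
PHASES = [
    (2011, "79/2011"),
    (2017, "50/2017"),
    (2019, "34/2019"),
    (2023, "145/2023 + 213/2023"),
]
PHASE_STARTS = [start for start, _label in PHASES]


def phase_for(year):
    """Return the label of the latest phase that has started by the given year."""
    index = bisect_right(PHASE_STARTS, year) - 1
    return PHASES[index][1] if index >= 0 else None


for year in (2014, 2017, 2020, 2024):
    print(year, phase_for(year))
```

Tagging each yearly facility count with its phase in this way makes it possible to compare growth before and after each measure.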
The research then turned to the economic impact, specifically addressing housing affordability and the upward pressure on rental rates. D5 was sourced from Idealista and was organized as a folder of monthly rent prices from 2012 to 2025 for five zones of Bologna: the city centre and four outer zones around it. The dataset was filtered for our years and months of interest, and the prices of the four outer zones were then averaged, so that a single outer-zone price could be compared with that of the historic centre. Subsequently, the average annual rental price for the two macro-areas was calculated from the monthly values; the analysis revealed that prices rose steeply in both areas over the years.
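The two-step averaging described above (outer zones first, then months) can be sketched in Python as follows. The row layout, the zone names and the prices are invented for illustration; the project carried out this step in KNIME on the Idealista data.

```python
from collections import defaultdict
from statistics import mean


def annual_prices(rows, centre_zone="centre"):
    """Average rows of (year, month, zone, price) into yearly centre/outer prices."""
    outer_by_month = defaultdict(list)  # (year, month) -> prices of the outer zones
    centre_by_year = defaultdict(list)  # year -> monthly prices of the centre
    for year, month, zone, price in rows:
        if zone == centre_zone:
            centre_by_year[year].append(price)
        else:
            outer_by_month[(year, month)].append(price)

    # Step 1: one outer-zone price per month. Step 2: one price per year.
    outer_by_year = defaultdict(list)
    for (year, _month), prices in outer_by_month.items():
        outer_by_year[year].append(mean(prices))

    return {
        year: {"centre": mean(centre_by_year[year]), "outer": mean(outer_by_year[year])}
        for year in sorted(centre_by_year)
        if year in outer_by_year
    }


# Invented prices, for illustration only.
rows = [
    (2014, 1, "centre", 12.0), (2014, 1, "north", 8.0), (2014, 1, "south", 10.0),
    (2014, 2, "centre", 14.0), (2014, 2, "north", 9.0), (2014, 2, "south", 11.0),
]
yearly = annual_prices(rows)
print(yearly)
```

Averaging the outer zones with equal weight treats each zone as equally important regardless of how many listings it contains, which is worth keeping in mind when reading the result.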
The rental prices were then correlated with the number of facilities extracted from D1, the dataset from Open Data Bologna, as it also distinguishes between facilities located in the historic centre and those in the outer zones. First, a simple linear correlation was performed, yielding a coefficient close to 1. To test the robustness of this correlation, we compared the growth rates of both prices and facilities across the two zones. The resulting visualization shows that the two metrics follow nearly identical fluctuations, which supports the hypothesis that the rise in rental prices is linked to the increase in short-term rental facilities. The results suggest that properties previously available for long-term residential use may have been converted into tourist accommodation, causing a supply-side contraction that drives rental prices upward while demand remains constant. The MD4 mashup dataset supporting this visualization includes the annual rental prices, the number of accommodation facilities and their corresponding growth rates for both the historic center and the outer zones of Bologna.
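For reference, a linear correlation of this kind is usually computed as Pearson's coefficient. The Python sketch below shows the calculation on invented numbers; it assumes that "simple linear correlation" means Pearson's r, which the text does not state explicitly, and it is not the KNIME node used in the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Invented yearly values, for illustration only.
facilities = [100, 150, 210, 300]
prices = [10.0, 11.5, 13.0, 16.0]
print(round(pearson(facilities, prices), 3))
```

Any two series that both rise over time will produce a coefficient near 1, which is why the comparison of growth rates is a useful second check and why the result is read as an association rather than proof of causation.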
Lastly, the study examined the effects of touristification on students, focusing on housing affordability. The objective was to determine whether this demographic segment is particularly vulnerable to or affected by the rising costs associated with the growth of short-term rentals. D6 was analysed for this matter: the dataset comes from MUR and describes student enrollment by province of residence and degree program for all Italian universities, from A.Y. 2010/2011 to A.Y. 2024/2025.
First, the team filtered the dataset to keep only information regarding the University of Bologna. Then, students from different programs enrolled in the same academic year were summed together, and the same was done for students coming from the same region; this was done to avoid singling out individuals in cases where very few students from a specific province were enrolled in a given course. The next step consisted in filtering out the years not included in the range of interest.
The first focus of the analysis was to verify whether the total number of enrolled students had decreased over time, possibly due to the increase in rental prices. Yet the numbers actually increased between 2014 and 2024.
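The aggregation used to reduce re-identification risk can be sketched as a group-and-sum that discards the fine-grained columns. In the Python illustration below, the row layout, the partial province-to-region lookup and the counts are assumptions; the real step was performed in KNIME on the MUR-USTAT data.

```python
from collections import defaultdict

# Partial lookup for illustration; a real table would cover every province.
PROVINCE_TO_REGION = {"BA": "Puglia", "LE": "Puglia", "MI": "Lombardia"}


def aggregate_by_region(rows, province_to_region):
    """Sum students per (academic year, region), dropping province and discipline.

    Each row is (academic_year, province, disciplinary_group, students).
    """
    totals = defaultdict(int)
    for year, province, _group, students in rows:
        totals[(year, province_to_region[province])] += students
    return dict(totals)


# Invented rows, for illustration only.
rows = [
    ("2014/2015", "BA", "Physics", 2),
    ("2014/2015", "LE", "Law", 1),
    ("2014/2015", "MI", "Law", 5),
]
by_region = aggregate_by_region(rows, PROVINCE_TO_REGION)
print(by_region)
```

After aggregation the small province-by-discipline cells no longer appear in the output, which is the point of the minimization step.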
Hence, we examined whether the regional composition of the student population has evolved over the years, focusing on changes in the geographic origin of enrolled students. Since the focus was on Italian regions, we removed the information about foreign students from the dataset. We then calculated, for each year, the percentage of students from each region out of the annual total, to observe the geographic evolution of the student body over time. From these shares we obtained the ten-year percentage change (delta) between 2014 and 2024 for each region and observed a decline in the proportion of students from several Southern Italian regions, indicating that these areas represent a smaller share of the student body than a decade ago. This is displayed geographically in the visualization of the Italian regions throughout the years of the considered time span.
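The share and delta calculation can be sketched in Python as below. The sketch assumes that the delta is the difference between the two shares in percentage points, which is one possible reading of "ten-year percentage change"; the region names and counts are invented.

```python
def regional_shares(counts):
    """Percentage of the yearly total contributed by each region."""
    total = sum(counts.values())
    return {region: students * 100 / total for region, students in counts.items()}


def share_delta(counts_start, counts_end):
    """Change in share, in percentage points, between two years."""
    start = regional_shares(counts_start)
    end = regional_shares(counts_end)
    return {region: end.get(region, 0.0) - start[region] for region in start}


# Invented counts, for illustration only.
students_2014 = {"Region A": 50, "Region B": 50}
students_2024 = {"Region A": 90, "Region B": 60}

delta = share_delta(students_2014, students_2024)
print(delta)
```

In this toy example Region B sends more students in absolute terms yet loses share, which shows how total enrollment can grow while a region's weight in the student body declines.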
University Enrollment by Year
The aim at this point was to correlate the rise in rental prices with the geographic evolution of the student population. MD5 was produced using D5 and D6 and includes the number of students enrolled each year from each Italian region, aligned with the average rental price in Bologna for that year (the distinction between prices of the historic center and the outer zones is irrelevant for this purpose). This mashup dataset was used to compute the correlation between the number of students and the prices for each region.
The result was a visualization of the Regional Sensitivity Index, representing how sensitive each region is to the price rise. The southern regions, which registered a negative delta, also show a negative correlation coefficient. This is consistent with a high sensitivity to price increases and suggests that students from more distant regions are increasingly deterred from moving to Bologna. This visualization is documented by MD6, the last mashup dataset, which juxtaposes the sensitivity index with the ten-year percentage delta of students for all Italian regions. Emilia-Romagna was excluded from this specific analysis, as the majority of students from this region are commuters and do not typically participate in the local rental market. The overall pattern is confirmed by the University of Bologna itself in an article from 2025, which states that for the A.Y. 2024/2025 a decline in new enrollments from Southern Italy was recorded, a trend consistent with the rising cost of living and renting affecting major university cities in the North of Italy. According to the article, this is corroborated by a simultaneous increase in enrollments in Southern universities and, conversely, by rising enrollments from Bologna and Emilia-Romagna, which also suggests that students are increasingly opting for local institutions to avoid the prohibitive costs of relocation.
Therefore, our data analysis acts as a confirmation of what has been reported in the article.
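One way to read the Regional Sensitivity Index described above is as a per-region correlation coefficient between yearly enrollment and the yearly average rent. The Python sketch below follows that reading, which is an assumption about the index rather than a documented definition; the helper names, the excluded-region parameter and all values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread)


def sensitivity_index(prices, students_by_region, excluded=()):
    """Correlate each region's yearly enrollment with the yearly average rent."""
    return {
        region: pearson(prices, students)
        for region, students in students_by_region.items()
        if region not in excluded
    }


# Invented yearly values, for illustration only.
prices = [10.0, 12.0, 15.0]
students_by_region = {
    "Region A": [100, 90, 80],
    "Region B": [50, 60, 75],
    "Home region": [200, 210, 220],
}

index = sensitivity_index(prices, students_by_region, excluded=("Home region",))
for region, value in sorted(index.items(), key=lambda item: item[1]):
    print(region, round(value, 2))
```

With only a handful of yearly observations per region, coefficients of this kind are unstable, so they are better read as a ranking aid than as a precise measure.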
RDF Serialization
Below are the metadata produced for describing the datasets used and created within the project, following the DCAT-AP_IT standard and serialized in RDF/Turtle format. Please refer to the Technical Analysis section of the website for more details on the reasoning behind the chosen RDF metadata assertions and structure.
Sustainability
The sustainability of the data used in this project is evaluated based on source reliability.
The datasets (D1, D2, D3, D6) sourced from the Municipality of Bologna’s Open Data portal, ISTAT and MUR-USTAT are considered high-sustainability sources, since they are maintained under legal mandates for public statistics and administrative transparency. While these datasets do not currently use persistent identifiers (PIDs), their structured naming conventions and institutional metadata ensure they remain retrievable through official archives even in the event of URL reconfiguration.
The Idealista dataset (D5) represents a high-volatility source, since commercial newsroom URLs are highly likely to become obsolete or change. In compliance with intellectual property constraints, raw commercial datasets will be destroyed once the primary research objectives are achieved, as we do not hold the rights for long-term redistribution.
To ensure the long-term accessibility of our unique contributions, the legislation dataset (D4) and the integrated "mashup" datasets produced by the team have been assigned permanent identifiers through GitHub.
Here To Stay is the final project developed for the Open Access And Digital Ethics course (a.y. 2025/2026) within the Digital Humanities and Digital Knowledge Master's Degree (University of Bologna). As such, the project is not subject to active maintenance or future updates beyond the scope of its initial submission.
Conclusions
Touristification has been at the center of mainstream debate for more than a decade, affecting not only Italy but several other European countries. The growth of short-term rental platforms has intensified pressure on housing markets, raising living costs and straining urban coexistence. The slogan “tourists go home”, which emerged from citizen protests in Barcelona, has become a symbol of the urban malaise felt by those who inhabit the city. To date, unlike cities such as Barcelona, which announced a ban on short-term rentals, Italy has relied on local regulations.
According to a recent analysis by Demoskopika on overtourism, Bologna currently faces a “moderate risk”, yet the consequences are already strikingly visible. In this context, the Municipality of Bologna attempted in December 2025 to introduce stricter rules for tourist rentals in the historic center, but these were overturned by the Council of State due to their impact on established economic interests. While datasets, as stated throughout our analysis, only partially reflect the human cost of the crisis, student protests and the growing difficulty in finding housing highlight a deeper issue: the housing question has become a factor of social exclusion and a violation of the residents’ Right to the City.
Although Airbnb is not the sole driver of this crisis, Bologna’s housing emergency is an undeniable reality, confirmed by the city's own strategic responses. In 2023, the Municipality launched the Piano per l’Abitare (“Plan for Housing”), a €200 million investment to create 3,000 housing units for marginalized groups, students and those seeking below-market rents.
We’re all aware that without structural intervention and a rethink of urban space as a public good, cities like Bologna face the loss of their most vital asset: the social and cultural diversity brought by the student population. By investigating this phenomenon through data, our analysis aimed to sharpen the focus on this crisis. We firmly believe that, beyond public debate, rigorous data analysis and visualization can stimulate new perspectives and ensure this issue remains at the forefront of the political agenda.