Dwelling Inference¶
Inference enriches GIS records by matching them with records from non-localized data sources that contain more complete information. For dwelling inference, we use the ‘construction_year_class’, ‘residential_type’ and ‘district’ attributes derived from GIS data to find the corresponding records in the census data, from which we extract the following attributes:
‘occupancy_type’ : (possible values : ‘primary residence’, ‘second home’, ‘vacant’, ‘occasional housing’)
‘occupant_count’ : the number of dwelling occupants, from 0 to 6 (set to 2 for second homes)
‘heating_system’ : the main heating system of the dwelling (possible values : ‘electric_heater’, ‘electric_heat_pump’, ‘oil_boiler’, ‘gas_boiler’, ‘wood_boiler’, ‘district_network’)
‘living_area_class’ : (possible values : ‘Less than 30 m²’, ‘From 30 to 40 m²’, ‘From 40 to 60 m²’, ‘From 60 to 80 m²’, ‘From 80 to 100 m²’, ‘From 100 to 120 m²’, ‘More than 120 m²’)
Algorithm¶
All the dwellings are grouped according to ‘construction_year_class’, ‘residential_type’ and ‘district’
The records with the same set of attributes in the census database are selected
If there are no records, the operation is repeated by replacing ‘district’ with a higher geographic level (‘city’, ‘city_group’, ‘department’, ‘region’) until records are found
If the total weight of the selected census records is lower than the number of dwellings in the group, the weights are scaled up until their total is no longer lower than the number of dwellings
A list of census records with unit weight is generated by repeating the records N times, where N is the integer closest to the record’s weight
The matching census records are selected by random sampling without replacement
The inferred data is transferred from the census records to the dwellings
The geographic level of inference is recorded as ‘inference_geo_level’
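The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function names, the dictionary-based record layout and the `weight` key are assumptions, as is the choice to grow the scale factor by 10% whenever rounding leaves the expanded list shorter than the group.

```python
import random

# Geographic levels tried in order, from finest to coarsest.
GEO_LEVELS = ["district", "city", "city_group", "department", "region"]
INFERRED = ["occupancy_type", "occupant_count", "heating_system", "living_area_class"]


def select_census_records(key, n_dwellings, census, rng):
    """Pick n_dwellings census records for one group; return (records, geo_level)."""
    for level in GEO_LEVELS:
        candidates = [
            r for r in census
            if r["construction_year_class"] == key["construction_year_class"]
            and r["residential_type"] == key["residential_type"]
            and r[level] == key[level]
            and r["weight"] > 0  # assumption: zero-weight records are ignored
        ]
        if candidates:
            break
    else:
        raise LookupError("no census record found at any geographic level")

    # Scale the weights up only when their total is below the group size.
    total = sum(r["weight"] for r in candidates)
    scale = max(1.0, n_dwellings / total)
    while True:
        # Repeat each record N times, N being the integer closest to its weight.
        pool = [r for r in candidates for _ in range(round(r["weight"] * scale))]
        if len(pool) >= n_dwellings:
            break
        scale *= 1.1  # rounding left the pool short: scale a little more

    # Random sampling without replacement.
    return rng.sample(pool, n_dwellings), level


def infer_group(dwellings, key, census, rng):
    """Transfer the inferred attributes to every dwelling of the group."""
    records, level = select_census_records(key, len(dwellings), census, rng)
    for dwelling, record in zip(dwellings, records):
        for attr in INFERRED:
            dwelling[attr] = record[attr]
        dwelling["inference_geo_level"] = level
    return dwellings
```

Calling `infer_group` once per group of dwellings, with `rng = random.Random(seed)`, keeps the draw reproducible.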
Note
When inferring data for dwellings whose ‘residential_type’ is ‘apartment’, a consistent ‘heating_system’ must be enforced within each building, which can contain several dwellings. To achieve this, we add ‘building_id’ to the attributes used to group dwellings and select a single census record, from which the ‘heating_system’ is taken and applied to the whole group of dwellings. The algorithm above can then be applied with ‘heating_system’ as an additional matching attribute.
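As a rough illustration of this building-level step, the sketch below draws one census record per ‘building_id’ and copies its ‘heating_system’ to all dwellings of that building. The function name and record layout are hypothetical, and drawing the record in proportion to its weight is an assumption: the text only states that a single record is selected.

```python
import random


def assign_building_heating(dwellings, candidates, rng):
    """Give all dwellings of a building the heating system of one census record.

    dwellings: list of dicts with a 'building_id' key.
    candidates: matching census records with 'weight' and 'heating_system' keys.
    """
    weights = [r["weight"] for r in candidates]
    heating_by_building = {}
    for dwelling in dwellings:
        building_id = dwelling["building_id"]
        if building_id not in heating_by_building:
            # Assumption: one weighted draw per building.
            record = rng.choices(candidates, weights=weights, k=1)[0]
            heating_by_building[building_id] = record["heating_system"]
        dwelling["heating_system"] = heating_by_building[building_id]
    return dwellings
```

The remaining attributes can then be inferred per dwelling, with ‘heating_system’ added to the matching attributes.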
Living area estimation (experimental)¶
Estimating the living area of residential buildings is necessary to determine their energy consumption and energy efficiency. However, there is no simple relationship between the living area and the floor area of a building. Below is a scatter plot of living area against floor area, obtained by matching energy diagnosis records (which contain the living area) to BDTOPO building addresses using the ADRESSE PREMIUM database from IGN.

Living area as a function of floor area for buildings with address-level matching energy diagnosis record in the Rhône department (a filter on minimal and maximal values of areas has been applied)¶
The approach taken to circumvent this issue relies on using the subset of buildings for which the living area can be obtained from a diagnosis record to create probabilistic models of the relationship between living area and floor area for various building configurations.
Individual houses¶
For individual houses, we only need to find the living area of a single dwelling. To achieve this, we estimate the ratio between living area and floor area (the living area share). We start by grouping the buildings into three categories:
the building has no other use and no annex
the building has no other use and an annex
the building has multiple uses
For each category, we define intervals of floor areas for which the histogram of living area share is displayed below.

Histograms of living area share for the various floor area intervals for the category ‘no other use, no annex’¶

Histograms of living area share for the various floor area intervals for the category ‘no other use, with annex’¶

Histograms of living area share for the various floor area intervals for the category ‘multiple use’¶
The marked differences in the living area share distribution across categories and floor area intervals confirm the relevance of these groupings. However, a significant amount of anomalous data is present, such as buildings with no other use and no annex whose living area share is below 30%, or buildings with a living area share close to or above 100%. To accommodate these limitations, we propose the following procedure:
fit a metalog distribution to each grouping of living area share
group the dwellings according to their ‘residential_only’, ‘has_annex’ and ‘floor_area’ attributes
draw the living area share in the corresponding metalog distribution
clip the living area share to minimal and maximal values depending on the ‘residential_only’ and ‘has_annex’ attributes
calculate the living area
residential_only | has_annex | min living_area_share | max living_area_share |
---|---|---|---|
True | False | 70% | 85% |
True | True | 20% | 70% |
False | False | 5% | 50% |
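The draw-and-clip procedure for individual houses can be sketched as below. Fitting a metalog distribution is outside the scope of this sketch, so a linear interpolation of the empirical quantiles of the observed shares is used as a plainly labeled stand-in; the function names are hypothetical. The bounds are those of the table above, which gives no values for a multiple-use building with an annex, so that combination raises a `KeyError` here.

```python
import random

# Clipping bounds keyed by (residential_only, has_annex), from the table above.
SHARE_BOUNDS = {
    (True, False): (0.70, 0.85),
    (True, True): (0.20, 0.70),
    (False, False): (0.05, 0.50),
}


def draw_share(observed_shares, rng):
    """Inverse-transform draw from empirical quantiles (stand-in for a metalog fit)."""
    xs = sorted(observed_shares)
    u = rng.random() * (len(xs) - 1)  # position along the sorted sample
    i = int(u)
    if i + 1 >= len(xs):
        return xs[-1]
    return xs[i] + (u - i) * (xs[i + 1] - xs[i])


def estimate_living_area(floor_area, residential_only, has_annex, observed_shares, rng):
    """Return (living_area, living_area_share) for one individual house."""
    low, high = SHARE_BOUNDS[(residential_only, has_annex)]
    share = min(max(draw_share(observed_shares, rng), low), high)  # clip to bounds
    return share * floor_area, share
```

Here `observed_shares` stands for the living area shares of the diagnosis-matched buildings in the same category and floor area interval as the house being estimated.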
Warning
Consistency between the estimated living area and the inferred census data could be obtained by drawing the census records from a sample restricted to the corresponding living area class. However, in cases where the living area distribution does not match the living area classes at the district level, this creates a selection bias that results in shares of heating systems inconsistent with district-level census data. As a consequence, we currently ignore the living area class when selecting the initial census record and correct the occupant count afterwards by drawing a new record with the living area class and heating system as additional matching attributes.
Apartments¶
For collective housing buildings, the objective is to estimate the living area of multiple dwellings, while handling anomalous situations in which the number of dwellings itself might need to be adjusted. The first part of the procedure consists in independently estimating the living area of each dwelling, using the inferred living area class and metalog distributions of living area for each class obtained from a random sample of diagnosis data.

Histogram of living area by class for a random sample of energy diagnosis of apartments¶
The second part of the procedure is a corrective algorithm that deals with cases where the resulting building-level living area share falls outside bounds that depend on the ‘residential_only’ and ‘has_annex’ attributes of the building. For such cases, the following steps are performed:
a living area share target is drawn inside the bounds
the areas of the smallest and largest dwellings are multiplied by the ratio between the target living area and the current living area
if the values obtained lie in a range of reasonable living areas (10 to 250 m²), the scaling is performed for all living areas
if not, randomly selected dwellings are added or removed until the living area lies inside the bounds
residential_only | has_annex | min living_area_share | max living_area_share |
---|---|---|---|
True | False | 70% | 90% |
True | True | 40% | 70% |
False | False | 5% | 70% |
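A minimal sketch of the corrective steps for one building follows. Several details are assumptions that the text does not specify: the target share is drawn uniformly inside the bounds, an added dwelling copies the living area of a randomly selected existing one, and the add/remove loop is capped at `max_iter` iterations. The function name is hypothetical.

```python
import random

MIN_AREA, MAX_AREA = 10.0, 250.0  # range of reasonable living areas, in m²


def correct_building(areas, floor_area, bounds, rng, max_iter=1000):
    """Return dwelling living areas whose total share of floor_area lies in bounds.

    areas: non-empty list of positive dwelling living areas for one building.
    bounds: (min living_area_share, max living_area_share) for the building.
    """
    low, high = bounds
    areas = list(areas)
    if low <= sum(areas) / floor_area <= high:
        return areas  # nothing to correct

    # Draw a target share inside the bounds (assumption: uniform draw).
    target = rng.uniform(low, high) * floor_area
    ratio = target / sum(areas)

    # Scale everything if the smallest and largest dwellings stay reasonable.
    if MIN_AREA <= min(areas) * ratio and max(areas) * ratio <= MAX_AREA:
        return [a * ratio for a in areas]

    # Otherwise add or remove randomly selected dwellings.
    for _ in range(max_iter):
        share = sum(areas) / floor_area
        if low <= share <= high:
            break
        if share > high:
            if len(areas) == 1:
                break  # cannot remove the last dwelling
            areas.pop(rng.randrange(len(areas)))
        else:
            areas.append(rng.choice(areas))  # assumption: copy an existing dwelling
    return areas
```

For example, `correct_building([30, 40, 50], 200, (0.70, 0.90), random.Random(0))` rescales the three dwellings, since the scaled extremes remain between 10 and 250 m².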
Building level records¶
Once the inference and living area calculations have been performed, the following attributes are added to the buildings :
‘living_area’
‘living_area_share’
‘heating_system’
Application¶
We use the “Peuple-Boivin-St-Jacques” district of Saint-Etienne to illustrate the results obtained by the dwelling inference algorithm.

Building construction year classes of district “Peuple-Boivin-St-Jacques, Saint-Etienne”.¶

Building heating system of district “Peuple-Boivin-St-Jacques, Saint-Etienne”.¶

Living area vs floor area for buildings in district “Peuple-Boivin-St-Jacques, Saint-Etienne”.¶