In Part 1, we learned how to analyze housing rental geospatial data to find nearby rentals as well as the cheapest rentals in the vicinity. In this article, we will explore how to find the green space around a selected property and build a machine learning model that makes predictions based on geospatial features. Although these are slightly more complex use cases, we will see how easily they are accomplished with Netezza.
Use case 3: Find the green space around the selected property
For recreational activities, users might be interested in the green space close to a property. For this, we make use of national parks and state park campgrounds data.
To calculate the national park area, we use ST_Buffer to create a buffer region of 100 miles around the selected record and run ST_Intersection to find out the intersection between the national parks multi-polygon data and the buffer region. Next, we run ST_Area on the intersections found to calculate the total national park area around the property.
To calculate the number of state park campgrounds within a buffer region, we create a buffer region of 100 miles around the selected property using ST_Buffer and run ST_Intersects to find out how many campgrounds lie within the buffer region. We leverage the following Netezza geospatial operations:
ST_Intersection - Determines a geometry object representing the points shared by the specified geometries
ST_Area - Determines the area of the specified geometry object having a surface
ST_Intersects - Checks whether two geometries intersect
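As a quick way to get familiar with these functions before applying them across whole tables, the following minimal sketch buffers a single point by 100 miles and measures the resulting area. The coordinates are illustrative, and the ST_AsText/ST_WKTToSQL round trip simply mirrors the pattern used in the queries later in this article:

# Sketch only: buffer an illustrative point by 100 miles and compute the buffer's area in square miles
buffer_area = idadb.ida_query('''select inza..ST_Area(
        inza..ST_WKTToSQL(inza..ST_AsText(
            inza..ST_Buffer(inza..ST_Point(-122.41, 37.77), 100, 8, 'mile'))), 'mile') as area;''')
print(buffer_area)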
Load data into the national_parks and state_parks tables
Since we have green space information in two different datasets, we will load the data into two different tables. One is for national parks, on which the national park area is going to be calculated, and the other is for state park campgrounds, which will be used to find out the total number of campgrounds within a specified buffer.
Creating the tables:
state_parks table
HOUSING.ADMIN(ADMIN)=> CREATE TABLE state_parks(park_point st_geometry(200));
national_parks table
HOUSING.ADMIN(ADMIN)=> CREATE TABLE national_parks(the_geom st_geometry(64000));
Inserting data into the national_parks table:
We will use geopandas to load the data in the client environment, then use insert statements to load it into the Netezza database. The national parks dataset can be found at Natural Earth.
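A minimal sketch of that load-and-insert flow follows; the shapefile name and the GeoDataFrame variable are illustrative, so adjust them to wherever you saved the download:

import geopandas as gpd

# Read the national parks polygons on the client (file name is illustrative)
national_parks_gdf = gpd.read_file('ne_10m_parks_and_protected_lands_area.shp')

# Insert each (multi)polygon as WKT, converting it to a Netezza geometry with ST_WKTToSQL
for _, row in national_parks_gdf.iterrows():
    idadb.ida_query(f'''INSERT INTO national_parks(the_geom) VALUES (
        inza..ST_WKTToSQL('{row['geometry']}'))''')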
Inserting data into the state_parks table:
Since the state parks data is spread across several files, we need to read each file and insert the point records into our state_parks table. The CSV files can be found at U.S. and Canada campgrounds.
# west, canadacamp, southcamp, midwest, northeast, and southwest are dataframes
# previously read from the regional campground CSV files
parks_regions_list = [west, canadacamp, southcamp, midwest, northeast, southwest]
for region in parks_regions_list:
    for _, row in region.iterrows():
        idadb.ida_query(f'''INSERT INTO state_parks(park_point) VALUES (
            inza..ST_WKTToSQL('{row['geometry']}'))''')
Calculating the state park count for every record:
We calculate the state park count within 100 miles of every record and save the results under a state_parks_count column in the cleaned_properties table.
Adding column state_parks_count to the cleaned_properties table:
idadb.ida_query('''ALTER TABLE cleaned_properties add STATE_PARKS_COUNT INTEGER NOT NULL DEFAULT 0;''')
In the following query, we first calculate the 100-mile buffer region around each property, then run an ST_Intersects operation to find the campgrounds that lie within that buffer. We then count the campgrounds for which ST_Intersects returned true and update the state_parks_count column for each property:
#Updating the state park count for each record
idadb.ida_query(f'''update cleaned_properties set cleaned_properties.state_parks_count = parks_count.count from
    (select temp.id, count(temp.c) as count from
        (select cleaned_properties.id, inza..ST_Intersects(
            inza..ST_WKTToSQL(inza..ST_AsText(
                inza..ST_Buffer(cleaned_properties.point, 100, 8, 'mile'))),
            state_parks.park_point) as c
         from cleaned_properties, state_parks where c = 'True') as temp
     group by temp.id) as parks_count
where parks_count.id = cleaned_properties.id;''')
Calculating the national parks area for every record:
To save the national park area for each record, we will use the green_area table and insert the IDs of all the records present in the cleaned_properties table into the green_area table initially:
idadb.ida_query(''' create table green_area(id bigint,national_parks_area double not null default 0.0);''')
idadb.ida_query('''insert into green_area(select id from cleaned_properties);''')
We first calculate the 100-mile buffer region around each house record, then run ST_Intersection() to compute the intersection between the buffer of each record and the national parks geometry. Next, we calculate the area for all the records for which ST_Intersection() did not return "POINT EMPTY" and sum the areas, grouping by record ID. Finally, we update the national_parks_area column in the green_area table with the sum of all intersecting areas for each record ID:
#Get the national_parks_area for each record
idadb.ida_query(f'''update green_area set green_area.national_parks_area = d.sum from
    (select f.id, SUM(f.area) as sum from
        (select temp.id, inza..ST_Area(inza..ST_WKTToSQL(temp.c), 'mile') as area from
            (select cleaned_properties.id, inza..ST_AsText(inza..ST_Intersection(
                inza..ST_WKTToSQL(inza..ST_AsText(
                    inza..ST_Buffer(cleaned_properties.point, 100, 8, 'mile'))),
                national_parks.the_geom)) as c
             from cleaned_properties, national_parks) as temp
         where temp.c != 'POINT EMPTY') as f
     group by f.id) as d
where d.id = green_area.id;''')
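As an optional sanity check, a few of the computed areas can be inspected directly (purely illustrative):

# Spot-check a handful of records
print(idadb.ida_query('select id, national_parks_area from green_area limit 5'))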
Queries.py:
def get_green_area(record_id):
    try:
        record_selected_by_user = idadb.ida_query(f'''SELECT inza..ST_AsText(POINT) as POINT, REGION from cleaned_properties where id = {record_id}''')
        #Getting the properties in the same region, ordered by state_parks_count and national_parks_area
        ranked_properties = idadb.ida_query(f'''select green_area.national_parks_area, temp.state_parks_count, temp.ID,
            temp.PRICE, temp.REGION, temp.BEDS, temp.BATHS, temp.SQFEET, temp.CATS_ALLOWED, temp.DOGS_ALLOWED,
            temp.SMOKING_ALLOWED, temp.WHEELCHAIR_ACCESS, temp.ELECTRIC_VEHICLE_CHARGE, temp.COMES_FURNISHED,
            temp.LAT as "lat", temp.LON as "lon"
            from (SELECT * from cleaned_properties where
                cleaned_properties.region = '{record_selected_by_user['REGION'][0]}') as temp
            inner join green_area on green_area.id = temp.id
            order by state_parks_count desc, national_parks_area desc;''')
        green_information = [200, ranked_properties]
        return green_information
    except:
        return [404, 'Unknown error happened! Green information could not be found, please try again.']
Output visual:
Selecting Ranking properties according to their national park area and park count displays both the state park count and the national park area within a 100-mile buffer region of the property, output in table format and ranked in decreasing order.
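Below is a minimal sketch of how the front end could call this query and render the ranked table, assuming a Streamlit app and a record_id taken from the user's selection (both are illustrative):

import streamlit as st

# record_id comes from the property the user selected in the UI (illustrative)
green_information = get_green_area(record_id)
if green_information[0] == 200:
    # Ranked by state_parks_count, then national_parks_area, both descending
    st.dataframe(green_information[1])
else:
    st.error(green_information[1])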
Use case 4: Predict house prices using geospatial features
In this use case, we will add geospatial context to our input data to build a more robust rental price prediction machine learning model. For many users, proximity to good schools is a key factor in picking a home. To capture this insight, we add a new geospatial feature that measures the distance between each property and top-rated schools, then we build a machine learning model that uses this feature, along with the others, to predict rental prices more accurately.
As with the properties dataset, we have lat and lon columns in public_schools that provide the geographic coordinates. So, we first add a geometry column to the table and create point data for each school record using those two columns:
idadb.ida_query('''ALTER TABLE Public_Schools ADD point st_geometry(200) ''')
idadb.ida_query(f'''UPDATE Public_Schools set point=inza..ST_POINT(Public_Schools.LON,
Public_Schools.LAT) WHERE OBJECTID = Public_Schools.OBJECTID;''')
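The update below assumes that cleaned_properties already has a distance_to_schools column. If it does not, it can be added the same way as state_parks_count; a minimal sketch (the default value is illustrative):

# Add the distance_to_schools column before running the update (sketch)
idadb.ida_query('''ALTER TABLE cleaned_properties add DISTANCE_TO_SCHOOLS DOUBLE NOT NULL DEFAULT 0.0;''')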
We join the properties table with public_schools table, group by property ID, and calculate the min distance to schools for each property. We further join this temp table with the properties table to update the distance_to_schools column for each property:
idadb.ida_query('''update cleaned_properties set cleaned_properties.distance_to_schools = temp.distance_to_schools
    from (select cleaned_properties.id, MIN(inza..ST_DISTANCE(
            inza..ST_Transform(cleaned_properties.point, 4326),
            inza..ST_Transform(Public_Schools.point, 4326),
            'mile')) as distance_to_schools
          from Public_Schools, cleaned_properties
          where cleaned_properties.REGION = LOWER(Public_Schools.CITY)
          GROUP BY cleaned_properties.id) as temp
    where temp.id = cleaned_properties.id;''')
Build machine learning model and generate predictions:
Connect to the cleaned_properties table:
idadf = IdaDataFrame(idadb, "CLEANED_PROPERTIES")
In-database function:
The preprocessed data, which is stored in the cleaned_properties table, is passed to the in-database function as the input dataframe. The data is then divided into training and testing sets, the model is built on the training data using a gradient-boosting regression algorithm, and finally the model is scored on the test data:
code_house_prediction_model = """def house_prediction_model(self, df):
    import numpy as np
    import pandas as pd
    import sklearn
    import pickle
    import subprocess
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import r2_score, mean_squared_error

    data = df.copy()

    #Replacing zero distances (no matching school) with a large placeholder value
    data['DISTANCE_TO_SCHOOLS'] = data['DISTANCE_TO_SCHOOLS'].replace(to_replace=0, value=1000000)

    #Capping outliers in the distance column using quantile-based limits
    columns_with_outliers = ['DISTANCE_TO_SCHOOLS']
    for i in columns_with_outliers:
        column = data[i]
        first = column.quantile(0.05)
        third = column.quantile(0.90)
        iqr = third - first
        upper_limit = third + 1.5 * iqr
        lower_limit = first - 1.5 * iqr
        column.mask(column > upper_limit, upper_limit, inplace=True)
        column.mask(column < lower_limit, lower_limit, inplace=True)

    #Encoding the categorical columns
    encoder = {}
    categorical_columns_list = []
    for i in data.columns:
        if data[i].dtype == 'object':
            categorical_columns_list.append(i)
            encoder[i] = LabelEncoder()
            encoder[i].fit(list(data[i].values))
            data[i] = encoder[i].transform(list(data[i].values))

    #Splitting into training and testing datasets
    y = data['PRICE']
    X = data.drop(columns=['PRICE'], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, train_size=0.80, random_state=100)

    #Gradient boosting regression
    gbr = GradientBoostingRegressor()
    gbr.fit(X_train, y_train)
    y_pred_gbr = gbr.predict(X_test)
    print(y_pred_gbr)
    r2_score = gbr.score(X_test, y_test)
    print(r2_score)

    #Cross-validated RMSE, normalized by the maximum price
    cv_scores = np.sqrt(-cross_val_score(gbr, X, y, scoring='neg_mean_squared_error', cv=2))
    final_score = np.mean(cv_scores / y.max())

    #Assembling the prediction output rows
    predictions = X_test.copy()
    for column in categorical_columns_list:
        predictions[column] = encoder[column].inverse_transform(list(predictions[column].values))
    predictions['PRICE_PREDICTED'] = y_pred_gbr
    predictions['ACCURACY'] = r2_score

    def print_output(x):
        row = [x['ID'], x['REGION'], x['TYPE'], x['SQFEET'], x['BEDS'],
               x['BATHS'], x['CATS_ALLOWED'], x['DOGS_ALLOWED'], x['SMOKING_ALLOWED'],
               x['WHEELCHAIR_ACCESS'], x['ELECTRIC_VEHICLE_CHARGE'],
               x['COMES_FURNISHED'], x['LAUNDRY_OPTIONS'], x['PARKING_OPTIONS'],
               x['DISTANCE_TO_SCHOOLS'], x['PRICE_PREDICTED'], x['ACCURACY']]
        self.output(row)

    predictions.apply(print_output, axis=1)"""
Using NZFunGroupedApply, we push the custom function into the database:
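The exact call depends on your nzpyida version; the following sketch assumes the nzpyida.ae import path and the df, index, code_str, fun_name, output_table, and output_signature parameters, with illustrative column types:

# Sketch only: parameter names and column types are assumptions, adjust to your nzpyida version
from nzpyida.ae import NZFunGroupedApply

output_signature = {'ID': 'int', 'REGION': 'str', 'TYPE': 'str', 'SQFEET': 'double',
                    'BEDS': 'double', 'BATHS': 'double', 'CATS_ALLOWED': 'int',
                    'DOGS_ALLOWED': 'int', 'SMOKING_ALLOWED': 'int', 'WHEELCHAIR_ACCESS': 'int',
                    'ELECTRIC_VEHICLE_CHARGE': 'int', 'COMES_FURNISHED': 'int',
                    'LAUNDRY_OPTIONS': 'str', 'PARKING_OPTIONS': 'str',
                    'DISTANCE_TO_SCHOOLS': 'double', 'PRICE_PREDICTED': 'double', 'ACCURACY': 'double'}

nz_apply = NZFunGroupedApply(df=idadf,
                             index='REGION',                        # partition the data by region
                             code_str=code_house_prediction_model,  # the function defined above
                             fun_name='house_prediction_model',
                             output_table='PREDICTIONS',
                             output_signature=output_signature)
result = nz_apply.get_result()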
The output_signature parameter reflects the data types of the result, and the output_table parameter names the table where the result of the function execution is stored. The results are as follows:
       ID          REGION       TYPE       SQFEET  BEDS  BATHS  ...  COMES_FURNISHED  LAUNDRY_OPTIONS  PARKING_OPTIONS     DISTANCE_TO_SCHOOLS  PRICE_PREDICTED  ACCURACY
0      7049959793  SF bay area  house      2324.0  4.0   2.5    ...  0                w/d in unit      off-street parking  1000000.000000       3414.922639     0.484706
1      7046310307  SF bay area  apartment  1220.0  1.0   1.0    ...  0                laundry on site  no parking          1000000.000000       3201.676379     0.484706
2      7045320682  SF bay area  condo       812.0  2.0   1.0    ...  0                laundry in bldg  off-street parking  1000000.000000       2547.931581     0.484706
3      7037718327  SF bay area  apartment  1200.0  2.0   1.0    ...  0                laundry in bldg  attached garage     1000000.000000       3532.875489     0.484706
4      7049933172  SF bay area  apartment   981.0  2.0   2.0    ...  0                w/d in unit      carport             1000000.000000       2977.967890     0.484706
...
53143  7024597472  yuma         apartment   700.0  2.0   1.0    ...  0                laundry on site  carport             0.519920              725.053769     0.516445
53144  7046821594  yuma         apartment   475.0  1.0   1.0    ...  0                laundry on site  carport             0.519920              669.101039     0.516445
53145  7041564055  yuma         apartment   845.0  2.0   1.0    ...  0                laundry on site  carport             0.418105              762.912633     0.516445
53146  7036317850  yuma         apartment   722.0  2.0   2.0    ...  1                w/d in unit      off-street parking  0.519920              715.739956     0.516445
53147  7042498268  yuma         apartment   475.0  1.0   1.0    ...  0                laundry on site  carport             0.519920              667.422386     0.516445
Results: Since we chose index=region, the data was split into region-based partitions, and the function was executed against each partition in parallel. As a result, different models for different regions were created, and the predictions from all the models were then merged and output in the predictions table.
Output visual:
After training the model successfully, the results are output in the predictions table. We load all the records from the predictions table into a Streamlit widget.
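A minimal sketch of that last step, assuming the Streamlit app already holds the idadb connection:

import streamlit as st

# Read the merged predictions (the table name comes from the output_table parameter) and render them
predictions_df = idadb.ida_query('select * from predictions')
st.dataframe(predictions_df)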
Summary
In this series of articles, we have explored how to use Netezza's geospatial capabilities (the geospatial operations of the nzspatial_esri cartridge) for analyzing geospatial data. Netezza geospatial operations are easy to use and can handle enormous datasets quickly, delivering a powerful in-database analytics experience. We also walked through several use cases on the housing dataset to show how these core geospatial operations can be leveraged to solve business needs that arise in the geospatial analytics domain.