In this article, you will learn how to use Netezza to analyze and build machine learning models for geospatial data. Geospatial data is essentially the data associated with a particular location (the records in this type of information set will have coordinates, address, city, and postal or ZIP code included with them). The Netezza geospatial operations are completely in-database, meaning that the data is never moved out of the database. Additionally, these operations (backed by SQL implementations) leverage the massively parallel processing (MPP) architecture of Netezza to deliver results at lightning speeds. This unique advantage of Netezza makes it a standout in the field of geospatial analytics.
We will use U.S. housing rental records (listings in various U.S. locations) as the input data and build several use cases around it to highlight some of the most important operations. We will also use the Streamlit library to build a web application for our use cases.
Installation
For prerequisite installation, refer to Installation.
Accessing the properties dataset
Looking at the properties dataset, one notices several non-geospatial features, such as region, type, beds, baths, and square footage, for each property.
The user could apply filters based on those features to find the properties of choice.
With streamlit-aggrid (a custom Streamlit component), the result can be rendered in an interactive, customizable grid format. This allows the user to perform sort and filter operations on the result itself.
The next section highlights how bringing in the geospatial aspects would open up more possibilities for the user in finding a perfect home.
Geospatial use cases
Before we get to the use cases, we do a quick preprocessing pass to ensure that the data is clean. We apply a custom in-database function (see Appendix A) to compute a cleaned_properties table. The function first drops the URL columns, then imputes the categorical columns, replacing missing values with the most frequent value of the respective column. It then drops any remaining rows with null values (which handles the numeric columns, including lat and long). Finally, it handles outliers using the pandas.Series.mask() function, capping every value in a column that is greater than the upper limit or lower than the lower limit.
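The outlier-capping step can be illustrated in plain pandas. This is a standalone sketch of the logic used in Appendix A (same quantiles, 0.05 and 0.90), not the in-database function itself:

```python
import pandas as pd

def cap_outliers(column: pd.Series) -> pd.Series:
    """Cap values outside quantile-based limits, mirroring the
    preprocessing step in Appendix A."""
    first = column.quantile(0.05)
    third = column.quantile(0.90)
    iqr = third - first
    upper_limit = third + 1.5 * iqr
    lower_limit = first - 1.5 * iqr
    # pandas.Series.mask replaces values where the condition is True
    column = column.mask(column > upper_limit, upper_limit)
    column = column.mask(column < lower_limit, lower_limit)
    return column

# An extreme rent of 50000 gets capped to the upper limit
prices = pd.Series([800.0, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 50000])
capped = cap_outliers(prices)
```

In-bounds values pass through unchanged; only values beyond the computed limits are replaced.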
Using NZFunTApply, we push the custom function inside the database:
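A hedged sketch of what that push looks like (the module path, constructor arguments, and table names here are assumptions based on typical nzpyida usage, not the article's exact call):

```python
# Hypothetical sketch: pushing cleanup_dataset into the database with
# NZFunTApply. Assumes an open IdaDataBase connection `idadb` and an
# IdaDataFrame `properties_idadf` over the raw properties table.

code_cleanup_dataset = """def cleanup_dataset(self, df):
    # ... body from Appendix A ...
    pass
"""

def push_cleanup(properties_idadf):
    # Imported lazily so this sketch stays importable without a Netezza driver.
    from nzpyida.ana import NZFunTApply  # assumed module path

    nz_apply = NZFunTApply(
        df=properties_idadf,              # IdaDataFrame over the raw table
        code_str=code_cleanup_dataset,    # source of the function to push
        fun_name="cleanup_dataset",       # entry point inside code_str
        output_table="cleaned_properties",
        merge_output_with_df=False,       # assumed flag
    )
    return nz_apply.get_result()
```

Running the pushed function produces the cleaned_properties table shown below.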
   ID          REGION      PRICE   TYPE       SQFEET  BEDS  BATHS  ...  WHEELCHAIR_ACCESS  ELECTRIC_VEHICLE_CHARGE  COMES_FURNISHED  LAUNDRY_OPTIONS  PARKING_OPTIONS     LAT      LON
0  7039061606  birmingham  1195.0  apartment  1908.0  3.0   2.0    ...  0                  0                        0                laundry on site  street parking      33.4226  -86.706497
1  7041970863  birmingham  1120.0  apartment  1319.0  3.0   2.0    ...  0                  0                        0                laundry on site  off-street parking  33.3755  -86.804497
2  7041966914  birmingham   825.0  apartment  1133.0  1.0   1.5    ...  0                  0                        0                laundry on site  street parking      33.4226  -86.706497
3  7041966936  birmingham   800.0  apartment   927.0  1.0   1.0    ...  0                  0                        0                laundry on site  street parking      33.4226  -86.706497
4  7041966888  birmingham   785.0  apartment  1047.0  2.0   1.0    ...  0                  0                        0                laundry on site  street parking      33.4226  -86.706497
Create a geometry column and populate data
Since geospatial operations can only be applied to geometry columns, we first need to create a column of type geometry for the cleaned_properties table:
idadb.ida_query('''ALTER TABLE cleaned_properties ADD point st_geometry(200) ''')
Now we populate the geometry column. Lat and long columns in the dataset already capture the coordinate information, so we make use of these columns to generate the point for each property:
idadb.ida_query(f'''UPDATE cleaned_properties set point=inza..ST_POINT(cleaned_properties.lon,
cleaned_properties.lat) WHERE ID = cleaned_properties.ID;''')
Use case 1: Find nearest properties for a selected property of interest
Often, a user is interested in finding all the options nearest a property of interest. For example, if they can find a lower rental price within a short distance of a selected property, they might want to check it out before making a decision.
To address this, we create a buffer region first, based on the user’s distance preference. We take the geometry of the selected record and use the ST_Buffer function to calculate the buffer area around the selected record. Next, we run the ST_Contains function over all the records in the properties table and check whether each point/record lies in the buffer region, and we collect only the points lying in the region. After this, we present those properties to the user, sorted by price. We also can present the maximum, minimum, and average prices of all the records in our buffer region.
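As a local illustration of the buffer-and-contains idea (not the in-database implementation, which runs entirely inside Netezza), the same filtering can be sketched with a great-circle distance check:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance between two (lon, lat) points in miles."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3958.8 * 2 * asin(sqrt(a))

def properties_within(selected, candidates, buffer_miles):
    """Keep candidates whose point lies within buffer_miles of the selected
    property -- conceptually what ST_Buffer + ST_Contains compute in-database."""
    lon0, lat0 = selected
    return [
        c for c in candidates
        if haversine_miles(lon0, lat0, c["lon"], c["lat"]) <= buffer_miles
    ]
```

In the article itself, this check is expressed in SQL so that no data ever leaves the database.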
We leverage the following Netezza geospatial operations:
inza..ST_AsText – Used to get the geometry data in a well-known text format, which enables easy readability for the user
inza..ST_WKTToSQL – The opposite of inza..ST_AsText; creates a geometry that corresponds to a well-known text representation
inza..ST_Buffer – Used to create a buffer region (usually a polygon data) around the specified point
inza..ST_Contains – Takes two geometries and checks whether the first geometry contains the second
Queries.py:
from connection import *
import numpy as np
import pandas as pd

def get_houses_in_the_area(record_id, buffer_miles):
    try:
        # User selects a record
        record_selected_by_user = idadb.ida_query(f'''SELECT inza..ST_AsText(POINT) as POINT,REGION,PRICE from cleaned_properties where id = {record_id}''')
        # Get the buffer region around the selected point
        buffer_region = idadb.ida_query(f'''SELECT inza..ST_AsText(inza..ST_Buffer(inza..ST_WKTToSQL('{record_selected_by_user['POINT'][0]}'),{buffer_miles},8,'mile'))''')
        # Properties whose point lies within the buffer region
        properties_within_distance = idadb.ida_query(f'''SELECT price from cleaned_properties where inza..ST_Contains( inza..ST_WKTToSQL('{buffer_region[0]}'),inza..ST_WKTToSQL(inza..ST_AsText(cleaned_properties.point)));''')
        # Summary stats (max, min, mean) over the prices in the buffer region
        prices = properties_within_distance['PRICE']
        summary_list = [prices.max(), prices.min(), prices.mean()]
        # Full listings for the selected property's region, sorted by price
        properties_list = idadb.ida_query(f'''SELECT * from cleaned_properties where region='{record_selected_by_user['REGION'][0]}' order by PRICE''')
        return summary_list, properties_list
    except Exception:
        return 'Statistics for the region could not be found'
Output visual:
The user selects the Get Summary Information for Nearby rentals option, which displays all the rentals within X miles, along with summary stats.
Use case 2: Find cheaper rental options close to the selected property
This is a simple extension to use case 1, where we make use of the distances between the selected property and other properties in the buffer region. This could help us answer the user’s questions, like “How far should I go to find a 10-percent cheaper home when compared to the selected property?"
To address this, we run the ST_Distance function to find the distance between the selected record and every other record. Next, we sort these records in increasing order of distance and return the matching houses.
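The filter-and-sort logic can be sketched in plain Python (an illustration only; in the article, the filtering, distance computation, and ordering all happen in-database via ST_Distance):

```python
# Keep properties priced at least `discount` percent below the selected
# property and within `max_miles`, sorted nearest-first. Candidate dicts
# here are a stand-in for rows of the cleaned_properties table.
def discounted_nearby(selected_price, candidates, discount=10, max_miles=5):
    final_price = selected_price * (1 - discount / 100)
    matches = [
        c for c in candidates
        if c["price"] <= final_price and c["distance_miles"] < max_miles
    ]
    return sorted(matches, key=lambda c: c["distance_miles"])
```

For a selected property at 1,000, only candidates priced at 900 or less and closer than five miles survive, nearest first.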
Queries.py:
def get_discounted_houses(record_id):
    try:
        # User selects a record
        record_selected_by_user = idadb.ida_query(f'''SELECT inza..ST_AsText(POINT) as POINT,REGION,
        PRICE from cleaned_properties where id = {record_id}''')
        # Compute the discounted price threshold
        selected_house_price = record_selected_by_user['PRICE'][0]
        discount = 10
        final_price = selected_house_price * (1 - (discount / 100))
        # Calculate distances between properties and return all the records matching the condition
        final_houses_with_discounted_prices = idadb.ida_query(f'''select * from ( select cleaned_properties.*,inza..ST_Distance(
        inza..ST_Transform(inza..ST_WKTToSQL(inza..ST_AsText(cleaned_properties.point)), 4326),
        inza..ST_Transform(inza..ST_WKTToSQL('{record_selected_by_user['POINT'][0]}'),4326), 'mile') as
        distance_between_properties from cleaned_properties) as temp
        where temp.price<={final_price} and temp.distance_between_properties<5 order by temp.distance_between_properties;''')
        return final_houses_with_discounted_prices
    except Exception:
        return 'No houses found nearby at a discount'
Output visual:
The user selects the Search discounted houses around the area option, which displays all the rentals within X miles and at a discount of 10 percent less than the selected property.
Next step
Please see Part 2 for the exploration of two more use cases.
Appendix A: Preprocess in-database function
if idadb.exists_table("cleaned_properties"):
    idadb.drop_table("cleaned_properties")

code_cleanup_dataset = """def cleanup_dataset(self, df):
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    data = df.copy()
    data = data.drop(axis=1, columns=['URL', 'REGION_URL', 'IMAGE_URL'])
    # Replace the null values in categorical columns with the most frequent value of each column
    for col in data.columns:
        if data[col].dtype == 'object':
            imputer = SimpleImputer(missing_values='',
                                    strategy='most_frequent', fill_value='missing')
            data[col] = imputer.fit_transform(data[col].values.reshape(-1, 1))
    # Drop all the rows with null values in numerical columns; nulls in categorical columns are already handled
    data.dropna(inplace=True)
    # Outlier handling: cap values outside the quantile-based limits
    columns_with_outliers = ['PRICE', 'SQFEET', 'BEDS', 'BATHS']
    for i in columns_with_outliers:
        column = data[i]
        first = column.quantile(0.05)
        third = column.quantile(0.90)
        iqr = third - first
        upper_limit = third + 1.5 * iqr
        lower_limit = first - 1.5 * iqr
        column.mask(column > upper_limit, upper_limit, inplace=True)
        column.mask(column < lower_limit, lower_limit, inplace=True)
    predictions = data.copy()

    # Emit each cleaned row back to the database
    def print_output(x):
        row = [x['ID'], x['REGION'],
               x['PRICE'], x['TYPE'],
               x['SQFEET'], x['BEDS'],
               x['BATHS'], x['CATS_ALLOWED'], x['DOGS_ALLOWED'], x['SMOKING_ALLOWED'],
               x['WHEELCHAIR_ACCESS'], x['ELECTRIC_VEHICLE_CHARGE'],
               x['COMES_FURNISHED'], x['LAUNDRY_OPTIONS'], x['PARKING_OPTIONS'], x['LATITUDE'], x['LONGITUDE']]
        self.output(row)

    predictions.apply(print_output, axis=1)
"""