Article

Use Netezza to analyze and build machine learning models for geospatial data, Part 1

A powerful in-database analytics experience

By

Pratik Joseph Dabre,

Vinay Kasireddy

In this article, you will learn how to use Netezza to analyze and build machine learning models for geospatial data. Geospatial data is essentially the data associated with a particular location (the records in this type of information set will have coordinates, address, city, and postal or ZIP code included with them). The Netezza geospatial operations are completely in-database, meaning that the data is never moved out of the database. Additionally, these operations (backed by SQL implementations) leverage the massively parallel processing (MPP) architecture of Netezza to deliver results at lightning speeds. This unique advantage of Netezza makes it a standout in the field of geospatial analytics.

We will use U.S. housing rental records (listings in various U.S. locations) as the input data and build several use cases around it to highlight some of the most important operations. We will also use the Streamlit library to build a web application for our use cases.

Installation

For prerequisite installation, refer to Installation.

Upload the dataset to the database

The datasets are available as CSV files from Kaggle: house-rent-prediction-dataset and US Schools Dataset. Refer to Upload for steps.

Explore the properties dataset

Accessing the properties dataset, one notices several non-geospatial features such as region, type, beds, baths, and square footage for each property.

Image shows non-geospatial features

The user can apply filters based on those features to find the properties of choice.

Image shows feature selection

With streamlit-aggrid (a custom Streamlit component), the result can be rendered in an interactive, customizable grid format. This allows the user to sort and filter on the result itself.

Image shows interactive, customizable grid format

The next section highlights how bringing in the geospatial aspects would open up more possibilities for the user in finding a perfect home.

Geospatial use cases

Before we get to the use cases, we do a quick preprocessing pass to ensure that the data is clean. We apply a custom in-database function (see Appendix A: Preprocess in-database function) to compute a cleaned_properties table. This function first imputes missing values in the categorical columns with the most frequent value of each column, then drops the remaining rows with missing numeric values (including records that lack lat and long coordinates). Finally, it handles outliers with the pandas.Series.mask() function, capping any value in the affected columns that falls above the upper limit or below the lower limit.
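As a rough local illustration of these cleaning steps, here is a minimal pandas sketch on a toy frame (the column values are hypothetical); the in-database function in Appendix A follows the same recipe with more columns:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the properties table (hypothetical values).
df = pd.DataFrame({
    "TYPE": ["apartment", None, "house", "apartment", "apartment",
             "house", "apartment", "house", "apartment", "apartment"],
    "PRICE": [900.0, 950.0, 1000.0, np.nan, 1050.0,
              1100.0, 1100.0, 1150.0, 1200.0, 9999.0],
})

# 1. Impute categorical columns with the most frequent value.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# 2. Drop rows that still have missing numeric values.
df = df.dropna().reset_index(drop=True)

# 3. Cap outliers at quantile-based limits using Series.mask().
price = df["PRICE"]
first, third = price.quantile(0.05), price.quantile(0.90)
iqr = third - first
upper, lower = third + 1.5 * iqr, first - 1.5 * iqr
df["PRICE"] = price.mask(price > upper, upper).mask(price < lower, lower)
```

The extreme 9999.0 price ends up capped at the computed upper limit, while ordinary prices pass through unchanged.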

Using NZFunTApply, we push the custom function inside the database:

output_signature = {'ID':'int64','REGION':'str','PRICE':'double','TYPE':'str',
'SQFEET':'double','BEDS':'float','BATHS':'float','CATS_ALLOWED':'int','DOGS_ALLOWED':'int',
'SMOKING_ALLOWED':'int','WHEELCHAIR_ACCESS':'int','ELECTRIC_VEHICLE_CHARGE' : 'int','COMES_FURNISHED':'int',
'LAUNDRY_OPTIONS':'str','PARKING_OPTIONS':'str','LAT':'float','LON':'float'
}


nz_fun_t_apply = NZFunTApply(df=properties_idadf, code_str=code_cleanup_dataset,
                             parallel=False, fun_name="cleanup_dataset",
                             output_table='cleaned_properties', output_signature=output_signature)
result_idadf = nz_fun_t_apply.get_result()
result = result_idadf.as_dataframe()
print(result.head())

The output is as follows:

ID      REGION   PRICE       TYPE  SQFEET  BEDS  BATHS  ...  WHEELCHAIR_ACCESS  ELECTRIC_VEHICLE_CHARGE  COMES_FURNISHED  LAUNDRY_OPTIONS     PARKING_OPTIONS      LAT        LON
0  7039061606  birmingham  1195.0  apartment  1908.0   3.0    2.0  ...                  0                        0                0  laundry on site      street parking  33.4226 -86.706497
1  7041970863  birmingham  1120.0  apartment  1319.0   3.0    2.0  ...                  0                        0                0  laundry on site  off-street parking  33.3755 -86.804497
2  7041966914  birmingham   825.0  apartment  1133.0   1.0    1.5  ...                  0                        0                0  laundry on site      street parking  33.4226 -86.706497
3  7041966936  birmingham   800.0  apartment   927.0   1.0    1.0  ...                  0                        0                0  laundry on site      street parking  33.4226 -86.706497
4  7041966888  birmingham   785.0  apartment  1047.0   2.0    1.0  ...                  0                        0                0  laundry on site      street parking  33.4226 -86.706497

Create a geometry column and populate data

Since geospatial operations can only be applied to geometry columns, we first need to create a column of type geometry for the cleaned_properties table:

idadb.ida_query('''ALTER TABLE cleaned_properties ADD point st_geometry(200) ''')

Now we populate the geometry column. The lat and lon columns in the table already capture the coordinate information, so we use these columns to generate the point for each property:

idadb.ida_query('''UPDATE cleaned_properties SET point = inza..ST_POINT(cleaned_properties.lon,
             cleaned_properties.lat);''')
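For reference, ST_AsText and ST_WKTToSQL exchange geometries in the well-known text (WKT) format, where a point like the one created above is written with longitude (x) first. A tiny helper (hypothetical, not part of the article's code) makes the convention explicit:

```python
def to_wkt_point(lon: float, lat: float) -> str:
    """Format a lon/lat pair as a WKT POINT, x (longitude) first."""
    return f"POINT ({lon} {lat})"

wkt = to_wkt_point(-86.706497, 33.4226)
```

Keeping longitude first matches the (lon, lat) argument order used in the ST_POINT call above.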

Use case 1: Find nearest properties for a selected property of interest

Often, a user is interested in finding all the options nearest a property of interest. For example, if they can find a lower rental price within a short distance of a selected property, they might want to check it out before making a decision.

To address this, we create a buffer region first, based on the user’s distance preference. We take the geometry of the selected record and use the ST_Buffer function to calculate the buffer area around the selected record. Next, we run the ST_Contains function over all the records in the properties table and check whether each point/record lies in the buffer region, and we collect only the points lying in the region. After this, we present those properties to the user, sorted by price. We also can present the maximum, minimum, and average prices of all the records in our buffer region.

We leverage the following Netezza geospatial operations:

  1. inza..ST_AsText – Returns the geometry data in well-known text (WKT) format, which is easily readable by the user
  2. inza..ST_WKTToSQL – The inverse of inza..ST_AsText; creates a geometry from a well-known text representation
  3. inza..ST_Buffer – Creates a buffer region (usually a polygon) around the specified geometry
  4. inza..ST_Contains – Takes two geometries and checks whether the first geometry contains the second
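Outside the database, the buffer-and-contains combination applied to points amounts to a radius filter. A minimal pure-Python sketch (the listing values are hypothetical) using the haversine great-circle distance:

```python
import math

def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical listings: (id, lon, lat, price)
listings = [
    (1, -86.7065, 33.4226, 1195.0),
    (2, -86.8045, 33.3755, 1120.0),
    (3, -87.5000, 34.0000, 825.0),
]
center = (-86.7065, 33.4226)  # the selected property
radius = 10.0                 # miles, like the user's buffer preference

# Keep only the listings inside the radius -- what ST_Buffer + ST_Contains
# compute in-database over the whole table in parallel.
nearby = [rec for rec in listings
          if haversine_miles(center[0], center[1], rec[1], rec[2]) <= radius]
```

The in-database version scales this check across all records using Netezza's MPP architecture rather than a Python loop.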

Queries.py:

from connection import *
import numpy as np
import pandas as pd

def get_houses_in_the_area(record_id, buffer_miles):
    try:
        # User selects a record
        record_selected_by_user = idadb.ida_query(f'''SELECT inza..ST_AsText(POINT) as POINT, REGION, PRICE
            from cleaned_properties where id = {record_id}''')

        # Get the buffer region around the selected point
        buffer_region = idadb.ida_query(f'''SELECT inza..ST_AsText(inza..ST_Buffer(
            inza..ST_WKTToSQL('{record_selected_by_user['POINT'][0]}'), {buffer_miles}, 8, 'mile'))''')

        # Collect the prices of the properties within the buffer region
        properties_within_distance = idadb.ida_query(f'''SELECT price from cleaned_properties
            where inza..ST_Contains(inza..ST_WKTToSQL('{buffer_region[0]}'),
            inza..ST_WKTToSQL(inza..ST_AsText(cleaned_properties.point)));''')

        summary_list = [max(properties_within_distance), min(properties_within_distance),
                        np.mean(properties_within_distance)]

        properties_list = idadb.ida_query(f'''SELECT * from cleaned_properties
            where region = '{record_selected_by_user['REGION'][0]}' order by PRICE''')
        return summary_list, properties_list
    except Exception:
        return 'Statistics for the region could not be found'

Output visual:

The user selects the Get Summary Information for Nearby rentals option, which displays all the rentals within X miles, along with summary stats.

Image shows display all rentals within X miles result Image shows summary stats

Use case 2: Find cheaper rental options close to the selected property

This is a simple extension of use case 1, where we make use of the distances between the selected property and the other properties in the buffer region. This could help us answer user questions like "How far do I need to go to find a home that is 10 percent cheaper than the selected property?"

To address this, we use the ST_Distance function to compute the distance between each record and the selected record. Next, we sort the matching records in increasing order of distance and return them.
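The price-threshold arithmetic behind the query is easy to sketch locally (the listing values below are hypothetical):

```python
# Hypothetical nearby listings as (price, distance_in_miles) pairs.
listings = [(1195.0, 0.0), (1050.0, 1.2), (995.0, 3.4), (1100.0, 4.8), (900.0, 7.5)]

selected_price = 1195.0
discount = 10       # percent
max_distance = 5    # miles

# Same arithmetic as the query: keep listings at or below the discounted
# price and within the distance cap, ordered nearest first.
threshold = selected_price * (1 - discount / 100)  # 1075.5
matches = sorted((p for p in listings if p[0] <= threshold and p[1] < max_distance),
                 key=lambda p: p[1])
```

In the database, the same threshold and distance cap appear in the WHERE clause, and the ORDER BY handles the nearest-first sort.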

Queries.py:

def get_discounted_houses(record_id):
    try:
        # User selects a record
        record_selected_by_user = idadb.ida_query(f'''SELECT inza..ST_AsText(POINT) as POINT, REGION,
            PRICE from cleaned_properties where id = {record_id}''')

        # Compute the discounted price threshold
        selected_house_price = record_selected_by_user['PRICE'][0]
        discount = 10
        final_price = selected_house_price * (1 - (discount / 100))

        # Calculate distances between properties and return all the records matching the condition
        final_houses_with_discounted_prices = idadb.ida_query(f'''select * from (select cleaned_properties.*,
            inza..ST_Distance(
            inza..ST_Transform(inza..ST_WKTToSQL(inza..ST_AsText(cleaned_properties.point)), 4326),
            inza..ST_Transform(inza..ST_WKTToSQL('{record_selected_by_user['POINT'][0]}'), 4326), 'mile') as
            distance_between_properties from cleaned_properties) as temp
            where temp.price <= {final_price} and temp.distance_between_properties < 5
            order by temp.distance_between_properties;''')

        return final_houses_with_discounted_prices
    except Exception:
        return 'No houses found nearby at a discount'

Output visual:

The user selects the Search discounted houses around the area option, which displays all the rentals within X miles and at a discount of 10 percent less than the selected property.

Image shows discounted rentals within X miles Image shows summary stats

Next step

Please see Part 2 for the exploration of two more use cases.


Appendix A: Preprocess in-database function

if idadb.exists_table("cleaned_properties"):
 idadb.drop_table("cleaned_properties")
code_cleanup_dataset = """def cleanup_dataset(self,df):
import numpy as np
import pandas as pd
import sklearn
import pickle
import subprocess
from sklearn.impute import SimpleImputer
data = df.copy()
data = data.drop(axis =1 , columns = ['URL','REGION_URL','IMAGE_URL'])
#Replacing the null values in categorical columns with the most frequent value in those particular columns
for col in data.columns:
 if data[col].dtype=='object':
  imputer = SimpleImputer(missing_values='', 
  strategy= 'most_frequent' ,fill_value='missing')
  data[col] = imputer.fit_transform(data[col].values.reshape(-1,1))

#Dropping all the rows with null values in numerical columns as we already handled the null values in categorical columns
data.dropna(inplace=True)

#Outlier handling: cap values outside the quantile-based limits
columns_with_outliers = ['PRICE','SQFEET','BEDS','BATHS']
for i in columns_with_outliers:
 column = data[i]
 first = column.quantile(0.05)
 third = column.quantile(0.90)
 iqr = third - first
 upper_limit = third + 1.5 * iqr
 lower_limit = first - 1.5 * iqr
 #Assign back instead of masking in place to avoid chained-assignment issues
 data[i] = column.mask(column > upper_limit, upper_limit).mask(column < lower_limit, lower_limit)
predictions = data.copy()
def print_output(x):
 row = [x['ID'],x['REGION'],
 x['PRICE'], x['TYPE'],
 x['SQFEET'],x['BEDS'],
 x['BATHS'],x['CATS_ALLOWED'],x['DOGS_ALLOWED'],x['SMOKING_ALLOWED'],
 x['WHEELCHAIR_ACCESS'],x['ELECTRIC_VEHICLE_CHARGE'],
 x['COMES_FURNISHED'],x['LAUNDRY_OPTIONS'],x['PARKING_OPTIONS'],x['LAT'],x['LONG']]
 self.output(row)
predictions.apply(print_output, axis=1)