Article

Use Netezza to analyze and build machine learning models for geospatial data, Part 2

A powerful in-database analytics experience

By

Pratik Joseph Dabre,

Vinay Kasireddy

In Part 1, we learned how to analyze housing rental geospatial data to find nearby rentals as well as the cheapest rentals in the vicinity. In this article, we explore how to find the green space around a selected property and how to build a machine learning model that makes predictions based on geospatial features. Although these use cases are slightly more complex, we will see how easily they can be accomplished with Netezza.

Use case 3: Find the green space around the selected property

For recreational activities, users might be interested in the green space near a property. To provide this information, we use national parks data and state park campgrounds data.

To calculate the national park area, we use ST_Buffer to create a buffer region of 100 miles around the selected record and run ST_Intersection to find out the intersection between the national parks multi-polygon data and the buffer region. Next, we run ST_Area on the intersections found to calculate the total national park area around the property.

To calculate the number of state park campgrounds within a buffer region, we create a buffer region of 100 miles around the selected property using ST_Buffer and run ST_Intersects to find out how many campgrounds lie within the buffer region. We leverage the following Netezza geospatial operations:

  1. ST_Intersection - Determines a geometry object representing the points shared by the specified geometries
  2. ST_Area - Determines the area of the specified geometry object having a surface
  3. ST_Intersects - Checks whether the two geometries intersect
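
Before applying these operations to the full tables, here is a minimal, self-contained sketch of how they compose, using literal WKT geometries purely for illustration (as in the full queries later in this article, the buffer output is round-tripped through ST_AsText/ST_WKTToSQL before being passed to the other functions):

# Illustrative only: area of overlap between a 100-mile buffer and a polygon,
# and whether a second point falls inside the same buffer
print(idadb.ida_query('''SELECT
  inza..ST_Area(inza..ST_Intersection(
    inza..ST_WKTToSQL(inza..ST_AsText(
      inza..ST_Buffer(inza..ST_WKTToSQL('POINT(-122.4 37.7)'), 100, 8, 'mile'))),
    inza..ST_WKTToSQL('POLYGON((-124 36, -124 39, -121 39, -121 36, -124 36))')), 'mile') as overlap_sq_miles,
  inza..ST_Intersects(
    inza..ST_WKTToSQL(inza..ST_AsText(
      inza..ST_Buffer(inza..ST_WKTToSQL('POINT(-122.4 37.7)'), 100, 8, 'mile'))),
    inza..ST_WKTToSQL('POINT(-122.0 37.5)')) as point_in_range;'''))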

Load data into the national_parks and state_parks tables

Since the green space information comes from two different datasets, we load the data into two different tables: one for national parks, which is used to calculate the national park area, and one for state park campgrounds, which is used to count the campgrounds within a specified buffer.

Creating the tables:

state_parks table
HOUSING.ADMIN(ADMIN)=> CREATE TABLE state_parks(park_point st_geometry(200));

national_parks table
HOUSING.ADMIN(ADMIN)=> CREATE TABLE national_parks(the_geom st_geometry(64000));

Inserting data into the national_parks table:

We use geopandas to load the data in the client environment, then use INSERT statements to load it into the Netezza database. The national parks dataset can be found at Natural Earth.

import geopandas as gpd

# Read the national parks shapefile into a GeoDataFrame
fp = "final_parks/ne_10m_parks_and_protected_lands_area.shp"
data = gpd.read_file(fp)

# Insert each park geometry as WKT into the national_parks table
for row in data.iterrows():
    idadb.ida_query(f'''INSERT INTO national_parks(the_geom) VALUES (inza..ST_WKTToSQL('{row[1]['geometry']}'))''')

Inserting data into the state_parks table:

Since the state parks data is split across several regional files, we need to read each file and insert the point records into our state_parks table. The CSV files can be found on U.S. and Canada campgrounds.

southcamp = gpd.read_file("data/SouthCamp.csv")
southcamp = gpd.GeoDataFrame(southcamp,geometry=gpd.points_from_xy(southcamp.field_1, southcamp.field_2))

northeast = gpd.read_file("data/NortheastCamp.csv")
northeast = gpd.GeoDataFrame(northeast,geometry=gpd.points_from_xy(northeast.field_1, northeast.field_2))

midwest = gpd.read_file("data/MidwestCamp.csv")
midwest = gpd.GeoDataFrame(midwest,geometry=gpd.points_from_xy(midwest.field_1, midwest.field_2))

southwest = gpd.read_file("data/SouthwestCamp.csv")
southwest = gpd.GeoDataFrame(southwest,geometry=gpd.points_from_xy(southwest.field_1, southwest.field_2))

west = gpd.read_file("data/WestCamp.csv")
west = gpd.GeoDataFrame(west,geometry=gpd.points_from_xy(west.field_1, west.field_2))
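
The region list below also includes a canadacamp frame. Assuming the Canada campgrounds file follows the same naming pattern as the other regional files (the filename here is a guess), it is loaded the same way:

canadacamp = gpd.read_file("data/CanadaCamp.csv")  # assumed filename, following the pattern of the other regional files
canadacamp = gpd.GeoDataFrame(canadacamp, geometry=gpd.points_from_xy(canadacamp.field_1, canadacamp.field_2))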

Creating a list of all the regions together:

parks_regions_list = [west, canadacamp, southcamp, midwest, northeast, southwest]

# Insert each campground point as WKT into the state_parks table
for region in parks_regions_list:
    for i in region.iterrows():
        idadb.ida_query(f'''INSERT INTO state_parks(park_point) VALUES (inza..ST_WKTToSQL('{i[1]['geometry']}'))''')

Calculating the state park count for every record:

We calculate the state park count within 100 miles of every record and save the results under a state_parks_count column in the cleaned_properties table.

Adding column state_parks_count to the cleaned_properties table:

idadb.ida_query('''ALTER TABLE cleaned_properties add STATE_PARKS_COUNT INTEGER NOT NULL DEFAULT 0;''')

In this subquery, we first create the 100-mile buffer region around each property, then run ST_Intersects to find the campgrounds that lie within the buffer region. We then count, for each property, the campgrounds for which ST_Intersects returned true, and write that count to the state_parks_count column:

#Updating the state park count for each record
idadb.ida_query(f'''update cleaned_properties set cleaned_properties.state_parks_count = parks_count.count from
(select temp.id, count(temp.c) as count from
  (select cleaned_properties.id,
          inza..ST_Intersects(
            inza..ST_WKTToSQL(inza..ST_AsText(
              inza..ST_Buffer(cleaned_properties.point, 100, 8, 'mile'))),
            state_parks.park_point) as c
   from cleaned_properties, state_parks
   where c = 'True') as temp
 group by temp.id) as parks_count
where parks_count.id = cleaned_properties.id;''')
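
To spot-check the result, you can pull a few of the records with the highest campground counts (a quick sanity check):

# Sanity check: properties with the most campgrounds within 100 miles
print(idadb.ida_query('''select id, state_parks_count from cleaned_properties
                         order by state_parks_count desc limit 5;'''))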

Calculating the national parks area for every record:

To save the national park area for each record, we create a green_area table and initially insert into it the IDs of all the records present in the cleaned_properties table:

idadb.ida_query(''' create table green_area(id bigint,national_parks_area double not null default 0.0);''')

idadb.ida_query('''insert into green_area(select id from cleaned_properties);''')

We first calculate the 100-mile buffer region around each house record, then run the ST_Intersection() operation to compute the intersection between each record's buffer and the national parks geometry. Next, we calculate the area of every intersection for which ST_Intersection() did not return “POINT EMPTY”, and sum those areas, grouping by record ID. Finally, we update the national_parks_area column in the green_area table with the summed intersecting area for each record ID:

#Get the national_parks_area for each record
idadb.ida_query(f'''update green_area set green_area.national_parks_area = d.sum from
(select f.id, SUM(f.area) as sum from
  (select temp.id, inza..ST_Area(inza..ST_WKTToSQL(temp.c), 'mile') as area from
    (select cleaned_properties.id,
            inza..ST_AsText(inza..ST_Intersection(
              inza..ST_WKTToSQL(inza..ST_AsText(
                inza..ST_Buffer(cleaned_properties.point, 100, 8, 'mile'))),
              national_parks.the_geom)) as c
     from cleaned_properties, national_parks) as temp
   where temp.c != 'POINT EMPTY') as f
 group by f.id) as d
where d.id = green_area.id;''')
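
A similar spot check shows the records surrounded by the largest national park area:

# Sanity check: properties with the largest national park area within 100 miles
print(idadb.ida_query('''select id, national_parks_area from green_area
                         order by national_parks_area desc limit 5;'''))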

Queries.py:

def get_green_area(record_id):
    try:
        # The property the user selected; we only need its region
        record_selected_by_user = idadb.ida_query(f'''SELECT inza..ST_AsText(POINT) as POINT, REGION from cleaned_properties where id = {record_id}''')

        # Getting the properties in the same region, ordered by state park count and national park area
        ranked_properties = idadb.ida_query(f'''select green_area.national_parks_area, temp.state_parks_count, temp.ID,
        temp.PRICE, temp.REGION, temp.BEDS, temp.BATHS, temp.SQFEET, temp.CATS_ALLOWED, temp.DOGS_ALLOWED, temp.SMOKING_ALLOWED, temp.WHEELCHAIR_ACCESS,
        temp.ELECTRIC_VEHICLE_CHARGE, temp.COMES_FURNISHED, temp.LAT as "lat", temp.LON as "lon"
        from (SELECT * from cleaned_properties where cleaned_properties.region = '{record_selected_by_user['REGION'][0]}') as temp
        inner join green_area on green_area.id = temp.id
        order by state_parks_count desc, national_parks_area desc;''')

        green_information = [200, ranked_properties]
        return green_information

    except Exception:
        return [404, 'Green information could not be found. Please try again.']
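
The front end can then call this helper and render the ranked dataframe. For example (the record ID is just one taken from the dataset; the helper returns a status code followed by either the dataframe or an error message):

status, green_data = get_green_area(7049959793)
if status == 200:
    print(green_data.head())
else:
    print(green_data)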

Output visual:

Selecting Ranking properties according to their national park area and park count displays both the park_count and the national_parks_area within a 100-mile buffer region of the property, output in a table format and ranked in decreasing order.

Image shows park_count within a buffer region of 100 miles
Image shows national_parks_area within a buffer region of 100 miles

Use case 4: Predict house prices using geospatial features

In this use case, we add geospatial context to our input data to build a more robust rental price prediction machine learning model. For many users, proximity to good schools is a key factor in picking a home. To capture this insight, we add a new geospatial feature that measures the distance from each property to the schools around it, then build a machine learning model that uses this feature, along with the other features, to predict rental prices more accurately.

As with the properties dataset, we have lat and lon columns in public_schools that provide the geographic coordinates. So, we first add a geometry column to the table and create point data for each school record using those two columns:

idadb.ida_query('''ALTER TABLE Public_Schools ADD point st_geometry(200) ''')


idadb.ida_query(f'''UPDATE Public_Schools set point=inza..ST_POINT(Public_Schools.LON,
             Public_Schools.LAT) WHERE OBJECTID = Public_Schools.OBJECTID;''')

Add a geospatial feature

Create a new column called DISTANCE_TO_SCHOOLS:

idadb.ida_query('ALTER TABLE cleaned_properties add column DISTANCE_TO_SCHOOLS double;')

We join the properties table with the public_schools table, group by property ID, and calculate the minimum distance to a school for each property. We then join this temporary result back to the properties table to update the distance_to_schools column for each property:

idadb.ida_query('''update cleaned_properties set cleaned_properties.distance_to_schools = temp.distance_to_schools 
from (select cleaned_properties.id,MIN(inza..ST_DISTANCE(
inza..ST_Transform(cleaned_properties.point, 4326), 
inza..ST_Transform(Public_Schools.point,4326),
'mile')) as distance_to_schools from Public_Schools,cleaned_properties 
where cleaned_properties.REGION = LOWER(Public_Schools.CITY)
GROUP BY cleaned_properties.id) as temp
where temp.id = cleaned_properties.id; ''')
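
As a quick check of the new feature, the nearest-school distances can be inspected for a handful of records (a minimal sketch):

# Inspect the computed nearest-school distance for a few properties
print(idadb.ida_query('''select id, region, distance_to_schools from cleaned_properties
                         where distance_to_schools is not null limit 5;'''))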

Build machine learning model and generate predictions:
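
If the idadb connection from Part 1 is not already available in your session, it can be re-created along these lines (a sketch only; the DSN and credentials are placeholders, and the exact import path for NZFunGroupedApply may vary with your nzpyida version):

from nzpyida import IdaDataBase, IdaDataFrame
from nzpyida.ae import NZFunGroupedApply  # import path may differ across nzpyida versions

# Placeholder connection details -- substitute your own data source name and credentials
idadb = IdaDataBase(dsn='HOUSING', uid='admin', pwd='password')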

Connect to the cleaned_properties table:

cleaned_properties_idadf = IdaDataFrame(idadb, "CLEANED_PROPERTIES")

In-database function:

The preprocessed data, stored in the cleaned_properties table, is passed to the in-database function as the input dataframe. The data is divided into training and testing sets, a model is built with a gradient boosting regression algorithm on the training data, and the model is then scored on the test data:

code_house_prediction_model = """def house_prediction_model(self, df):
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import r2_score, mean_squared_error

    data = df.copy()

    # Replace missing school distances (stored as 0) with a large sentinel value,
    # then cap outliers in that column
    data['DISTANCE_TO_SCHOOLS'] = data['DISTANCE_TO_SCHOOLS'].replace(to_replace=0, value=1000000)
    columns_with_outliers = ['DISTANCE_TO_SCHOOLS']
    for i in columns_with_outliers:
        column = data[i]
        first = column.quantile(0.05)
        third = column.quantile(0.90)
        iqr = third - first
        upper_limit = third + 1.5 * iqr
        lower_limit = first - 1.5 * iqr
        column.mask(column > upper_limit, upper_limit, inplace=True)
        column.mask(column < lower_limit, lower_limit, inplace=True)

    # Encoding the categorical columns
    encoder = {}
    categorical_columns_list = []
    for i in data.columns:
        if data[i].dtype == 'object':
            categorical_columns_list.append(i)
            encoder[i] = LabelEncoder()
            encoder[i].fit(list(data[i].values))
            data[i] = encoder[i].transform(list(data[i].values))

    # Splitting into training and testing datasets
    y = data['PRICE']
    X = data.drop(columns=['PRICE'], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, train_size=0.80, random_state=100)

    # Gradient boosting regression
    gbr = GradientBoostingRegressor()
    gbr.fit(X_train, y_train)
    y_pred_gbr = gbr.predict(X_test)
    r2 = gbr.score(X_test, y_test)

    # Cross-validated RMSE, normalized by the maximum price
    cv_scores = np.sqrt(-cross_val_score(gbr, X, y, scoring='neg_mean_squared_error', cv=2))
    final_score = np.mean(cv_scores / y.max())

    # Decode the categorical columns and attach the predictions and model score
    predictions = X_test.copy()
    for column in categorical_columns_list:
        predictions[column] = encoder[column].inverse_transform(list(predictions[column].values))
    predictions['PRICE_PREDICTED'] = y_pred_gbr
    predictions['ACCURACY'] = r2

    # Emit one output row per test record
    def print_output(x):
        row = [x['ID'], x['REGION'], x['TYPE'], x['SQFEET'], x['BEDS'], x['BATHS'],
               x['CATS_ALLOWED'], x['DOGS_ALLOWED'], x['SMOKING_ALLOWED'],
               x['WHEELCHAIR_ACCESS'], x['ELECTRIC_VEHICLE_CHARGE'], x['COMES_FURNISHED'],
               x['LAUNDRY_OPTIONS'], x['PARKING_OPTIONS'], x['DISTANCE_TO_SCHOOLS'],
               x['PRICE_PREDICTED'], x['ACCURACY']]
        self.output(row)

    predictions.apply(print_output, axis=1)
"""

Using NZFunGroupedApply, we push the custom function into the database:

output_signature = {'ID':'int64','REGION':'str','TYPE':'str',
'SQFEET':'double','BEDS':'float','BATHS':'float','CATS_ALLOWED':'int','DOGS_ALLOWED':'int',
'SMOKING_ALLOWED':'int','WHEELCHAIR_ACCESS':'int','ELECTRIC_VEHICLE_CHARGE' : 'int','COMES_FURNISHED':'int',
'LAUNDRY_OPTIONS':'str','PARKING_OPTIONS':'str','DISTANCE_TO_SCHOOLS':'double','PRICE_PREDICTED':'double','ACCURACY':'double'
}



nz_fun_grouped_apply = NZFunGroupedApply(df=cleaned_properties_idadf,
    columns=['ID','REGION','TYPE','SQFEET','PRICE','BEDS','BATHS','CATS_ALLOWED','DOGS_ALLOWED',
             'SMOKING_ALLOWED','WHEELCHAIR_ACCESS','ELECTRIC_VEHICLE_CHARGE','COMES_FURNISHED',
             'LAUNDRY_OPTIONS','PARKING_OPTIONS','DISTANCE_TO_SCHOOLS'],
    code_str=code_house_prediction_model,
    index='REGION',
    fun_name="house_prediction_model",
    output_table='predictions',
    output_signature=output_signature)

result_idadf = nz_fun_grouped_apply.get_result()
result = result_idadf.as_dataframe()
print(result)

The output_signature parameter specifies the column names and data types of the result, and the output_table parameter names the table in which the result of the function execution is stored. The results are as follows:

   ID       REGION       TYPE  SQFEET  BEDS  BATHS  ...  COMES_FURNISHED  LAUNDRY_OPTIONS     PARKING_OPTIONS  DISTANCE_TO_SCHOOLS  PRICE_PREDICTED  ACCURACY
0      7049959793  SF bay area      house  2324.0   4.0    2.5  ...                0      w/d in unit  off-street parking       1000000.000000      3414.922639  0.484706
1      7046310307  SF bay area  apartment  1220.0   1.0    1.0  ...                0  laundry on site          no parking       1000000.000000      3201.676379  0.484706
2      7045320682  SF bay area      condo   812.0   2.0    1.0  ...                0  laundry in bldg  off-street parking       1000000.000000      2547.931581  0.484706
3      7037718327  SF bay area  apartment  1200.0   2.0    1.0  ...                0  laundry in bldg     attached garage       1000000.000000      3532.875489  0.484706
4      7049933172  SF bay area  apartment   981.0   2.0    2.0  ...                0      w/d in unit             carport       1000000.000000      2977.967890  0.484706
53143  7024597472         yuma  apartment   700.0   2.0    1.0  ...                0  laundry on site             carport             0.519920       725.053769  0.516445
53144  7046821594         yuma  apartment   475.0   1.0    1.0  ...                0  laundry on site             carport             0.519920       669.101039  0.516445
53145  7041564055         yuma  apartment   845.0   2.0    1.0  ...                0  laundry on site             carport             0.418105       762.912633  0.516445
53146  7036317850         yuma  apartment   722.0   2.0    2.0  ...                1      w/d in unit  off-street parking             0.519920       715.739956  0.516445
53147  7042498268         yuma  apartment   475.0   1.0    1.0  ...                0  laundry on site             carport             0.519920       667.422386  0.516445

Results: Since we chose index='REGION', the data was split into region-based partitions, and the function was executed against each partition in parallel. As a result, a separate model was created for each region, and the predictions from all of the models were merged and written to the predictions table.

Output visual:

After the model is trained successfully, the results are written to the predictions table. We load all the records from the predictions table into a Streamlit widget.
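
A minimal sketch of that last step, assuming a Streamlit app with the idadb connection already open:

import streamlit as st

# Pull the predictions table back to the client and render it as an interactive table
predictions_df = IdaDataFrame(idadb, 'PREDICTIONS').as_dataframe()
st.dataframe(predictions_df)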

Image shows predictions table output

Summary

In this series of articles, we explored how to use Netezza's geospatial capabilities (the geospatial operations of the nzspatial_esri cartridge) to analyze geospatial data. Netezza geospatial operations are easy to use and can process enormous datasets at speed, delivering a powerful in-database analytics experience. We also showcased several use cases built on the housing dataset to highlight how users can leverage these core geospatial operations to solve business needs that arise in the geospatial analytics domain.