Real Estate Analysis Python
Automating Real Estate Investment Analysis: Python Web …
Data mining in Python to automate real estate deal analysis (photo from Unsplash)

The goal is to build a Python web tool that is capable of analyzing investment properties. The tool will use data mining to find prices for properties, and then analyze the returns.

Index: Motivations, The App, The Functions, Summary and Going Forward

The application of bots or automation to trading and investments is not new. Multiple examples include stock trading bots tasked with buying or selling assets based on different models or indicators. Renaissance Technologies received the world's attention due to return rates from algorithmic investing that averaged 60%+ annualized over a 30-year span. The benefits of automation have the potential to transform businesses of varying size. Individual repetitive tasks can be automated by software developed in house, or by third-party SaaS products. For the individual retail investor, Python bots pose a promising solution to various elements of real estate investing. In this article, we examine automating the process of analyzing properties. Other processes that could be automated include: listing properties, sending notices to tenants, screening tenants (machine learning or AI), and even automatically dispatching maintenance requests.

Inputs and outputs
The program takes three inputs: the listing URL, the monthly rent price, and the property tax rate. It returns monthly cash flow, cap rate, and cash on cash return rate.

Cash flow — The profit each month after paying everything (mortgage, property management, repair allowances, vacancy expense)
Cap rate — The net income per year divided by the price of the asset (in percent)
Cash on cash return rate — Net income per year divided by the down payment used for the asset (in percent)

Packages
The following packages were used. Note for those working in the Anaconda environment: it appears that Streamlit is not currently available through this package manager. Streamlit was installed with pip, which might cause issues when pip and Anaconda are used together.

Requests — This package was used to access websites in Python via HTTP requests.
Beautiful Soup 4 — Used for web scraping and data mining. We are able to use this package to retrieve the HTML code that describes the content and styling of a website. Once the HTML code is retrieved, Beautiful Soup can be utilized to isolate specific parts of the site. For example, in this project, we use Beautiful Soup to fetch the house price.
Streamlit — This package makes deploying web apps super simple. The code was developed in a Jupyter notebook, then converted to scripts once it was working. Streamlit allowed for seamless deployment and minimal time spent on the user interface. Compared with the classic option of combining Python for the backend, Flask for deployment, and React for dynamic content, deployment is much more straightforward with Streamlit.

Writing the code
The functions that do the majority of the heavy lifting are price_mine, mortgage_monthly, and net_operating. These are the main functions that perform the following duties respectively:
Retrieving the listing price from the URL
Calculating the monthly mortgage cost
Finding the monthly net operating income after all expenses

price_mine | This function was designed to retrieve the house price from the listing. APIs could be used, but web scraping is empowering. The downside of web scraping is that changes to the site's structure need to be updated in the code. Here, the web scraping package used was Beautiful Soup.
The process is relatively simple: the web page is inspected using F12, then the desired element is found in the HTML code. This code can be isolated with Beautiful Soup to retrieve a specific part of the page. Once the price string is retrieved, the built-in Python string method replace is used to remove commas, dollar signs, and unnecessary spaces so that a float variable can be returned.

from requests import get
from bs4 import BeautifulSoup

def price_mine(url):
    #Currently this function takes an input of a URL and returns the listing price
    #The site it mines is Remax
    #The input must be a string; we can reformat the input to force this to work
    #Next we remove spaces, commas and dollar signs from the price string
    headers = ({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = get(url, headers=headers)
    response_text = response.text
    html_soup = BeautifulSoup(response_text, 'html.parser')
    prices = html_soup.find('h2', {'class': 'price'}).text
    prices = prices.replace(",", "")
    prices = prices.replace("$", "")
    prices = prices.replace(" ", "")
    prices = float(prices)
    return prices

mortgage_monthly | This function takes the listing price, mortgage length, and interest rate as inputs, and returns the monthly mortgage payment. There are many ways to calculate the monthly mortgage payment; no strong preference was made as to which method to use, and a generic algorithm that was fairly easy to implement was chosen.

def mortgage_monthly(price, years, percent):
    #This implements an approach to finding a monthly mortgage amount from the purchase price,
    #years and percent.
    #Sample input: (300000, 20, 4) = 2422
    percent = percent / 100
    down = down_payment(price, 20)
    loan = price - down
    months = years * 12
    interest_monthly = percent / 12
    interest_plus = interest_monthly + 1
    exponent = (interest_plus) ** (-1 * months)
    subtract = 1 - exponent
    division = interest_monthly / subtract
    payment = division * loan
    return(payment)

net_operating | This function takes the monthly rent, the tax rate, and the price as inputs, and returns the net operating income per month. The net operating income each month represents the cash left after: paying the mortgage (principal and interest), property taxes, a management fee (10% of rent per month), a property repairs allowance, and a vacancy allowance. The argument could be made that only the monthly interest payment constitutes an expense, since the principal builds equity. While this is true, our model wants to find out how much cash is left after paying everything. Individual investment analysis bots could change elements like this to personalize the calculations to the individual investor.

def net_operating(rent, tax_rate, price):
    #Takes the monthly rental amount, the tax rate and the purchase price as inputs
    #Uses management expense, amount for repairs, vacancy ratio
    #Example input: net_operating(1000, 1, 400, 200) #879.33
    #1000 - 16.67 (tax) - 100 (management) - 4 (repairs)
    mortgage_amt = mortgage_monthly(price, 20, 3)
    prop_management = rent * 0.10
    prop_tax = (price * (tax_rate / 100) / 12)
    prop_repairs = (price * 0.02) / 12
    vacancy = (rent * 0.02)
    #The lines above list all the expenses used and the formula for each
    net_income = rent - prop_management - prop_tax - prop_repairs - vacancy - mortgage_amt
    #Subtracting all expenses from rent
    output = [prop_management, prop_tax, prop_repairs, vacancy, net_income]
    return output

Other functions: Other functions used, such as cap_rate, calculated the ratio of net income to asset price as a percent.
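The remaining helpers (down_payment, cap_rate, and the cash on cash calculation) are not reproduced in the article. A minimal sketch of what they might look like, based only on the definitions given above, is shown below; the exact signatures and the 20% down-payment default are assumptions.

def down_payment(price, percent_down):
    #Down payment amount for a given purchase price (assumed helper used by mortgage_monthly)
    return price * (percent_down / 100)

def cap_rate(net_income_monthly, price):
    #Net income per year divided by the price of the asset, in percent
    return (net_income_monthly * 12) / price * 100

def cash_on_cash(net_income_monthly, price, percent_down=20):
    #Net income per year divided by the down payment used for the asset, in percent
    return (net_income_monthly * 12) / down_payment(price, percent_down) * 100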
The full list of functions is available on the project's GitHub repository but will be excluded from this article.

Interface conceptualizations
The idea was to have the inputs on the left-hand side of the page and the output on the right-hand side of the page. The inputs were positioned inside a sidebar; this way the inputs and outputs are visually separated.

Initial user interface concept sketches

Introducing Streamlit
A common way we could build this dashboard is to create a static website with HTML, deploy the back end using Flask, store values in some sort of database, and link everything using React. A new alternative deployment path that has advantages over this approach is called Streamlit.

Streamlit allows a fast transition from a Python script to a modern user experience. It also offers a straightforward and fast deployment path. The first step in the conversion was to replace the built-in Python input functions with Streamlit input boxes. The same replacement was made for the outputs. Once this is done, the Streamlit app can be deployed from the console and accessed through the external IP address (a minimal sketch of this conversion is shown at the end of this article).

Initial user interface built in Streamlit

Once the user interface was built in Streamlit, the code was modified to add a sidebar for the inputs as originally depicted in the concept sketches above.

User interface with sidebar for inputs

The final code
The final code is available on the GitHub repository for the project. Although groups such as Renaissance Technologies have been able to profit from mathematical models applied to investing, there are benefits for individual retail investors that can be implemented much more simply. Real estate investors can benefit from automation that handles many tasks that would previously require an assistant or tie up a lot of time. This was an example of using automation to reduce time spent on filtering deals. More deals could be reviewed by the investor if automated summary reports were generated and only the best assets were presented to a human. Mom and pop shops, real estate investors, and entrepreneurs can benefit from automation, not just Fortune 500 companies.
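As a rough illustration of the conversion described above, here is a minimal Streamlit sketch. It assumes the price_mine and net_operating functions from this article plus the hypothetical cap_rate and cash_on_cash helpers sketched earlier; widget labels and default values are illustrative, not the author's actual app.

import streamlit as st

st.title("Real estate investment analysis")

#Inputs live in the sidebar so the outputs stay on the main page
url = st.sidebar.text_input("Listing URL")
rent = st.sidebar.number_input("Monthly rent", value=1500)
tax_rate = st.sidebar.number_input("Property tax rate (%)", value=1.0)

if url:
    price = price_mine(url)
    expenses = net_operating(rent, tax_rate, price)
    net_income = expenses[-1]  #last element returned is the monthly net income
    st.write("Listing price:", price)
    st.write("Monthly cash flow:", round(net_income, 2))
    st.write("Cap rate (%):", round(cap_rate(net_income, price), 2))
    st.write("Cash on cash return (%):", round(cash_on_cash(net_income, price), 2))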
Exploring US Real Estate Values with Python
This post covers data exploration using machine learning and interactive plotting. If interested in running the examples, there is a complementary Domino project available.
Introduction
Models are at the heart of data science. Data exploration is vital to model development and is particularly important at the start of any data science project. Visualization tools help make the shape of the data more obvious, surface patterns that can easily hide in hundreds of rows of data, and can even assist in the modeling process itself. As Domino seeks to help data scientists accelerate their work, we reached out to AWP Pearson for permission to excerpt the chapter “Real Estate” from the book, Pragmatic AI: An Introduction to Cloud-Based Machine Learning by Noah Gift. Many thanks to AWP Pearson for providing the permissions to excerpt the work and enabling us to provide a complementary Domino project.
Chapter Introduction: Real Estate
Once you get on the playing field, it’s not about whether you’re liked or not liked. All that matters is to play at a high level and do whatever it takes to help your team win. That’s what it’s about. –LeBron James
Do you know of any good data sets to explore? This is one of the most-asked questions I get as a lecturer or when teaching a workshop. One of my go-to answers is the Zillow real estate data sets. The real estate market in the United States is something that every person living in the country has to deal with, and as a result, it makes for a great topic of conversation about ML.
Exploring Real Estate Values in the United States
Living in the San Francisco Bay Area makes someone think long and often about housing prices. There is a good reason for that. The median home prices in the Bay Area are accelerating at shocking rates. From 2010 to 2017, the median price of a single-family home in San Francisco went from approximately $775,000 to $1.5 million. This data will be explored with a Jupyter Notebook [and Domino project]. The entire project and its data can be checked out [in the complementary Domino project].
At the beginning of the notebook, several libraries are imported and Pandas is set to display float versus scientific notation.
import pandas as pd
pd.set_option("display.float_format", lambda x: "%.3f" % x)
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
from sklearn.cluster import KMeans
color = sns.color_palette()
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
%matplotlib inline
Next, data from Zillow for single-family homes is imported and described.
df.head()
df.describe()
Next, cleanup is done to rename a column and format the column type.
df.rename(columns={"RegionName": "ZipCode"}, inplace=True)
df["ZipCode"] = df["ZipCode"].map(lambda x: "{:.0f}".format(x))
df["RegionID"] = df["RegionID"].map(lambda x: "{:.0f}".format(x))
Getting the median values for all of the United States will be helpful for many different types of analysis in this notebook. In the following example, multiple values that match the region or city are aggregated and median calculations are created for them. A new DataFrame is created called df_comparison that will be used with the Plotly library.
median_prices = df.median()
marin_df = df[df["CountyName"] == "Marin"].median()
sf_df = df[df["City"] == "San Francisco"].median()
palo_alto = df[df["City"] == "Palo Alto"].median()
df_comparison = pd.concat([marin_df, sf_df, palo_alto, median_prices], axis=1)
df_comparison.columns = ["Marin County", "San Francisco", "Palo Alto", "Median USA"]
Interactive Data Visualization in Python
There are a couple of commonly used interactive data visualization libraries in Python: Plotly and Bokeh. In this chapter, Plotly will be used for data visualization, but Bokeh could also have produced similar plots. Plotly is maintained by a commercial company of the same name and can be used in offline mode or by exporting plots to the company's website. Plotly also has an open-source Python framework called Dash that can be used for building analytical web applications. Many of the plots in this chapter can be found here.
In this example, a library called Cufflinks is used to make it trivial to plot directly from a Pandas DataFrame to Plotly. Cufflinks is described as a “productivity tool” for Pandas. One of the major strengths of the library is the capability to plot as an almost native feature of Pandas.
import cufflinks as cf
cf.go_offline()
df_comparison.iplot(title="Bay Area Median Single Family Home Prices 1996-2017",
                    xTitle="Year",
                    yTitle="Sales Price",
                    #bestfit=True, bestfit_colors=["pink"],
                    #subplots=True,
                    #shape=(4, 1),
                    #subplot_titles=True,
                    fill=True)
Figure 10.1 shows a view of the plot without interactions turned on. Palo Alto looks like a truly scary place to be entering the housing market as a buyer.
Figure 10.1 Can Palo Alto Grow Exponentially Forever?
In Figure 10.2, the mouse is hovering over December 2009, and it shows a point near the bottom of the last housing crash, with the median housing price in Palo Alto at $1.2 million, the median in San Francisco around $750,000, and the median in the entire United States at $170,000.
Figure 10.2 Housing Market Bottom in December 2009
By scrolling through the graph, it can be seen that in December 2017, the price in Palo Alto was about $2.7 million, more than double in 8 years. On the other hand, the median home price in the rest of the United States has only appreciated about 5 percent. This is definitely worth exploring more.
Clustering on Size Rank and Price
To further explore what is going on, a k-means cluster 3D visualization can be created with both sklearn and Plotly. First, the data is scaled using the MinMaxScaler so outliers don’t skew the results of the clustering.
from sklearn.preprocessing import MinMaxScaler
columns_to_drop = [“RegionID”, “ZipCode”, “City”, “State”, “Metro”, “CountyName”]
df_numerical = df.dropna()
df_numerical = df_numerical.drop(columns_to_drop, axis=1)
Next, a quick description is done.
When the clustering is performed after dropping missing values, there are about 10,000 rows.
scaler = MinMaxScaler()
scaled_df = scaler.fit_transform(df_numerical)
kmeans = KMeans(n_clusters=3, random_state=0).fit(scaled_df)
print(len(kmeans.labels_))
10015
An appreciation ratio column is added, and the data is cleaned up before visualization.
cluster_df = df.copy(deep=True)
cluster_df.dropna(inplace=True)
cluster_df.describe()
cluster_df['cluster'] = kmeans.labels_
cluster_df['appreciation_ratio'] = round(cluster_df["2017-09"]/cluster_df["1996-04"], 2)
cluster_df['CityZipCodeAppRatio'] = cluster_df["City"].map(str) + "-" + cluster_df['ZipCode'] + "-" + cluster_df["appreciation_ratio"].map(str)
cluster_df.head()
Next, Plotly is used in offline mode (i.e., it doesn't get sent to the Plotly servers), and three axes are graphed: x is the appreciation ratio, y is the year 1996, and z is the year 2017. The clusters are shaded. In Figure 10.3, some patterns stick out instantly. Jersey City has appreciated the most in the last 30 years, going from a low of $142,000 to a high of $1.344 million, a 9x increase.
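The 3D scatter code itself is not reproduced in this excerpt. A minimal sketch of how a plot like Figure 10.3 could be produced with Plotly in offline mode is shown below; the column names follow the cluster_df built above, while the trace styling and figure title are assumptions.

import plotly.graph_objs as go
from plotly.offline import iplot

trace = go.Scatter3d(
    x=cluster_df["appreciation_ratio"],
    y=cluster_df["1996-04"],
    z=cluster_df["2017-09"],
    mode="markers",
    text=cluster_df["CityZipCodeAppRatio"],
    marker=dict(size=3, color=cluster_df["cluster"]))  #shade points by k-means cluster label
fig = go.Figure(data=[trace],
                layout=go.Layout(title="Appreciation ratio vs 1996 and 2017 values"))
iplot(fig)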
Figure 10.3 What the Heck Is Going on With Jersey City Appreciation?
Other things visible are a couple of zip codes in Palo Alto. They have also increased about 6 times in value, which is even more amazing considering how expensive the houses were to begin with. In the last 10 years, the rise of startups in Palo Alto, including Facebook, has caused a distorted elevation of pricing, even factoring in the entire Bay Area.
Another interesting visualization would be the appreciation ratio of these same columns, to see whether this trend in Palo Alto can be observed further. The code looks similar to the code for Figure 10.3.
from sklearn.neighbors import KNeighborsRegressor
neigh = KNeighborsRegressor(n_neighbors=2)
cleveland = df[df["City"] == "Cleveland"].median()
df_median_compare = pd.DataFrame()
df_median_compare["Cleveland_ratio_median"] = cleveland/df_comparison["Median USA"]
df_median_compare["San_Francisco_ratio_median"] = df_comparison["San Francisco"]/df_comparison["Median USA"]
df_median_compare["Palo_Alto_ratio_median"] = df_comparison["Palo Alto"]/df_comparison["Median USA"]
df_median_compare["Marin_County_ratio_median"] = df_comparison["Marin County"]/df_comparison["Median USA"]
df_median_compare.iplot(title="Ratio to National Median: Region Median Home Price to National Median Home Price Ratio 1996-2017",
    xTitle="Year",
    yTitle="Ratio to National Median",
    #bestfit=True, bestfit_colors=["pink"],
    #subplot_titles=True,
    fill=True)
In Figure 10.4, the median appreciation of Palo Alto looks exponential since the housing crash of 2008, yet the rest of the San Francisco Bay Area seems to have been less volatile. A reasonable hypothesis may be that there is a bubble inside the Bay Area, in Palo Alto, that may not be sustainable. Eventually, exponential growth comes to an end.
One more thing to look at would be the rent index, to see if there are further patterns to tease out.
Figure 10.4 Palo Alto Went From Having Home Prices 5 Times Higher Than National Median to 15 Times Higher in About 10 Years
The initial data import is cleaned up and the Metro column is renamed to be a City column.
df_rent = pd.read_csv("../data/")
median_prices_rent = df_rent.median()
df_rent[df_rent["CountyName"] == "Marin"].median()
df_rent.columns
df_rent.rename(columns={"Metro": "City"}, inplace=True)
Next, the medians are created in a new DataFrame.
marin_df = df_rent[df_rent["CountyName"] == "Marin"].median()
sf_df = df_rent[df_rent["City"] == "San Francisco"].median()
cleveland = df_rent[df_rent["City"] == "Cleveland"].median()
palo_alto = df_rent[df_rent["City"] == "Palo Alto"].median()
df_comparison_rent = pd.concat([marin_df, sf_df, palo_alto, cleveland, median_prices_rent], axis=1)
df_comparison_rent.columns = ["Marin County", "San Francisco", "Palo Alto", "Cleveland", "Median USA"]
Finally, Cufflinks is used again to plot the median rents.
df_comparison_rent.iplot(
    title="Median Monthly Rents Single Family Homes",
    yTitle="Monthly",
    fill=True)
In Figure 10.5, the trends look much less dramatic, partially because the data is spread over a shorter period of time, but this isn't the whole picture. Although Palo Alto isn't in this data set, the other cities in the San Francisco Bay Area look much closer to the median rents, whereas Cleveland, Ohio, appears to be about half of the median rent in the United States.
Figure 10.5 Rents in the San Francisco Bay Area since 2011 Have Almost Doubled While the Rest of the US Has Stayed Flat
One final analysis would be to look at a similar rent ratio across the United States. In this code, the rent ratio is created with a new empty DataFrame and then plotted with Plotly again.
df_median_rents_ratio = pd.DataFrame()
df_median_rents_ratio["Cleveland_ratio_median"] = df_comparison_rent["Cleveland"]/df_comparison_rent["Median USA"]
df_median_rents_ratio["San_Francisco_ratio_median"] = df_comparison_rent["San Francisco"]/df_comparison_rent["Median USA"]
df_median_rents_ratio["Palo_Alto_ratio_median"] = df_comparison_rent["Palo Alto"]/df_comparison_rent["Median USA"]
df_median_rents_ratio["Marin_County_ratio_median"] = df_comparison_rent["Marin County"]/df_comparison_rent["Median USA"]
df_median_rents_ratio.iplot(title="Median Monthly Rents Ratios Single Family Homes vs National Median",
    yTitle="Rent vs Median Rent USA",
    fill=True)
Figure 10.6 shows a different story from the appreciation ratio. In San Francisco, the median rent is still double the median of the rest of the United States, but nowhere near the 8x increase of the median home price. In looking at the rental data, it may pay to double-check before buying a home in 2018, especially in the Palo Alto area. Renting, even though it is high, may be a much better deal.
Figure 10.6 Monthly Rents in the San Francisco Bay Area Versus National Median Have Exploded
Summary
In this chapter, a data exploration was performed on a public Zillow data set. The Plotly library was used to create interactive data visualizations in Python. k-means clustering and 3D visualization were used to tease out more information from a relatively simple data set. The findings included the idea that there may have been a housing bubble in the San Francisco Bay Area, specifically Palo Alto, in 2017. It may pay to do further exploration on this data set by creating a classification model of when to sell and when to buy for each region of the United States.
Another future direction to take this sample project is to look at higher-level APIs like the ones that House Canary provides. It may be that your organization can build an AI application by using their prediction models as a base, and then layering other AI and ML techniques on top.
Editorial note: small changes have been implemented to increase readability online and reflect the pre-installed components in the complementary Domino project.
House hunting — the data scientist way | by Atma Mani | GeoAI
At some point in time, each of us will have gone through the process of either renting or buying a house. Whether we realize it or not, a lot of the factors we consider important are heavily influenced by location. In this article, we apply the data wrangling capabilities of the scientific Python ecosystem and the geospatial data visualization & analysis capabilities of the ArcGIS platform to build a model that will help shortlist good properties (houses). You may ask why do this, as there are a number of real estate websites that promise something similar. I hope by the end of this article, you will be able to answer that question. I recently presented this as a keynote speech at a GeoDev meetup hosted by Esri in Portland. The Python notebooks used for this blog can be found here: and the slides here:

Housing data for the city of Portland was collected from a popular real estate website. It came in a few CSV files of different sizes. Data was read using Pandas as DataFrame objects. These DataFrames form the bedrock of this study, upon which both spatial and attribute analysis are performed. The CSVs were merged to obtain an initial list of about 4200 properties listed for sale.

Missing value imputation
An initial and critical step in any data analysis and machine learning project is wrangling and cleaning the data. The data collected, in this case, suffers from duplicates, illegal characters in column names, and outliers. Pandas makes it extremely easy to sanitize tabular data. Different strategies were used to impute missing values. Centrality measures such as mean and median were used to impute missing values in the 'lot size', 'price per sqft.' and 'sq ft' columns, whereas frequency measures like mode were used for columns such as 'ZIP'. Rows that had missing values in critical columns such as 'beds', 'baths', 'price', 'year built', 'latitude' and 'longitude' were dropped, as there was no reliable way of salvaging them. After removing such records, there were 3652 properties available for analysis.

Removing outliers
Outliers in real estate data could be due to a number of reasons. For example, erroneous data formats, bad default values, typographical errors during data entry etc. can lead to outliers. Plotting the distribution of numeric columns can give a sense of outliers in the data set.

# explore distribution of numeric columns
ax_list = prop_df.hist(bins=25, layout=(4, 4), figsize=(15, 15))

Histograms of numeric columns show the presence of outliers

From the first two histograms, it appears as though all the houses have the same number of beds and baths, which is simply not true. This is a sign that a small number of high values (outliers) are skewing the distribution. A few different approaches exist to filter outliers. A popular technique is to use a 6 sigma filter, which removes values that are greater than 3 standard deviations from the mean. This filter assumes that the data follows a normal distribution and uses the mean as the measure of centrality. However, when data suffers heavily from outliers, as in this case, the mean can get distorted. An Inter Quartile Range (IQR) filter, which uses the median, a more robust measure of centrality, can filter out outliers that are at a set distance from the median in a more reliable fashion. After removing outliers using the IQR filter, the distribution of numeric columns looks much healthier.

Histogram of numeric columns after missing value imputation and removal of outliers

As seen so far, Pandas provides an efficient API to explore the statistical distribution of the numeric columns.
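The IQR filter itself is not reproduced in the post. A minimal sketch of one way it could be implemented with Pandas is shown below; the helper name, the 1.5×IQR cutoff, and the list of columns are assumptions for illustration.

def iqr_filter(df, column, k=1.5):
    #Keep only rows whose value lies within k * IQR of the first and third quartiles
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[(df[column] >= q1 - k * iqr) & (df[column] <= q3 + k * iqr)]

for col in ['BEDS', 'BATHS', 'PRICE', 'SQUARE FEET']:  #numeric columns of interest
    prop_df = iqr_filter(prop_df, col)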
To explore the spatial distribution of this data set, the ArcGIS API for Python is used.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from arcgis.gis import GIS
from arcgis.features import GeoAccessor, GeoSeriesAccessor

The GeoAccessor and GeoSeriesAccessor classes add spatial capabilities to Pandas DataFrame objects. Any regular DataFrame object with location columns can be transformed into a Spatially Enabled DataFrame using these classes.

>>> prop_sdf = GeoAccessor.from_xy(prop_df, 'LONGITUDE', 'LATITUDE')

Similar to plotting a statistical chart out of a DataFrame object, a spatial plot on an interactive map widget can be produced from a Spatially Enabled DataFrame.

Map of listings of properties for sale in the Portland market

Renderers such as heat maps can be applied to quickly visualize the density of the listings.

Plotting the Spatially Enabled DataFrame with a heatmap renderer shows the presence of hotspots near the downtown and Clover Hill neighborhoods

The ArcGIS API for Python comes with an assortment of sophisticated renderers that help visualize the spatial variation in columns such as 'property price', 'age', 'square footage', 'HoA' etc.

Spatial and statistical distributions of property price, square footage and monthly HoA

Combining maps with statistical plots as shown above yields deeper insights into the real estate market of interest. For instance, maps such as these help investigate general assumptions, such as that properties are higher priced, smaller in size and older in age around the downtown area, and change progressively as you move toward the suburbs.

The following rules based on the intrinsic features of the houses were used to build a shortlist.

>>> filtered_df = prop_sdf[(prop_df['BEDS']>=2) & (prop_df['BATHS']>1) & (prop_df['HOA PER MONTH']<=200) & (prop_df['YEAR BUILT']>=2000) & (prop_df['SQUARE FEET'] > 2000) & (prop_df['PRICE']<=700000)]
>>> filtered_df.shape
(331, 23)

This narrows down the list of eligible properties to 331, from an original 3624. When plotted on a map, these shortlisted properties are spread across the city.

Shortlisted properties are spread throughout the Portland market

From the histograms below, most houses in the shortlist have 4 beds, and the majority of them are skewed toward the upper end of the price spectrum.

Histograms of shortlisted properties

When buying a house, buyers look for proximity to facilities such as groceries, pharmacies, urgent care, parks etc. These comprise a house's location properties. The geocoding module of the ArcGIS API for Python can be used to search for such facilities within a specified distance around a house.

from arcgis.geocoding import geocode
# search for restaurants in neighborhood
restaurants = geocode('restaurant', max_locations=200)
# search for hospitals
hospitals = geocode('hospital', max_locations=50)

The map below displays facilities such as groceries, restaurants, hospitals, coffee shops, bars, gas stations, shops & services, travel & transport, parks and educational institutions that fall within a 5 mile buffer around a house.

The map shows some common facilities home buyers generally look for around a house. A house chosen at random is symbolized with a red star and the 5 mile buffer is symbolized with the black circle. The blue line represents the fastest route from the house to a designated destination (the Esri Portland R&D office in this case).

Another important aspect that buyers consider is the time it takes to commute to work / school. The network module of the ArcGIS API for Python provides tools to compute driving directions and duration based on historic traffic information.
For instance, the snippet below calculates the directions between the chosen house and the Esri Portland R&D office and the time it would take on a typical Monday morning at 8 am.

# route_layer is a network route layer set up earlier (not shown in this excerpt)
route_result = route_layer.solve(stops, return_routes=True, return_stops=True,
                                 return_directions=True,
                                 impedance_attribute_name='TravelTime',
                                 start_time=644511600000)
...
# route_length and route_duration are extracted from route_result (extraction not shown)
print("route length: {} miles, route duration: {}".format(round(route_length, 3), route_duration))
>>> route length: 10.273 miles, route duration: 27m, 48.39s

When routing, you can add multiple stops, such as daycare, the gym or other places you visit as part of your commute, and take those into account as well. This information can be turned into a Pandas DataFrame and visualized as a table or a bar chart. Thus, houses can be compared against one another based on access to neighborhood facilities.

Left: List of neighborhood facilities for a house chosen at random, as a table. Right: Number of such facilities under each category.

Feature engineering
The above steps were run in batch mode against each of the 331 shortlisted properties. Different neighborhood facilities were added as new columns to the data set, and the count of the number of facilities a property has access to (within a specified distance) was added as the column value. The idea is that if there are a lot of facilities of the same kind near a property, all of them compete for the same market. This competition keeps a check on prices and also improves the quality of service. Such houses are more attractive than the rest.

The histograms below show the distribution of facility counts around the 331 shortlisted properties.

Histograms of facilities around the shortlisted properties

Based on the histograms, there is a healthy distribution of coffee shops, gas stations, hospitals and general purpose shops around a large number of houses. Further, the Portland market appears to perform really well when it comes to commute duration & commute length (when computed to the Esri downtown office), number of schools, grocery shops, parks and restaurants, as most houses have a large number of options to choose from.

Together, features such as the number of beds, baths, square footage, price, age, etc. form the intrinsic features of a property. Through a series of spatial enrichment steps as seen above, location based attributes such as the number of restaurants, grocery stores, hospitals, etc. a property has access to were computed. These become a property's spatial features and were added to the original data set as part of the feature engineering process.

Scoring properties
Evaluating houses is a deeply personal process. Different buyers look for different characteristics in a house and not all aspects are considered equal. Thus, it is possible to assign different weights to each of the features and arrive at a weighted sum (a score) for each house. The higher the score, the more desirable a house is.

Below is a scoring function that reflects the relative importance of each feature in a house. Attributes that are considered desirable are weighed positively, while those undesirable are weighed negatively.
def set_scores(row):
    score = ((row['PRICE'] * -1.5) +            # penalize by 1.5 times
             (row['BEDS'] * 1) +
             (row['BATHS'] * 1) +
             (row['SQUARE FEET'] * 1) +
             (row['LOT SIZE'] * 1) +
             (row['YEAR BUILT'] * 1) +
             (row['HOA PER MONTH'] * -1) +      # penalize by 1 times
             (row['grocery_count'] * 1) +
             (row['restaurant_count'] * 1) +
             (row['hospitals_count'] * 1.5) +   # reward by 1.5 times
             (row['coffee_count'] * 1) +
             (row['bars_count'] * 1) +
             (row['shops_count'] * 1) +
             (row['travel_count'] * 1.5) +      # reward by 1.5 times
             (row['parks_count'] * 1) +
             (row['edu_count'] * 1) +
             (row['commute_length'] * -1) +     # penalize by 1 times
             (row['commute_duration'] * -2))    # penalize by 2 times
    return score

Scaling your data
While a scoring function can be extremely handy to compare feature engineered, shortlisted houses, when applied directly (without any scaling) it returns a set of scores that are heavily influenced by a small number of attributes whose values are numerically very large. For instance, attributes such as property price tend to be really large numbers (hundreds of thousands) compared to, say, number of beds (<10), and when used without scaling, price tends to dominate the scores beyond its allotted weight.

Effects of scaling: Property scores before and after scaling the numerical columns

The left side of the chart above shows scores computed without scaling. They appear extremely correlated with the property price variable. While property price is an important consideration for most buyers, it cannot be the only criterion that determines a property's score. To rectify this, all numerical columns were scaled to a uniform range of 0-1 using the MinMaxScaler function from the scikit-learn library, and a new set of scores was computed. The right side of the chart above shows the results of this scaling. The scores appear normally distributed and the scatter between property price and scores shows only a weak correlation. Thus, the scaling function has performed well and allows the scoring algorithm to effectively consider all the other attributes of a house in addition to property price.
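A minimal sketch of this scaling and scoring step is shown below; it assumes filtered_df holds the feature-engineered shortlist, set_scores is the function above, and the column list is abbreviated and illustrative rather than the author's actual code.

from sklearn.preprocessing import MinMaxScaler

numeric_cols = ['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'commute_duration']  # abbreviated, illustrative list
scaler = MinMaxScaler()
scaled_df = filtered_df.copy()
scaled_df[numeric_cols] = scaler.fit_transform(scaled_df[numeric_cols])  # rescale each column to the 0-1 range
scaled_df['score'] = scaled_df.apply(set_scores, axis=1)  # apply the scoring function row-wise
scaled_df['rank'] = scaled_df['score'].rank(ascending=False)  # rank 1 = highest score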
Once the properties were scored, they were sorted in descending order of their scores and assigned a rank. The house with the highest score gets rank 1 and so on. From here, the home buyer could pick the top 10 or 50 houses as a refined shortlist and perform field visits.

Top 50 houses. Lighter shades (yellow) represent lower ranks.

Interestingly, the top 50 houses are spread across the city of Portland without any signs of a strong spatial clustering. Property prices, on the other hand, appear in clusters. The histograms below show that most houses in the top 50 list have 2 baths, 4 beds (although the shortlist criterion was just a minimum of 2 beds), are priced on the upper end of the spectrum, are under 2500 sq ft. and were built fairly recently (2015). In terms of access to neighborhood facilities, most houses appear well connected, with a large number of grocery stores, restaurants, gas stations, educational institutions and hospitals to choose from. The majority of the properties are within 10 miles of the said destination (the Esri downtown office) and can be reached within 25 minutes of driving.

Histograms of intrinsic and spatial features of the top 50 houses

The scaling function ensured that there is no single feature that dominates a property's score beyond its allotted weight. However, there may be some features that tend to correlate with scores. To visualize this, the pairplot() function of the seaborn library is used to produce a scatter plot of each variable against one another, as shown in the matrix below. The diagonals of the matrix represent histograms of the corresponding variables.

Scatter plots of rank vs spatial features of houses

In the scatter grid above, we notice randomness and stratification in the rank variable. The only two places where the scatters show signs of correlation are, understandably, between commute_duration and commute_length. In other words, as the commute distance decreases, the commute duration also decreases, meaning traffic conditions don't affect duration as much as distance does.

The second scatter grid below is between rank and various intrinsic features of the properties. The scatter between rank and property price is quite random, meaning it is possible to buy a house with a higher rank for a lower than average price. The scatter between rank and square footage shows an interesting 'U' shape, meaning as property sizes increase, their rank gets better, but after a point, it gets worse.

Scatter plot of rank vs intrinsic features of houses

So far, the data set was feature engineered with intrinsic and spatial attributes. Weights for the different features were explicitly defined, using which properties were scored and ranked. In reality, a buyer's decision making process, although logical, is a little less calculated and a bit fuzzier. Buyers are likely to be content with certain shortcomings (for instance, fewer bedrooms than preferred or fewer shopping centers in the neighborhood) if they are highly impressed with some other characteristic (such as larger square footage to compensate for fewer bedrooms). Thus, we could get buyers to simply 'favorite' and 'blacklist' a set of houses and let a machine learning model infer their preferences.

Since it is difficult to collect such training data for a large number of properties, a mock data set was synthesized using the top 50 houses marked as favorites and the remaining 281 as blacklists. This was fed to a logistic regression model. As this model learns from the training data, it attempts to assign weights to each of the predictor variables (intrinsic and spatial features) and can predict whether or not a house will be preferred by a buyer. Thus, as newer properties hit the market, this model can predict whether or not a buyer would favorite a new property and present only such relevant results. Below is the accuracy of such a model that was run on this data set.

>>> classification_report(y_test, test_predictions, target_names=['blacklist', 'favorite'])

              precision    recall    f1-score
   blacklist       0.94      0.98        0.96
    favorite       0.88      0.71        0.79
     average       0.93      0.92

Precision refers to the model's ability to correctly identify whether a given property is a favorite or not. Recall, on the other hand, refers to its ability to identify all favorites in the test set. The f1-score computes the harmonic mean of precision and recall to provide a combined score of the model's performance.

The training data used in this case study is small by today's standards and is imbalanced, because there are fewer properties that are favorites compared to blacklists (50 vs 281). Yet, the model performs appreciably well, with high f1 scores for eliminating properties that are likely to be blacklisted.

The weights assigned by the regression model (code snippet on the right in the image below) show what the model learnt to be the relative importance of each feature based on the training data. When compared against the weights that were manually assigned (see the section "Scoring properties"), the logistic regression model has penalized property price, commute length and duration only mildly. It has weighed certain features such as lot size, number of grocery stores, shops, parks and educational institutions negatively and the rest positively. Features such as hospital counts, coffee shops, bars and gas stations are weighed higher than what was assigned manually.
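A minimal sketch of how such a favorite/blacklist classifier could be trained and evaluated with scikit-learn is shown below. It builds on the scaled_df and numeric_cols names from the earlier sketch, and the feature and label construction is an assumption based on the description above, not the author's actual notebook code.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# features are the scaled intrinsic and spatial columns; label is 1 for the top 50 (favorite), 0 otherwise
X = scaled_df[numeric_cols]
y = (scaled_df['rank'] <= 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
test_predictions = model.predict(X_test)

print(classification_report(y_test, test_predictions, target_names=['blacklist', 'favorite']))
print(model.coef_)  # learned weight for each feature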
The type of recommendation engine built in this study is called 'content based filtering', as it uses just the intrinsic and spatial features engineered for prediction. For this type of recommendation to work, we need a really large training set. In reality, nobody can generate such a large set manually. In practice however, another type of recommendation called 'community based filtering' is used. This type of recommendation engine uses the features engineered for the properties, combined with favorite / blacklist data, to find similarity between a large number of buyers. It then pools the training set from similar buyers to create a really large training set and learns on that.

In this case study, the input data set was spatially enriched with information about access to different facilities. This can be extended further by engineering socio-economic features such as age, income, education level, marital status, population density and a host of other parameters using the geoenrichment module of the ArcGIS API for Python. Another aspect that could be incorporated is to use authoritative data shared by local governments under the open data initiative. For instance, the city of Portland's open data site lists a host of useful spatial layers that can be used to further enrich this data set.

This study demonstrates how data science and machine learning can be applied to one aspect of the real estate industry. Buying a home is a personal process, however a lot of decisions are heavily influenced by location. As shown in this study, Python libraries such as Pandas can be used for visualization and statistical analysis, and libraries such as the ArcGIS API for Python for spatial analysis. The methods adopted in this study can be applied to any other real estate market to build a recommendation engine of your own.

The Python notebooks used for this blog can be found here: and the slides here:
Frequently Asked Questions about real estate analysis python
How is Python used in real estate?
When buying a house, buyers look for proximity to facilities such as groceries, pharmacies, urgent care, parks etc. These comprise a house’s location properties. The geocoding module of the ArcGIS API for Python can be used to search for such facilities within a specified distance around a house.
How do you analyze real estate data?
6 Key Steps to Real Estate Market Analysis:
1. Research neighborhood quality and amenities.
2. Obtain property value estimates for the area.
3. Select comparables for your real estate market analysis.
4. Calculate the average price of comparable listings.
5. Fine-tune your market analysis with adjustments to your comparables.
…
How do I scrape data from Zillow with Python?
The whole scraping process contains the following steps:
1. Conduct a search on Zillow by inserting the postal code.
2. Download HTML code through Python Requests.
3. Parse the page through LXML.
4. Export the extracted data to a CSV file.
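A minimal sketch of those steps is shown below; the Zillow search URL pattern, request headers, and XPath selectors here are hypothetical, and Zillow's actual markup and anti-scraping measures change frequently.

import csv
import requests
from lxml import html

postal_code = "97201"
url = f"https://www.zillow.com/homes/{postal_code}_rb/"  # hypothetical search URL pattern
headers = {"User-Agent": "Mozilla/5.0"}  # many sites block requests without a browser-like agent

page = requests.get(url, headers=headers)
tree = html.fromstring(page.content)

# hypothetical XPath selectors; inspect the live page to find the real ones
addresses = tree.xpath('//address/text()')
prices = tree.xpath('//span[@data-test="property-card-price"]/text()')

with open("zillow_listings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["address", "price"])
    writer.writerows(zip(addresses, prices))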