Nov 26, 2022

Dear friends!

Welcome to AP-Airbnb#1, our first Applied Analytics Project! 👋James Kaleo Kistner, 👋Sindhu Iyengar and 👋my humble self will walk you through our first Applied Project together, with the goal of showing you an example of how you can apply the descriptive, predictive, visualization, and other analytics tools you have acquired so far. Today’s spotlight is on Airbnb, so let’s get started!

A large share of the growth of today’s sharing economy is driven by peer-to-peer (P2P) accommodation providers such as Airbnb: an innovative platform for travelers to find a unique, cheap, and convenient place to stay, and a popular opportunity for many to earn passive income in the short-term rental (STR) space. But it seems that the home vacation rental industry has been heading in a different direction lately. Hosting is becoming more entangled with social and institutional frameworks, with professionally developed, institutionally funded, and purpose-designed Airbnbs taking a bigger market share. Additionally, the low mortgage rates (of the past) and the rise of remote-work culture initiated a structural market shift that resulted in more leveraged buy-ins from retail investors and in Airbnb’s big summer release this year.

Just when hosts thought they had learned how the post-Covid19 world works, the environment changed again. Mortgage rates are rising, real estate markets are shaking, and another possible recession is knocking at our door. That was the inspiration for us to launch a project focusing on the pricing factors in the STR market.

1) OBJECTIVE

The workhorse of the real estate industry, often seen in (explanatory) econometric or financial asset pricing models and in (predictive) machine learning models, is the hedonic pricing model: a multivariate framework that treats the value of a property as a weighted sum of its features (e.g. location, number of rooms, amenities, parking, pool). In our case, the hedonic approach provides an elegant framework for assessing the daily rental price of our Airbnb units, where the model measures how much each of the diverse features that guests are willing to pay for contributes to the price.
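To make the ‘weighted sum of features’ idea a bit more tangible, here is a minimal 🖥️Python sketch. The feature names, the weights, and the base rate are hypothetical and purely illustrative; they are not estimates from our dataset.

```python
# Hedonic pricing sketch: the daily rate is modeled as a base rate plus a
# weighted sum of listing features. All names and numbers are hypothetical.

def hedonic_price(features, weights, base_rate):
    """Return base_rate + sum(weight_k * feature_k) over the listing's features."""
    return base_rate + sum(weights[name] * value for name, value in features.items())

# Hypothetical implicit prices (USD per unit of each feature).
weights = {"bedrooms": 40.0, "bathrooms": 25.0, "has_pool": 30.0}

# A hypothetical listing: two bedrooms, one bathroom, with a pool.
listing = {"bedrooms": 2, "bathrooms": 1, "has_pool": 1}

estimated_rate = hedonic_price(listing, weights, base_rate=50.0)
print(estimated_rate)
```

In a real hedonic model the weights are not chosen by hand but estimated from the data, for example with a linear regression.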

The first part of our series will primarily focus on stating our initial hypothesis, data sourcing, cleaning, and exploratory data analysis. The insights gained from part one will be the basis for our predictive modelling part in which we will elaborate on the use of machine learning models to understand the pricing structure of the STR market and provide first actionable insights before we continue expanding into text analysis in part three.

2) HYPOTHESIS

The target variable (y) of this project is the ‘daily rate’, which will be studied based on cross-sectional datasets collected in November 2022. Following several whiteboard sessions and context research, we narrowed our analysis down to the following main factors and hypotheses, which we will evaluate in this series:

📂LOCATION: Location, Location, Location. It’s the number one rule of real estate investing — we expect ‘location’ to be a robust pricing feature but we are also interested in how cities or possible clusters impact other features in our dataset or our pricing model directly.

📂PHYSICAL FEATURES:

— Space: We hypothesize that the capacity of an Airbnb has a significant impact on pricing: the more guests, the higher the rate can be. The question is how these pricing patterns behave when the property size increases. Moreover, does the optimal pricing level per guest vary across different factors? Recent trends show that people started appreciating larger spaces during Covid19, which brings us to the question of whether we can find good ways to assess the available space (since Airbnb does not provide square footage).

— Type: Room types such as ‘private room’ and ‘entire home’ are most likely priced differently, primarily as a consequence of space. The degree of privacy a place has to offer is assumed to be a key factor in the rates as well.

— Amenities: Remote work and the growing investor base in the Airbnb landscape have had an impact on the amenities that are expected at a property. We will take a broader look at the impact of the variables we are able to extract, but will specifically focus on popular ones such as internet, streaming, and pet allowance.

📂QUALITY: Airbnb is in direct competition with hotels. Hence, guests are most likely unwilling to accept low-quality Airbnbs if they can stay at a hotel for just a few dollars more. Airbnbs aren’t categorized in star-levels like hotels but the website provides us with a few features to assess the quality of a place, such as ratings, the review comments, and superhost status. However, we must anticipate sample size issues and survivorship bias. Airbnb is pretty persistent when it comes to making sure you review your last stay but still not everyone does.

📂FEES: The latest discussions in Airbnb communities and on social media have primarily centered around cleaning fees. Guests are complaining about excessive fees or additional cleaning chores at checkout. We hypothesize that the cleaning fees (or chores) are starting to impact the ability of the host to charge the same rates.

📂DEMAND: Next to the rental price, the occupancy rate is the second main component of the revenue equation for the host (price * bookings = revenue). The room rate might reflect the ‘fair value’ considering the sum of all features, but if there are few or no bookings the host will realize little or no revenue. An economic recession, excessive supply, or insufficient visibility may force a host to lower price levels. We assume that the relationship between price levels and demand is positive but likely inconsistent due to its temporal characteristics and the challenge of retrieving reliable and complete data.

3) DATA SOURCING

The first roadblock we hit was the issue of data collection. Airbnb decided in 2017 to limit access to its API to prospective business partners. Thanks for making our life difficult, Airbnb! Secondary external sources like Kaggle or opendatasoft provided valuable datasets (especially time series) but didn’t match our needs in terms of property-specific data. Extracting the data directly from the website remained the best option. Initially, we harvested a small sample manually to develop a better understanding of the technical, functional, and visual components of the website and to perform a preliminary data analysis. Since that turned out to be promising, we developed a web scraper in 🖥️Python (Selenium library) and deployed it to collect data at scale.

4) DATA OVERVIEW

We identified the following elements as quantifiable features, which we extracted and will evaluate in the following sections.

5) DATA VALIDATION AND CLEANING

To mitigate any model defects and “garbage in = garbage out” scenarios, we implemented data validation rules within our web scraping process and preprocessing code in 🖥️R. Here are a few validation rules the team decided to implement:

data types (e.g. integer, string, floats)
range (e.g. number of bedrooms or fees, negative values)
consistent expressions (e.g. city name, amenities)
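We implemented these rules in 🖥️R; purely for illustration, here is a small 🖥️Python sketch of how the three rule types could be checked on a single scraped record. The field names (`bedrooms`, `cleaning_fee`, `city`) and the list of accepted city spellings are hypothetical placeholders, not our actual schema.

```python
# Validation sketch for one scraped record. Field names and the set of
# accepted city spellings are hypothetical placeholders.

KNOWN_CITIES = {"Los Angeles", "San Francisco", "San Diego"}

def validate_listing(record):
    """Return a list of rule violations; an empty list means the record passed."""
    errors = []
    bedrooms = record.get("bedrooms")
    fee = record.get("cleaning_fee")
    city = record.get("city")

    # Rule 1 (data type) and rule 2 (range) for the bedroom count.
    if not isinstance(bedrooms, int):
        errors.append("bedrooms: expected integer")
    elif bedrooms < 0:
        errors.append("bedrooms: negative value")

    # Rule 1 (data type) and rule 2 (range) for the cleaning fee.
    if not isinstance(fee, (int, float)):
        errors.append("cleaning_fee: expected number")
    elif fee < 0:
        errors.append("cleaning_fee: negative value")

    # Rule 3 (consistent expressions): only accept known city spellings.
    if city not in KNOWN_CITIES:
        errors.append("city: unknown spelling")

    return errors

print(validate_listing({"bedrooms": "studio", "cleaning_fee": -5, "city": "LA"}))
```

Returning all violations at once, instead of stopping at the first one, makes it easier to see which rules fail most often across a whole scrape.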

Additionally, multiple data cleaning scripts were implemented to deduplicate (remove redundancy), split, and transform the data, as well as tools to identify and remove statistical outliers early on. For instance, several units had ‘studio’ listed as the number of bedrooms instead of 0 or showed negative cleaning or service fees, and listings with ‘shared’ features lacked information or turned out to be misleading. We also decided to drop multiple amenities due to small sample size, grouped the 292 amenities into 71 categories since several are very similar (e.g. ‘Pantene shampoo’ vs. ‘Puracy shampoo’), and clustered the 212 cities into 12 areas based on distance. At this stage, it is imperative that you review the dataset with a team member or even the whole team to discuss possible issues that occurred in the data collection or cleaning process and to evaluate the implications of the assumptions made, in order to prevent misleading results.
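A few of these cleaning steps can be sketched in 🖥️Python as follows. This is only an illustration under stated assumptions: the record layout (`id`, `bedrooms`, `cleaning_fee`) is hypothetical, and dropping records with negative fees is one possible treatment, not necessarily the one we applied in every case.

```python
# Cleaning sketch: deduplicate, map 'studio' to 0 bedrooms, drop negative
# fees, and group brand-specific amenities. The record layout is hypothetical.

def clean_listings(rows):
    """Return cleaned copies of the rows, keeping the first record per id."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["id"] in seen_ids:
            continue  # duplicate listing, already processed
        seen_ids.add(row["id"])

        row = dict(row)  # work on a copy so the input stays untouched
        if str(row["bedrooms"]).strip().lower() == "studio":
            row["bedrooms"] = 0
        else:
            row["bedrooms"] = int(row["bedrooms"])

        if row["cleaning_fee"] < 0:
            continue  # assumption: negative fees are scraping errors
        cleaned.append(row)
    return cleaned

def amenity_category(name):
    """Collapse brand-specific labels such as 'Pantene shampoo' into one category."""
    lowered = name.strip().lower()
    return "shampoo" if "shampoo" in lowered else lowered

raw = [
    {"id": 1, "bedrooms": "studio", "cleaning_fee": 50},
    {"id": 1, "bedrooms": "studio", "cleaning_fee": 50},  # duplicate
    {"id": 2, "bedrooms": "2", "cleaning_fee": -10},      # negative fee
    {"id": 3, "bedrooms": 3, "cleaning_fee": 0},
]
cleaned = clean_listings(raw)
print([(row["id"], row["bedrooms"]) for row in cleaned])
```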

6) Exploratory Data Analysis (EDA)

Exploratory data analysis is an approach to understand, analyze, and summarize data. It will help us to detect outliers and anomalies, uncover underlying structures and patterns, extract insights and important features, and, most importantly, test our hypotheses and assumptions. The available tools range from simple univariate methods that study each variable individually to more complex and demanding multivariate methods that take existing relationships between variables into consideration.

6.1) DESCRIPTIVE AND UNIVARIATE ANALYSIS (UVA)

We will begin our quantitative analysis by generating a comprehensive set of charts to study the univariate characteristics of our variables. The most commonly used univariate methods are summaries, composition/distribution plots, quantile mapping, and tools designed to evaluate normality characteristics. Our main analytics tool for this section is 🖥️Tableau, but 🖥️R and 🖥️Excel jumped in occasionally to support our data processing.

We collected 3619 Airbnb listings across California, USA. We focused on popular and fairly established places along the coast primarily around Los Angeles and San Francisco.

🎯DailyRates (our y-variable): The daily rates are the primary income component for our Airbnb hosts. The prices are collected in US dollars, before any discounts, fees, or taxes.

As shown in the figure above, the variable is widely spread out and strongly right-skewed. The median is close to $140/night for a place, but prices can go up to four digits.
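If you want to check such a right skew numerically rather than visually, comparing the mean with the median and computing a skewness coefficient is a quick option. The following 🖥️Python sketch uses a handful of hypothetical nightly rates, not our actual data.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Hypothetical nightly rates in USD with one four-digit listing.
rates = [90, 110, 120, 140, 150, 180, 250, 1200]

print("median:", statistics.median(rates))
print("mean:", statistics.fmean(rates))
print("skewness:", round(skewness(rates), 2))
```

A mean well above the median together with a positive skewness coefficient is the typical signature of a right-skewed price distribution.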

1️⃣Location: California, with a population of 39.5M, represents 14.6% of the total US economy, which makes it one of the largest economies in the world. The Golden State is one of the most visited US states (260M visitors annually) and is known for its great variety of regional landscapes, climate zones, and flora and fauna. Its economic status and diversity make it a very interesting but also heterogeneous short-term rental market for us to analyze.

As you can see in the map above, our dataset spans from San Diego to Sacramento, approx. 500 miles, which covers the majority of the populated areas (light green areas).

2️⃣Physical features: Here, we focus on the physical features and facilities of the property itself, such as the capacity listed, the number of beds/bedrooms/bathrooms, the amenities, and the room type.

As expected, the space-related variables tend to be very right-skewed since most Airbnb hosts offer a part of their house or smaller investment properties for rent (mostly apartments). Surprisingly, 74% of our dataset is listed as “entire home”, which is a bit higher than expected but still within a reasonable range compared to other research in this domain (see AirDNA). The increased number could be the consequence of the paradigm shift we all experienced during Covid19 or the result of a selection bias in our web scraper. The ‘shared room’ type represented 0.7% of our dataset and was removed from this study due to sample size issues.

3️⃣Quality: The rating system on Airbnb allows guests to rate their host after check-out on a scale from 1 (worst) to 5 (best) in terms of overall experience, cleanliness, accuracy, value, communication, check-in, and location. Once a host reaches three reviews, the primary score is shown near the title in search results and on the listing itself (see Airbnb) — which allows us to collect the data. Superhosts must have a 4.8 or higher average overall rating in the past year in order to maintain their status (see Airbnb). Furthermore, sources claim that if a host’s account average drops below a 4.7 rating, Airbnb begins penalizing hosts (see article) so not much room for low performers here!

It is no surprise that the median of each rating category is close to 4.9 considering Airbnb’s policies and framework. Except for the ‘value’ and ‘cleanliness’ ratings, most categories have 87% of their scores between 4.8 and 5.0.

4️⃣Fees: Most Airbnb listings charge their guests a cleaning fee of around $60–100. The data is fairly normally distributed around this range. More details can be found here 🔗cleaningFees.

5️⃣Demand: Due to its temporal characteristics, demand is a tricky feature. Our approach was to extract the available booking days for the next 30 days and the number of reviews over the past year as a proxy for demand at this point in time.

The vacancy data shows an almost uniform distribution, which contradicts our expectation since we study the STR market in popular areas. More details can be found here 🔗NumReviews.

6.2) MULTIVARIATE ANALYSIS (MVA)

Multivariate analysis is a statistical procedure that seeks to identify dependencies and patterns among variables, reduce and simplify the structure of complex data, and sort or group trends. While UVA provides us individual pieces of information about the variables in our Airbnb dataset, performing a multivariate analysis helps us determine how each variable relates to the daily prices and identify relationships among our input variables (multicollinearity). In order to identify the relationships among our variables we will start by creating correlation matrices to quantify the direction of the linear relationship among our numerical variables. Non-linear patterns and categorical variables will be examined through various visual analytic approaches.

Correlation Matrix (significance level: 5%)

As expected, the space- and rating-related variables displayed very high and statistically significant correlation coefficients, which supported our initial decision to merge them into the clusters. The high correlation between daily rates and cleaning fees (0.7) also doesn’t come as a surprise since larger spaces usually lead to higher prices and also increase the cleaning cost for the host. The correlation vector among the ratings indicates, however, that the daily rate is not affected by ‘Quality’, which was unexpected. Are there non-linear single or multiple trends? Are the variables strongly linearly correlated, or are they spread loosely in certain clusters, e.g. location-related ones? Can we discern patterns when we integrate non-numerical variables? Are there statistical outliers in the multivariate space that influence our results and need to be removed? We will discuss multiple visual approaches in the next section.
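For readers who want to reproduce a single cell of such a correlation matrix by hand, here is a minimal 🖥️Python sketch of the Pearson coefficient together with the t statistic commonly used to test it against zero. The two short lists are hypothetical values, not taken from our dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for H0 'no linear correlation', with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical daily rates and cleaning fees in USD.
daily_rates = [100, 150, 200, 260, 400]
cleaning_fees = [50, 60, 90, 85, 150]

r = pearson_r(daily_rates, cleaning_fees)
t = t_statistic(r, len(daily_rates))
print(f"r = {r:.2f}, t = {t:.2f}")
```

With a sample of several thousand listings, an absolute t value above roughly 1.96 corresponds to significance at the 5% level (two-sided); for tiny samples like this toy example, the critical value of the t distribution is larger.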

1️⃣Location: As can be seen from the boxplot below, the daily price of Airbnb listings varies significantly among the 12 clusters we created. Clusters that are located inland, such as Fresno or Sacramento, show a 60% lower standard deviation than their coastal peers. Prices in these areas might therefore be easier for our model to predict. More details and additional analysis can be found here 🔗mapCityPrices.

2️⃣Physical features:

— Space: The scatter plots (🔗priceSpace.linear) confirm what we already know from the correlation matrix. The attempt to find non-linear patterns using several trend tools failed: not a single one showed a significant improvement over the linear model (🔗priceSpace.nonlinear). We hypothesized that different pricing patterns might be observable when the property size increases. When calculating the price per guest, a non-linear pattern can be detected in the figure below.

The sweet spot for Airbnbs, in terms of price per guest, seems to be at a capacity of around two guests; beyond that, the price per guest drops by 21% to a level of $45 and keeps declining as the capacity increases. A similar pattern can be found when looking at the ratio between beds and bedrooms (🔗priceBedratio). We argue that properties that feature more beds than bedrooms are most likely less spacious and therefore can’t justify the same price level as their peers with more bedrooms.
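The ‘price per guest’ view is simple to compute: divide each daily rate by the listed capacity and summarize per capacity level. Here is a small 🖥️Python sketch with hypothetical (daily rate, capacity) pairs; we use the median per group, which is an assumption of this example.

```python
from collections import defaultdict
import statistics

def median_price_per_guest(listings):
    """Map each capacity to the median of daily_rate / capacity."""
    by_capacity = defaultdict(list)
    for daily_rate, capacity in listings:
        by_capacity[capacity].append(daily_rate / capacity)
    return {cap: statistics.median(vals) for cap, vals in sorted(by_capacity.items())}

# Hypothetical (daily rate in USD, capacity in guests) pairs.
listings = [(110, 2), (120, 2), (150, 4), (170, 4), (240, 8)]

per_guest = median_price_per_guest(listings)
for capacity, price in per_guest.items():
    print(f"{capacity} guests: ${price:.2f} per guest")
```

In this toy example the total rate rises with capacity while the price per guest falls, which is the kind of pattern described above.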

— Type: Breaking these concepts further down by including the room types provided more insights. The optimal pricing level per guest seems to vary across room type and location (🔗priceSpaceLocation).

On a ‘per guest’ level, ‘Entire homes’ charge the highest prices at a capacity of 2–4 guests, while ‘Hotel rooms’ and ‘Private rooms’ reach their highs at a smaller capacity of 1–2 guests. That is not exactly a surprise considering the business model behind these room types: Airbnbs with room offers like these are tailored to solo or business travelers. More details and additional analysis can be found here (🔗priceSpaceRoom).

— Amenities: The amenities dataset was a complete mess. About 75% of the data was incomplete, not available, or corrupted. The remaining dataset is large enough for analysis but should still be taken with a grain of salt due to the heterogeneous characteristic of the dataset.

Most properties that listed amenities were able to charge a higher median price than properties without. Our interest, however, was primarily in the availability of internet and streaming services and in whether pets were allowed. Even though these factors can be deemed important (🔗priceAmenitiesSelect), the top five were the availability of a dishwasher, balcony, coffee maker, fireplace, and BBQ. We assume the presence of multicollinearity with other factors like Airbnb categories (e.g. mansion vs. townhouse) or location but lack the data to investigate further.

3️⃣Quality: The dots in the graph below reflect the ‘Overall Rating’ submitted by past guests, plotted against the price level of the respective Airbnb. Unsurprisingly, it shows a consistent pattern across all rating categories (🔗priceRatingAll). Higher-rated Airbnbs tend to charge higher prices. The question, of course, is the causality behind this statement. Can a host charge higher prices if ratings reach higher levels? We noticed across the board that ‘Superhosts’, the ones that are probably most committed to their Airbnb investment, have higher ratings (as dictated by Airbnb) but stay at lower price levels. An interesting insight if you assume that their high-quality experience should allow them to increase profit margins.

4️⃣Fees: Visualizing the relationship between cleaning fees and daily rates confirms the moderately strong positive relationship but also reveals a few interesting nuggets.

The scatterplot indicates that we may face heteroscedasticity issues when we begin building linear regression models in our next part. It seems that Airbnbs beyond a price point of $130 are less consistent when it comes to pricing their cleaning fee. Capacity, as a proxy for space, doesn’t seem to be the only driver behind cleaning fees (🔗cleaningCapacity). Several hosts tend to overcharge or set their cleaning fee below market average (even zero). This is in line with recent guest complaints about hosts charging too much (we saw dozens of hosts charging a $600 cleaning fee for places with $480 daily rates) or requesting extra cleaning chores from their guests at check-out.
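A crude way to put a number on this impression is to fit one straight line through all points and compare how far the points scatter around it below and above the $130 price point. The following 🖥️Python sketch does exactly that on hypothetical values; it is an informal diagnostic, not a formal statistical test and not the method we used for the chart.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x

def residual_spread_ratio(x, y, threshold):
    """Mean squared residual above the threshold divided by the one at or below it."""
    slope, intercept = fit_line(x, y)
    low, high = [], []
    for a, b in zip(x, y):
        squared_residual = (b - (intercept + slope * a)) ** 2
        (high if a > threshold else low).append(squared_residual)
    return statistics.fmean(high) / statistics.fmean(low)

# Hypothetical daily rates and cleaning fees in USD.
daily_rates = [60, 80, 100, 120, 160, 200, 300, 400]
cleaning_fees = [30, 40, 50, 60, 20, 150, 80, 300]

ratio = residual_spread_ratio(daily_rates, cleaning_fees, threshold=130)
print(f"spread ratio (above / below $130): {ratio:.1f}")
```

A ratio well above 1 means the cleaning fees scatter more widely around the fitted line in the higher price range, which is what heteroscedasticity looks like in practice.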

5️⃣Demand: Unfortunately, the data we collected on the demand side turned out to be less useful than anticipated. A shame considering that adding ‘vacancyRates’ to our data collection algorithms increased the runtime by a factor of 4.

From the above bar chart, we can infer that Airbnbs in a lower price range (<$200) constitute a majority (75%) of the dataset and that they tend to be more booked out than their higher-priced peers (🔗priceVacancy). But other than that, neither ‘vacancyRates’ (🔗priceVacancyLinear) nor ‘numReviews’ (🔗priceNumReviewsLinear), a popular proxy for demand in peer articles, seems to have an impact on pricing.

7) Insights

Altogether, our exploratory data analysis and visualization section revealed several interesting insights into California’s short-term rental market using an Airbnb dataset. At this point, we can already make a few inferences about the factors and hypotheses we formulated at the beginning of the project:

📌Location matters. Prices for Airbnbs closer to the coast are on average 10% higher, but you also find very popular tourist destinations like Napa or Palm Springs that can charge twice the market rate. Location also seems to impact other features like capacity. Properties in LA or Santa Barbara tend to have their price/guest maximum at a higher capacity level than those in Fresno or Sacramento.

📌Space has a positive impact on pricing. The more space, the higher the total rental price of the property. However, strategies that focus on smaller capacities of 2–4 guests can reach higher efficiency levels in terms of rental rate maximization. Hotel rooms and private rooms tend to maximize around 1–2 guests/listing. Median price levels for spacious properties are generally higher. For example, Airbnbs with a 1:1 bed-bedroom ratio are priced 20–25% higher than Airbnbs with 3:1 ratios. We found that the amenities dishwasher, balcony, coffee maker, fireplace, BBQ, and pet allowance have a notable positive impact on pricing.

📌The quality of the experience, measured by multiple rating categories and superhost status, does not seem to be a major pricing factor. Ratings are heavily left-skewed and show only little dispersion. Airbnb’s strict rating framework makes it impossible to compare it to the ‘stars’ system we know from the hotel industry. Interestingly, superhosts tend to operate more in the lower end of the price spectrum we see in California’s short-term rental market ($50–150/night).

📌The cleaning fee, the controversial second income stream for hosts, shows consistent patterns in the lower price range but then disperses beyond the $130 rental rate. Beyond that point, several hosts seem to abandon the relative pricing approach and start pricing higher (extra income) or lower (cleaning chores for guests?) than the market overall.

📌Demand: The big letdown of this study. The proxies we chose, number of reviews and future 30-day bookings, didn’t seem to impact the pricing levels in a linear or non-linear fashion. We believe the economic reasoning behind this variable and our hypothesis is sound and encourage future research to consider alternative measures.

8) Limitations

⚠️Historical data could not be included in this study due to limited data access. Hence, this analytics study can’t account for seasonality or compare the current economic regime to past regimes like the pre-Covid19 period.

⚠️In its Summer release 2022, Airbnb introduced 55 new categories that organize homes based on their style, location, or proximity to a travel activity (e.g. Lakefront, Treehouse, or Tinyhome). Unfortunately, the multiple web scraper libraries we tested were not able to pick up on this new feature, which we believe may have a significant impact on pricing and should be included in future research.

⚠️We assume that our random sample of Airbnb properties is representative of the whole population and that the short-term rental market in California can be described as at least semi-efficient.

9) Final notes and next steps

The first important phase is completed. We formulated clear objectives and hypotheses, delegated responsibilities among the team members, and collected, cleaned, and structured our datasets. The exploratory data analysis allowed us to evaluate most of our hypotheses and revealed additional insights. For example, we could consider partitioning the dataset into different subsets that fit the different linear trends we discovered in the analysis of the location and space variables.

However, locating and quantifying patterns in a multivariate dataset via visualization techniques is very challenging and fails to reveal and illustrate linear relationships intuitively, especially when more than three variables are involved. The application and selection of predictive analytics models is an essential next step to improve our understanding of multivariate characteristics, attributes and dependencies.

In the next part, we will build machine learning models to find the most important features that affect the pricing structure of the Airbnbs in the dataset and provide the tools to estimate daily rates of Airbnbs.