
Airbnb NYC Dataset - Demand Proxy Rationale

Ross Bulat
Full Stack Engineer

The accompanying notebook is available on Google Colab.

The Airbnb NYC 2019 dataset is useful for exploring listing-level patterns, but it has one important limitation: it does not contain direct demand measures. There are no actual booking counts, occupancy rates, revenue figures, or guest ratings that can be used as a clean target variable.

Because of this, the notebook tests whether reviews_per_month can act as a practical proxy for guest demand. The aim is not to claim that reviews are the same as bookings, but to check whether review activity behaves like a meaningful demand-related signal that can support later modelling work.

Why a proxy was needed

A machine learning model needs a target variable. In this case, the business question is whether Airbnb listing features can help identify listings that appear to attract stronger guest activity. However, the dataset only provides indirect indicators of activity, such as:

  • number_of_reviews
  • last_review
  • reviews_per_month
  • availability_365

Of these, reviews_per_month is the most useful starting point because it expresses review activity as a monthly rate rather than a cumulative count. Listings with a missing reviews_per_month value were assigned zero, which keeps the proxy defined across the full dataset.

Creating the review-activity proxy

The notebook creates a new binary flag called review_activity_proxy.

This is done by:

  1. Filling missing reviews_per_month values with zero.
  2. Calculating the 75th percentile of the filled review-rate distribution.
  3. Flagging listings at or above that threshold as proxy-positive.

In simple terms, the proxy identifies listings with relatively high review activity compared with the rest of the dataset. This creates a reproducible success flag that can later be used as a classification target.
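The three steps above can be sketched in pandas as follows. This is a minimal illustration rather than the notebook's exact code: the helper name add_review_activity_proxy, the intermediate column reviews_per_month_filled, and the small sample frame are assumptions made for the example. Only reviews_per_month and review_activity_proxy come from the text.

```python
import numpy as np
import pandas as pd


def add_review_activity_proxy(df: pd.DataFrame, quantile: float = 0.75):
    """Return a copy of df with a binary review_activity_proxy column,
    plus the threshold that was used to create it."""
    out = df.copy()

    # Step 1: treat a missing review rate as zero review activity.
    out["reviews_per_month_filled"] = out["reviews_per_month"].fillna(0)

    # Step 2: take the 75th percentile of the filled distribution.
    threshold = out["reviews_per_month_filled"].quantile(quantile)

    # Step 3: flag listings at or above the threshold as proxy-positive.
    out["review_activity_proxy"] = (
        out["reviews_per_month_filled"] >= threshold
    ).astype(int)

    return out, threshold


# Hypothetical sample standing in for the real listings table.
listings = pd.DataFrame(
    {"reviews_per_month": [np.nan, 0.2, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0]}
)
flagged, threshold = add_review_activity_proxy(listings)
print(threshold)
print(flagged)
```

Returning the threshold alongside the flagged frame makes it easy to report the cut-off and to reuse the same value if the flag has to be recreated later.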

This is still only a proxy. A listing may receive more reviews because it has more bookings, but reviews are also affected by guest behaviour, host practices, listing age, and other factors. For that reason, the notebook treats the proxy as a preliminary demand-related signal rather than a direct measure of true demand.

Checking whether the proxy makes business sense

The next step is to test whether the proxy behaves in a plausible way. A useful proxy should not look random or disconnected from the rest of the dataset.

The notebook checks this in two main ways.

First, it plots listings spatially using longitude and latitude. This shows whether proxy-positive listings appear within recognisable Airbnb market areas across New York City rather than being distributed without structure.
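A numeric companion to the spatial plot is to bin listings into a coarse latitude/longitude grid and compare proxy-positive rates per cell. The sketch below is an assumption about how such a check could look, not the notebook's plotting code: the function name proxy_rate_by_grid, the 0.01-degree cell size, and the sample coordinates are all hypothetical.

```python
import numpy as np
import pandas as pd


def proxy_rate_by_grid(df: pd.DataFrame, cell_size: float = 0.01) -> pd.DataFrame:
    """Summarise the proxy-positive rate for each lat/lon grid cell."""
    cells = df.assign(
        # Snap each coordinate to an integer grid-cell index.
        lat_cell=np.floor(df["latitude"] / cell_size).astype(int),
        lon_cell=np.floor(df["longitude"] / cell_size).astype(int),
    )
    return (
        cells.groupby(["lat_cell", "lon_cell"])["review_activity_proxy"]
        .agg(listings="count", proxy_rate="mean")
        .reset_index()
    )


# Hypothetical listings: three in one cell, two in another.
sample = pd.DataFrame(
    {
        "latitude": [40.715, 40.716, 40.718, 40.655, 40.656],
        "longitude": [-73.995, -73.994, -73.996, -73.955, -73.954],
        "review_activity_proxy": [1, 1, 0, 0, 0],
    }
)
grid = proxy_rate_by_grid(sample)
print(grid)
```

If the proxy were spatially unstructured, the per-cell rates would sit close to the overall rate everywhere. Clear differences between cells with enough listings would suggest that the flag follows location.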

Second, it explores listing-title length. The notebook cleans the name field, calculates title length, groups titles into bands, and compares proxy-positive rates across those bands. This helps test whether the proxy relates to listing presentation as well as location.
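The title-length check can be sketched along these lines. The cleaning rule (fill missing names, strip surrounding whitespace), the band edges, and the helper name proxy_rate_by_title_band are assumptions for illustration. The text does not specify how the notebook cleans the name field or where it places the band boundaries.

```python
import pandas as pd

# Hypothetical band edges, in characters, with matching labels.
TITLE_BINS = [0, 20, 40, 60, float("inf")]
TITLE_LABELS = ["0-20", "21-40", "41-60", "61+"]


def proxy_rate_by_title_band(df: pd.DataFrame) -> pd.DataFrame:
    """Compare proxy-positive rates across title-length bands."""
    # Clean the name field: missing titles become empty strings,
    # and surrounding whitespace is removed before measuring length.
    titles = df["name"].fillna("").astype(str).str.strip()
    title_length = titles.str.len()

    # Group lengths into bands; include_lowest keeps zero-length titles.
    bands = pd.cut(
        title_length, bins=TITLE_BINS, labels=TITLE_LABELS, include_lowest=True
    )

    return (
        df.assign(title_length=title_length, title_band=bands)
        .groupby("title_band", observed=True)["review_activity_proxy"]
        .agg(listings="count", proxy_rate="mean")
    )


# Hypothetical listings with a mix of short, missing, and long titles.
sample = pd.DataFrame(
    {
        "name": [
            "Cozy room",
            None,
            "Bright and spacious loft near Central Park",
            "  Sunny studio  ",
            "Entire two-bedroom apartment with private garden and rooftop terrace views",
        ],
        "review_activity_proxy": [0, 0, 1, 1, 1],
    }
)
band_summary = proxy_rate_by_title_band(sample)
print(band_summary)
```

Reporting the listing count next to each rate matters here, because a band with only a handful of listings can show an extreme rate by chance.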

Together, these checks support the idea that the review-activity proxy captures more than a purely mechanical review count. It appears to relate to interpretable Airbnb listing features, including spatial location and title characteristics.

Why this matters for later modelling

This validation step is important because the later modelling task depends on the credibility of the target variable. If the target were arbitrary, then even a technically strong model would have limited business value.

By showing that the proxy is:

  • derived from an existing numeric field,
  • consistently reproducible,
  • linked to spatial listing patterns,
  • related to title-length patterns, and
  • interpretable within the Airbnb business context,

the notebook provides a stronger rationale for using reviews_per_month as a preliminary demand-related success flag.

Key takeaway

The notebook does not prove true Airbnb demand, because the dataset does not contain confirmed bookings or occupancy data. However, it does show that reviews_per_month can be used carefully as a preliminary demand proxy.

The proxy is measurable, reproducible, and connected to meaningful listing characteristics. For an exploratory machine learning project, that makes it a reasonable foundation for modelling stronger guest activity in the Airbnb NYC dataset, provided the limitation is clearly stated.