Subway Strategies

April 2018

For the first project of my Metis data science bootcamp, I worked with colleagues Jonathan Sterling and Prakash Verma to parse New York City subway turnstile data and build an outreach strategy for a fictional nonprofit client. This project used pandas, matplotlib, and Tableau to turn around a solid product on a very tight (i.e., four-day) project timeline.


Motivation

This project was unique in that it was defined by the Metis instructors and was our only group project, so I was pleased it was a topic I genuinely cared about: working toward greater diversity in tech. Our assignment was to develop a strategy for the fictional new nonprofit WomenTechWomenYes to deploy their “street teams” (you know, those people with clipboards?) to gather email addresses and invite people to a free gala in the summer. Our goal was to maximize street team efficacy by using publicly available data to optimize placement at NYC subway entries and exits.


The Dataset

We downloaded flat CSV files from the Metropolitan Transportation Authority (MTA). They were…messy. Each file covered one week of subway entries and exits, recorded as periodic readings of the cumulative counter on each individual turnstile at a station. We combined several metadata columns into a unique identifier for each turnstile, then took the difference between consecutive readings to calculate the flow of people through that turnstile. Some data had to be thrown out (e.g., when a counter reset to zero or started running in reverse). The remaining counts were summed into daily totals at the station level.
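
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not our original notebook code: the column names follow the layout of the public turnstile files as I remember them, the sample values are invented, and the max_delta cutoff and the daily_station_entries helper are hypothetical names chosen for this example.

```python
import pandas as pd

# Tiny made-up sample shaped like the MTA turnstile files.
# Column names are assumed; the readings are invented for illustration.
raw = pd.DataFrame({
    "C/A":     ["A002"] * 4 + ["R051"] * 3,
    "UNIT":    ["R051"] * 4 + ["R170"] * 3,
    "SCP":     ["02-00-00"] * 4 + ["00-00-01"] * 3,
    "STATION": ["59 ST"] * 4 + ["14 ST-UNION SQ"] * 3,
    "DATE":    ["04/07/2018", "04/07/2018", "04/07/2018", "04/08/2018",
                "04/07/2018", "04/07/2018", "04/07/2018"],
    "TIME":    ["00:00:00", "04:00:00", "08:00:00", "00:00:00",
                "00:00:00", "04:00:00", "08:00:00"],
    # Cumulative counter readings; the second turnstile resets to zero.
    "ENTRIES": [1000, 1020, 1100, 1400, 500, 0, 40],
})


def daily_station_entries(df, max_delta=10_000):
    """Sum per-turnstile counter deltas into daily station totals.

    max_delta is an assumed sanity cap on riders between two readings.
    """
    df = df.copy()
    # Combine metadata columns into one identifier per physical turnstile.
    df["TURNSTILE"] = (
        df["C/A"] + "_" + df["UNIT"] + "_" + df["SCP"] + "_" + df["STATION"]
    )
    df["DATETIME"] = pd.to_datetime(
        df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M:%S"
    )
    df = df.sort_values(["TURNSTILE", "DATETIME"])
    # Difference between consecutive readings of the same turnstile.
    df["DELTA"] = df.groupby("TURNSTILE")["ENTRIES"].diff()
    # Drop first readings (NaN), resets/reversals (negative), and wild jumps.
    valid = df[(df["DELTA"] >= 0) & (df["DELTA"] <= max_delta)]
    # Each delta is credited to the date of the later reading.
    return (
        valid.groupby(["STATION", "DATE"])["DELTA"]
        .sum()
        .reset_index(name="DAILY_ENTRIES")
    )


daily = daily_station_entries(raw)
print(daily)
```

Crediting each delta to the date of the later reading is a simplification; a midnight reading really reflects traffic from the previous evening, so a more careful version would shift those counts back a day.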


Tools and Analysis

This project came after just four days of bootcamp instruction and relied on descriptive statistics rather than more advanced machine learning models. The experience deepened our skills in two essential Python libraries: pandas to build and manipulate tables of information, and matplotlib to build graphs and charts.

Our analysis showed that the top 30 busiest subway stations in New York City accounted for one-third of all rides in the system, so we focused exclusively on those stations in our recommendations.
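
A share-of-total figure like this one takes only a few lines of pandas. The sketch below uses invented station totals and a hypothetical helper name, top_n_share, so it shows the shape of the calculation rather than our actual numbers.

```python
import pandas as pd


def top_n_share(station_totals, n=30):
    """Return the fraction of all rides captured by the n busiest stations."""
    ranked = station_totals.sort_values(ascending=False)
    return ranked.head(n).sum() / ranked.sum()


# Invented totals for five stations, for illustration only.
totals = pd.Series({"A": 500, "B": 300, "C": 100, "D": 60, "E": 40})

# Share of rides handled by the two busiest stations in the toy data.
share = top_n_share(totals, n=2)
print(f"{share:.0%}")
```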


Recommendations

We recommend that WomenTechWomenYes street teams be placed at a dozen stations grouped by three strategies:


VOLUME:
Target high-traffic locations

Centrally located, heavily trafficked transit hubs. Street teams will meet both New Yorkers and visitors, contributing to the goal of increasing awareness and possibly yielding connections with local residents.

  • 34th St - Penn Station

  • Grand Central - 42nd St

  • 34th St - Herald Square

  • Times Square - 42nd St


INTEREST:
Target tech hubs

Find tech workers with a personal connection to the organization’s mission. Canvass tech hubs in Manhattan and Brooklyn, especially at lunchtime or during commuting hours.

  • 23rd St

  • 14th St - Union Square

  • 14th St

  • Jay St - MetroTech


ENGAGEMENT:
Target residential areas

Go to busy stations in residential areas with higher median age and income, where getting an email address might turn into a donation, gala attendance, or deeper engagement with the organization.

  • 86th St

  • 59th St - Columbus Circle

  • 59th St

  • 72nd St



Visualization

An interactive Tableau dashboard is available with the data for all 30 of the busiest stations, including the 12 we ultimately recommended. WomenTechWomenYes may find that other stations on the map are also suited to their needs, now or in the future.

The color overlay on the map incorporates census data for median age in each neighborhood. Median age varied more across census tracts than median income did. This was because incomes in New York are high compared to most of the country (compensating for a higher cost of living), and the pre-set census income ranges were too low to provide useful context.



Next Steps

We would like to build on this work by deepening our analysis in a few key ways:

  • Hourly Analysis: Provide recommendations based on station traffic by time of day.

  • Neighborhood Data: Consider tourism, demographics, and charitable giving.

  • Evaluation: Track WTWY results by station, including weather data as a control.