Subway Strategies
April 2018
For the first project of my Metis data science bootcamp, I worked with colleagues Jonathan Sterling and Prakash Verma to parse New York City subway turnstile data and build an outreach strategy for a fictional nonprofit client. This project used pandas, matplotlib, and Tableau to turn around a solid product on a very tight (i.e., four-day) project timeline.
Motivation
This project was unique in that it was defined by the Metis instructors and was our only group project, so I was pleased it was a topic I genuinely cared about: working toward greater diversity in tech. Our assignment was to develop a strategy for the fictional new nonprofit WomenTechWomenYes to deploy their “street teams” (you know, those people with clipboards?) to gather email addresses and invite people to a free gala in the summer. Our goal was to maximize street team efficacy by using publicly available data to optimize placement at NYC subway entries and exits.
The Dataset
We downloaded flat CSV files from the Metropolitan Transportation Authority. They were…messy. Each file covered one week of subway entries and exits, recorded as periodic readings of the cumulative counter on each individual turnstile at a station. We combined several metadata columns to generate a unique identifier for each turnstile, then took the difference between consecutive readings to calculate the flow of people through that turnstile. Some data had to be thrown out (e.g., if the counter reset to zero or started going in reverse). The remaining counts were summed into daily totals at the station level.
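The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not our actual project code: the column names (C/A, UNIT, SCP, STATION, DATE, TIME, ENTRIES) are assumed to match the public turnstile file layout, the toy readings are made up, and the max_delta cutoff is a hypothetical threshold for implausibly large jumps.

```python
import pandas as pd

# Toy readings (made up). Column names are assumed to follow the MTA file layout.
readings = pd.DataFrame({
    "C/A": ["A002"] * 8,
    "UNIT": ["R051"] * 8,
    "SCP": ["02-00-00"] * 4 + ["02-00-01"] * 4,
    "STATION": ["59 ST"] * 8,
    "DATE": ["04/01/2018"] * 3 + ["04/02/2018"] + ["04/01/2018"] * 4,
    "TIME": ["00:00:00", "04:00:00", "08:00:00", "00:00:00",
             "00:00:00", "04:00:00", "08:00:00", "12:00:00"],
    # Second turnstile runs in reverse, then resets to zero.
    "ENTRIES": [100, 150, 230, 300, 5000, 4990, 0, 40],
})


def daily_station_entries(readings, max_delta=10_000):
    """Turn cumulative counter readings into daily entry totals per station."""
    df = readings.copy()

    # Combine metadata columns into one identifier per physical turnstile.
    df["turnstile_id"] = (
        df["C/A"] + "_" + df["UNIT"] + "_" + df["SCP"] + "_" + df["STATION"]
    )
    df["timestamp"] = pd.to_datetime(
        df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M:%S"
    )
    df = df.sort_values(["turnstile_id", "timestamp"])

    # Difference between consecutive readings of the same turnstile.
    df["delta"] = df.groupby("turnstile_id")["ENTRIES"].diff()

    # Drop first readings (NaN), reversed or reset counters (negative deltas),
    # and jumps above the assumed max_delta cutoff.
    valid = df[(df["delta"] >= 0) & (df["delta"] <= max_delta)]

    # Sum the remaining counts per station per day.
    daily = valid.groupby(["STATION", "DATE"], as_index=False)["delta"].sum()
    return daily.rename(columns={"delta": "daily_entries"})


daily = daily_station_entries(readings)
print(daily)
```

Each delta is attributed to the date of the later of the two readings, which is a simplification for readings that straddle midnight.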
Tools and Analysis
This project came after just four days of bootcamp instruction and relied on descriptive statistics rather than more advanced machine learning models. The experience deepened our skills in two essential Python libraries: pandas to build and manipulate tables of information, and matplotlib to build graphs and charts.
Our analysis showed that the 30 busiest subway stations in New York City accounted for one-third of all rides in the system, so we focused exclusively on those stations in our recommendations.
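A share like this can be computed from station-level totals in a couple of lines of pandas. The sketch below assumes a Series of total rides indexed by station; the function name and the numbers are invented for illustration and are not our project data.

```python
import pandas as pd


def top_station_share(station_totals, n):
    """Fraction of all rides that pass through the n busiest stations."""
    top = station_totals.nlargest(n)
    return top.sum() / station_totals.sum()


# Made-up totals for five hypothetical stations.
totals = pd.Series({"A": 500, "B": 300, "C": 100, "D": 60, "E": 40})
share = top_station_share(totals, 2)
print(f"Top 2 stations carry {share:.0%} of rides")
```

With the real data, the same call with n=30 would be run on the summed totals for every station in the system.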
Recommendations
We recommend that WomenTechWomenYes street teams be placed at a dozen stations grouped by three strategies:
VOLUME:
Target high-traffic locations
Centrally located, heavily trafficked transit hubs. Street teams will meet both New Yorkers and visitors, contributing to the goal of increasing awareness and possibly yielding connections with local residents.
34th St - Penn Station
Grand Central - 42nd St
34th St - Herald Square
Times Square - 42nd St
INTEREST:
Target tech hubs
Find tech workers with a personal connection to the organization’s mission. Canvass tech hubs in Manhattan and Brooklyn, especially at lunchtime or during commuting hours.
23rd St
14th St - Union Square
14th St
Jay St - MetroTech
ENGAGEMENT:
Target residential areas
Go to busy stations in residential areas with higher median age and income, where getting an email address might turn into a donation, gala attendance, or deeper engagement with the organization.
86th St
59th St - Columbus Circle
59th St
72nd St
Visualization
An interactive Tableau dashboard is available with the data for all 30 of the busiest stations, including the 12 we ultimately recommended. WomenTechWomenYes may find that other stations on the map are also suited to their needs, now or in the future.
The color overlay in the map incorporates census data for median age in a neighborhood. Median age varied more visibly across census tracts than median income did. This was because incomes in New York are high compared to most of the country (compensating for a higher cost of living), and the pre-set census income ranges were too low to provide useful context.
Next Steps
We would like to build on this work by deepening our analysis in a few key ways.
Hourly Analysis: Provide recommendations based on station traffic by time of day.
Neighborhood Data: Consider tourism, demographics, and charitable giving.
Evaluation: Track WTWY results by station, including weather data as a control.