Subways for Coffee (Part 1 of 2)

The New York City subway. Within its tangled maze of concrete and metal, there are bound to be droves of John and Jane Smiths thirsting for a brilliant cup of joe. Where are they, though? We have this perfectly aromatic blend of bold, earthy coffee beans and rich, creamy chocolate going undiscovered. If they can’t find us, we’ll need to find them.

Thankfully, the MTA (Metropolitan Transportation Authority) provides turnstile data that might help us do just that.

The Hunt for John and Jane

In order to find our friends, we’ll need to know when and where they’ll be looking for coffee. Hmmmm. Maybe if we merge MTA turnstile data with US Census demographic data, we can identify high-probability areas where J&J might be located.

Sounds like a plan (-_-)b

  1. Identify peak coffee shop hours in NYC
    1. Google Places’ Popular Times
  2. (Pass 1) Score best MTA stations during those hours based on the following:
    1. Most exit traffic (exit = person going to street level = person able to see our coffee shop)
    2. Most sustained, long-term growth of exits
  3. (Pass 2) Rescore chosen stations based on the following:
    1. Coffee demographic presence as identified by US Census data
    2. Pre-existing competition in area

Let’s take it one step at a time. In this post we’ll work up to and through pass 1.

Times?

Google Places can help us with this. It tracks Popular Times throughout each day of the week at various coffee shops. Unfortunately, there is no API access to this information at the moment, so we’ll have to take a look by hand.

[Figure: Google Places Popular Times chart for a Tuesday]

Judging from dozens of coffee shops throughout Manhattan, peak hours are consistently 11AM - 3PM across days of the week and locations. We can use these hours to narrow our search.
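
Since the readings have to be collected by hand anyway, it helps to jot them down somewhere structured. Here’s a minimal sketch of that idea; the shop names and busyness scores below are hypothetical, not real Popular Times data:

    import pandas as pd
    
    # Hypothetical hand-recorded Popular Times busyness scores (0-100),
    # one column per coffee shop, indexed by hour of day (9AM - 3PM)
    popular_times = pd.DataFrame({
        'shop_a': [35, 60, 80, 90, 85, 70, 50],
        'shop_b': [40, 55, 75, 85, 90, 80, 45],
        'shop_c': [30, 50, 70, 95, 90, 75, 55],
    }, index=range(9, 16))
    
    # Average across shops and rank hours by busyness
    avg_busyness = popular_times.mean(axis=1).sort_values(ascending=False)
    print(avg_busyness.head(5))  # the busiest hours cluster around 11AM - 3PM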

Where?

When John and Jane want coffee, they exit the subway and come up to street level, right? Let’s find the stations with the most exits between 11AM - 3PM, then.

  1. Download turnstile data from MTA:

    wget -A txt -m -p -E -k -K -np http://web.mta.info/developers/turnstile.html
    

    I arranged the downloaded data into annual folders afterwards: turnstile/2016, turnstile/2015, etc. We’ll be focusing on 2015 and 2016.
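
    If you’d rather script the download and the folder layout, here’s a rough sketch. It assumes the weekly files are published each Saturday and follow the turnstile_YYMMDD.txt naming pattern; verify both against turnstile.html before relying on it:

    import os
    import urllib.request
    import pandas as pd
    
    BASE = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    
    for year in (2015, 2016):
        os.makedirs("turnstile/{}".format(year), exist_ok=True)
        # Assumption: one file per week, dated each Saturday
        for date in pd.date_range("{}-01-01".format(year), "{}-12-31".format(year),
                                  freq="W-SAT"):
            stamp = date.strftime("%y%m%d")
            urllib.request.urlretrieve(BASE.format(stamp),
                                       "turnstile/{}/turnstile_{}.txt".format(year, stamp))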

  2. Import data into pandas

    import pandas as pd
    import numpy as np
    import glob
    import seaborn as sns
    
    # Import all turnstile files into one big data frame
    # (sorted so the weekly files concatenate in chronological order)
    all_files = sorted(glob.glob("turnstile/2016/*.txt"))
    df = pd.concat((pd.read_csv(f) for f in all_files))
    
    # Reset the indexes so they are all unique
    df.reset_index(drop=True, inplace=True)
    df.head()
    

    [Output: df.head(), first rows of the combined 2016 turnstile data frame]
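
    A full year of weekly files can get memory-hungry. If that bites, one option is to load only the columns this analysis uses; the callable usecols below (supported in newer pandas) also tolerates stray whitespace in the raw headers, which is the reason for the strip in the cleaning step:

    # Keep only the columns used later; strip() handles padded headers
    wanted = {'C/A', 'UNIT', 'SCP', 'STATION', 'DATE', 'TIME', 'EXITS'}
    df = pd.concat((pd.read_csv(f, usecols=lambda c: c.strip() in wanted)
                    for f in all_files))
    df.reset_index(drop=True, inplace=True)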

  3. Data cleaning

    # Remove whitespace from column names
    df.columns = [column.strip() for column in df.columns]
    
    # Remove any duplicate rows (note: drop_duplicates returns a new
    # frame, so the result has to be assigned back)
    prev_size = len(df)
    df = df.drop_duplicates(subset=['C/A', 'UNIT', 'SCP', 'STATION', 'DATE', 'TIME'])
    prev_size, len(df)
    
    # EXITS is a cumulative counter, so per-period exits are the difference
    # between consecutive readings. Diff within each physical turnstile
    # (C/A + UNIT + SCP) so we never subtract across unrelated devices.
    df['FOUR_HOURLY_EXITS'] = df.groupby(['C/A', 'UNIT', 'SCP'])['EXITS'].diff()
    
    # Filter bad values (NaN, absurdly high, or negative counter resets)
    # off of 'FOUR_HOURLY_EXITS'
    clean_df = df[(df['FOUR_HOURLY_EXITS'] >= 0) & (df['FOUR_HOURLY_EXITS'] < 100000)]
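
    To see why the filter matters: a counter reset (or turnstile swap) shows up as a huge negative diff that would poison the totals. A toy illustration with made-up counter readings:

    # Hypothetical cumulative EXITS readings with a reset before the 4th one
    toy = pd.Series([1000, 1250, 1600, 40, 300])
    print(toy.diff().tolist())  # [nan, 250.0, 350.0, -1560.0, 260.0]
    # The NaN and the -1560 are exactly what the >= 0 filter above discards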
    
  4. Filtering to 11AM - 3PM

    # MTA data comes in four-hour slots, so the 15:00:00 reading covers
    # all exits between 11:00 and 15:00
    df_2016 = (clean_df[clean_df['TIME'] == '15:00:00']
               .groupby(['C/A', 'UNIT', 'STATION', 'SCP', 'TIME'])['FOUR_HOURLY_EXITS']
               .agg(['sum', 'mean']))
    df_2016.head()
    

    [Output: df_2016.head(), summed and averaged 11AM - 3PM exits per turnstile]
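
    One caveat worth checking: not every unit reports on the exact 15:00:00 schedule, so filtering on a single timestamp quietly drops off-schedule devices. A quick gauge of how much coverage the filter keeps:

    # How many readings land on each reporting time?
    print(clean_df['TIME'].value_counts().head(10))
    print("15:00:00 share:", (clean_df['TIME'] == '15:00:00').mean())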

  5. Aggregate data into Control Area / Station combos

    # Control Area + Station combos = specific street-level exits.
    # df_2016 is indexed by the groupby above, so reset_index() first
    df_agg = (df_2016.reset_index()
              .groupby(['C/A', 'STATION', 'TIME'], as_index=False)[['sum', 'mean']]
              .sum()
              .sort_values('sum', ascending=False))
    df_agg.head()
    

    [Output: df_agg.head(), top Control Area + Station combos by total exits]

  6. Alright. Now let’s finally see where J&J might be at 11AM - 3PM.

    # Build labels from the top 20 rows only, so labels and bars line up
    top_20 = df_agg.head(20)
    y_tick_labels = [station + ' (' + ca + ')'
                     for station, ca in zip(top_20['STATION'], top_20['C/A'])]
    
    most_trafficked = sns.barplot(y=top_20['C/A'], x=top_20['sum'])
    most_trafficked.set(ylabel='Stations (C/A)', xlabel='Total Annual Exits',
                        title='Most Exited Station C/As in 2016 During 11AM - 3PM')
    most_trafficked.set(yticklabels=y_tick_labels)
    for item in most_trafficked.get_xticklabels():
        item.set_rotation(90)
    

    [Figure: bar chart of the 20 most exited station C/As in 2016 during 11AM - 3PM]

  7. Let’s take it a step further now and find areas that J&J might be frequenting more and more these days during 11AM - 3PM. We’ll first repeat steps 1-6 with 2015 data; one way to wrap that up is sketched below.
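
    Here’s a hedged sketch of steps 2-4 rolled into a single function for the 2015 pass. It assumes the 2015 files live in turnstile/2015 and share the 2016 file format:

    # Steps 2-4 (import, clean, filter to the 15:00:00 slot) as one function
    def load_exits(year):
        files = sorted(glob.glob("turnstile/{}/*.txt".format(year)))
        d = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
        d.columns = [c.strip() for c in d.columns]
        d = d.drop_duplicates(subset=['C/A', 'UNIT', 'SCP', 'STATION', 'DATE', 'TIME'])
        d['FOUR_HOURLY_EXITS'] = d.groupby(['C/A', 'UNIT', 'SCP'])['EXITS'].diff()
        d = d[(d['FOUR_HOURLY_EXITS'] >= 0) & (d['FOUR_HOURLY_EXITS'] < 100000)]
        return (d[d['TIME'] == '15:00:00']
                .groupby(['C/A', 'UNIT', 'STATION', 'SCP', 'TIME'])['FOUR_HOURLY_EXITS']
                .agg(['sum', 'mean']))
    
    df_2015 = load_exits(2015)

    With df_2015 in hand, we can compare the two years: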

    # Aggregate 2015 and 2016 into Control Area + Station combos
    # (reset_index() first, as in step 5)
    df_2015_street = (df_2015.reset_index()
                      .groupby(['C/A', 'STATION', 'TIME'], as_index=False)[['sum', 'mean']]
                      .sum()
                      .sort_values('sum', ascending=False))
    df_2016_street = (df_2016.reset_index()
                      .groupby(['C/A', 'STATION', 'TIME'], as_index=False)[['sum', 'mean']]
                      .sum()
                      .sort_values('sum', ascending=False))
    
    # Compare all of 2015 traffic vs all of 2016 traffic
    df_trend_2015_2016 = pd.merge(df_2015_street, df_2016_street,
                                  on=['C/A', 'STATION', 'TIME'])
    
    # Subtract total 2015 traffic (_x) from total 2016 traffic (_y)
    # to get a growth value
    df_trend_2015_2016['growth_sum'] = df_trend_2015_2016['sum_y'] - df_trend_2015_2016['sum_x']
    df_trend_2015_2016_sorted = df_trend_2015_2016.sort_values('growth_sum', ascending=False)
    df_trend_2015_2016_sorted.head()
    
    # Plot the top 20, building labels from the same 20 rows
    top_20 = df_trend_2015_2016_sorted.head(20)
    y_tick_labels = [station + ' (' + ca + ')'
                     for station, ca in zip(top_20['STATION'], top_20['C/A'])]
    
    most_growth = sns.barplot(y=top_20['C/A'], x=top_20['growth_sum'])
    most_growth.set(ylabel='Stations (C/A)', xlabel='Total Exit Growth',
                    title='Station C/As with Most Exit Growth, 2015-2016, During 11AM - 3PM')
    most_growth.set(yticklabels=y_tick_labels)
    for item in most_growth.get_xticklabels():
        item.set_rotation(90)
    

    [Figure: bar chart of the 20 station C/As with the most exit growth from 2015 to 2016 during 11AM - 3PM]

  8. Fantastic. Now let’s combine both of those lists and find locations that perform well in both categories. I’m sure at least one of those places will lead us to J&J.

    # Top-100 lists for each metric (both frames are already sorted)
    df_2016_top_100 = df_2016_street.head(100)
    df_trend_2015_2016_top_100 = df_trend_2015_2016_sorted.head(100)
    
    # Keep the C/A + Station combos that appear in both top-100 lists
    mta_recommendations = []
    for idx in range(100):
        for idx2 in range(100):
            if df_trend_2015_2016_top_100.iloc[idx2]['C/A'] == df_2016_top_100.iloc[idx]['C/A'] and \
               df_trend_2015_2016_top_100.iloc[idx2]['STATION'] == df_2016_top_100.iloc[idx]['STATION']:
                mta_recommendations.append([df_trend_2015_2016_top_100.iloc[idx2]['C/A'],
                                            df_trend_2015_2016_top_100.iloc[idx2]['STATION']])
    
    # Unique stations worth examining, preserving rank order
    mta_rec_stations = []
    for ca, station in mta_recommendations:
        if station not in mta_rec_stations:
            mta_rec_stations.append(station)
    
    mta_rec_stations
    
    ['86 ST', '34 ST-HERALD SQ', 'GRAND ST', 'CHAMBERS ST', '125 ST', 'JAMAICA CENTER', '42 ST-BRYANT PK', '82 ST-JACKSON H', '59 ST', '34 ST-PENN STA', 'JUNCTION BLVD', 'SPRING ST', 'BOWLING GREEN', 'JAY ST-METROTEC', '50 ST', 'FULTON ST', 'CROWN HTS-UTICA', 'CANAL ST', '34 ST-HUDSON YD', '14 ST', '23 ST', '72 ST', 'WORLD TRADE CTR', '167 ST']
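
    For reference, the same intersection can be written without the nested loops as a single pandas merge:

    # Inner-join the two top-100 lists on C/A + STATION, then dedupe
    # stations while preserving rank order (dicts keep insertion order
    # on Python 3.7+)
    overlap = pd.merge(df_2016_street.head(100),
                       df_trend_2015_2016_sorted.head(100),
                       on=['C/A', 'STATION'])
    mta_rec_stations = list(dict.fromkeys(overlap['STATION']))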
    

Down to 24 stations! We’re off to a great start in finding J&J. In the next post, we’ll narrow this list down further with US Census demographic data.
