by Luke Shulman

When is Data Big?

This week the health world turns to Orlando for the annual conference of the Healthcare Information and Management Systems Society (HIMSS). It is absolutely one of the biggest events in health IT and one of the top 25 trade shows in the entire country.

How many people attend HIMSS each year? We could search one of the hundreds of press releases about the event from exhibitors. We could even visit the link above from Trade Show News.

Big Data Approach

How about we measure how many people fly to Orlando during the quarter when HIMSS takes place? The Bureau of Transportation Statistics releases quarterly data files drawn from a 10% sample of all airline itineraries sold each quarter by the major US carriers (those with more than 5% market share).

For this analysis, we downloaded data for Q1 tickets from 2014 through 2018. HIMSS was last held in Orlando in Q1 2017 and Q1 2015, though it's possible the convention center has booked other events that, like HIMSS, rotate through cities every few years.

US airports see about 2.6 million passengers per day, so even a 10% sample of tickets for one quarter is a massive file, well beyond the size of the datasets we see in most contexts.

Here are some basic stats on just one of the files:

import subprocess

lines_in_file = subprocess.run(["wc", "-l", "data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv"], stdout=subprocess.PIPE)
print("Lines in the file {0}".format(lines_in_file.stdout

Lines in the file b'5866624 data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv\n'

That’s over 5 million tickets to analyze.

The first step in processing data for our visualizations is typically to load it into a pandas DataFrame, where we can manipulate and prepare it before visualizing. The flexibility and extensibility of pandas also make it extremely memory intensive: one of the original authors of the library recommends having 5 to 10 times as much RAM as the size of your dataset. For one quarter of the airline data (1.4GB), that means 10GB or more of RAM, more than my laptop has and more than I want to spend on an AWS server for a blog post.
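As a rough sanity check, we can read just a sample of rows with pandas and extrapolate to the full file. This is only a sketch: it estimates the raw DataFrame size, not the 5-10x working memory mentioned above, and the row count comes from the wc -l output earlier.

import pandas as pd

# Read a small sample, measure its in-memory size, and extrapolate to all ~5.9M rows.
sample = pd.read_csv(
    "data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv",
    nrows=100_000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
print("Estimated in-memory size of one quarter: {0:.1f} GB".format(bytes_per_row * 5_866_623 / 1e9))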

There are other approaches we could take. We could certainly use a database to serialize the data on disk and interact with it through queries, but it would be easier to keep our Python, pandas, and Altair workflow.
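For illustration only, here is a minimal sketch of what that database route might look like, assuming SQLite and a hypothetical flights.db file (not what we actually do in this post):

import sqlite3
import pandas as pd

# Load the CSV into SQLite in chunks so it never has to fit in RAM all at once...
conn = sqlite3.connect("flights.db")
for chunk in pd.read_csv(
        "data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv",
        chunksize=500_000):
    chunk.to_sql("tickets", conn, if_exists="append", index=False)

# ...then pull back only the small aggregate we care about via SQL.
mco_by_origin = pd.read_sql(
    "SELECT Origin, SUM(Passengers) AS Passengers FROM tickets WHERE Dest = 'MCO' GROUP BY Origin",
    conn)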

Enter Dask, a project developed by Anaconda that exposes the same API as pandas but underpins it with an architecture that parallelizes operations and leaves the data on disk until it is called for. Dask's power is immense, allowing a dataframe to span an entire cluster. But locally, it's also an easy way to keep the same Python data science workflow while the data waits on disk until it is needed, or in our case, aggregated for our analysis.
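To make the "lazy" part concrete, here is a tiny sketch using a throwaway frame (not the flight data): operations only build a task graph, and nothing is calculated until .compute() is called.

import pandas as pd
import dask.dataframe as dd

# A toy two-partition frame just to illustrate the model.
toy = dd.from_pandas(pd.DataFrame({"Dest": ["MCO", "JFK", "MCO"],
                                   "Passengers": [1.0, 2.0, 3.0]}), npartitions=2)
lazy_total = toy[toy.Dest == "MCO"].Passengers.sum()   # still just a recipe, no work done
print(lazy_total.compute())                            # the work happens here -> 4.0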

Let’s take a look at the data now.

import dask.dataframe as dd
from vega_datasets import data
import altair as alt
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString
import json
import altair_theme
%matplotlib inline

data_Q1_2014 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2014_1.csv")
data_Q1_2015 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2015_1.csv")
data_Q1_2016 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2016_1.csv")
data_Q1_2017 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2017_1.csv")
data_Q1_2018 = dd.read_csv("data/flight_origin_destination/Origin_and_Destination_Survey_DB1BMarket_2018_1.csv")
data_Q1_2014.head()
ItinID MktID MktCoupons Year Quarter OriginAirportID OriginAirportSeqID OriginCityMarketID Origin OriginCountry ... BulkFare Passengers MktFare MktDistance MktDistanceGroup MktMilesFlown NonStopMiles ItinGeoType MktGeoType Unnamed: 41
0 201412918419 20141291841902 1 2014 1 11618 1161802 31703 EWR US ... 0.0 1.0 266.5 719.0 2 719.0 719.0 2 2 NaN
1 201412918420 20141291842001 1 2014 1 13930 1393003 30977 ORD US ... 0.0 1.0 266.5 719.0 2 719.0 719.0 2 2 NaN
2 201412918420 20141291842002 1 2014 1 11618 1161802 31703 EWR US ... 0.0 1.0 266.5 719.0 2 719.0 719.0 2 2 NaN
3 201412918421 20141291842101 1 2014 1 13930 1393003 30977 ORD US ... 0.0 1.0 268.0 719.0 2 719.0 719.0 2 2 NaN
4 201412918421 20141291842102 1 2014 1 11618 1161802 31703 EWR US ... 0.0 1.0 268.0 719.0 2 719.0 719.0 2 2 NaN

5 rows × 42 columns

# Set some indicators so we know which period each row came from. There is no such indicator in the raw files.
data_Q1_2014['Period'] = 201401
data_Q1_2014['HIMSS'] = 'No HIMSS'
data_Q1_2015['Period'] = 201501
data_Q1_2015['HIMSS'] = 'HIMSS'
data_Q1_2016['Period'] = 201601
data_Q1_2016['HIMSS'] = 'No HIMSS'
data_Q1_2017['Period'] = 201701
data_Q1_2017['HIMSS'] = 'HIMSS'
data_Q1_2018['Period'] = 201801
data_Q1_2018['HIMSS'] = 'No HIMSS'

# Dask uses append because it fits the model of lazy, parallel calculation,
# whereas concat would rely on the whole dataset being available.
all_years = data_Q1_2014.append(data_Q1_2015).append(data_Q1_2016).append(data_Q1_2017).append(data_Q1_2018)

mco_totals_by_year_task = all_years[all_years.Dest == 'MCO'].groupby(['Period','HIMSS']).agg({'Passengers':'sum', 'MktFare':'mean'})
mco_totals_by_airport_task = all_years[all_years.Dest == 'MCO'].groupby('Origin').agg({'Passengers':'sum', 'MktFare':'mean'})

Each row is an itinerary for one passenger between an origin and destination, with the fare and distance. We are focused only on traffic into Orlando (MCO).

The above commands essentially create two tasks that won't be processed until called for. In the next command, Dask will compute the aggregates and return a dataframe of the result.

mco_totals_by_year = mco_totals_by_year_task.compute()
mco_totals_by_airport = mco_totals_by_airport_task.compute()

mco_totals_by_airport.sort_values('Passengers').tail()
Passengers MktFare
Origin
ORD 66146.0 217.768325
MSP 66435.0 201.077036
DTW 67107.0 195.274883
PHL 81802.0 184.273720
EWR 84991.0 223.100379

Minneapolis, Detroit, and Chicago are all major airline hubs. For the visualization we will need a few other data sources: geographic shapes for US states, and latitude and longitude locations for major airports. These data sources are included in the Altair/Vega example datasets and are easily obtained using the imports at the top.

us_states = alt.topo_feature(data.us_10m.url, 'states')

The following chart will provide a base United States map for the visualization.

usa_chart = alt.Chart(us_states).mark_geoshape(fill='#dedede', stroke='white')
usa_chart

We can see that this looks a little funny. Because the standard projection is a world Mercator, we are seeing the map based on the relative positions of the 48 contiguous states plus Alaska and Hawaii, whereas we are used to seeing the US presented in a more centered format. This is something we will correct later.

After we develop each part of our chart (base map, airports, routes), we will correct the geographic projection all at once.
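For reference, that correction is just Altair's built-in Albers USA projection; a quick preview on the base map alone would look like this:

# Preview of the projection fix we will apply to the full layered chart at the end.
usa_chart.project(type='albersUsa')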

airports = pd.read_csv(data.airports.url)

merged_airport_data = airports.join(mco_totals_by_airport, on='iata',   how='inner')

We are also going to filter the data to focus on large airports (those with over 10,000 passengers of traffic to Orlando) in the 50 states.

key_airports = merged_airport_data[(merged_airport_data.Passengers > 10000) & 
                                   (merged_airport_data.state != 'PR')]
key_airports.head()
iata name city state country latitude longitude Passengers MktFare dest_latitude dest_longitude route_name
771 ACY Atlantic City International Atlantic City NJ USA 39.457583 -74.577167 16133.0 88.635223 28.428889 -81.316028 ACY_MCO
825 ALB Albany Cty Albany NY USA 42.748119 -73.802979 20567.0 204.663398 28.428889 -81.316028 ALB_MCO
880 ATL William B Hartsfield-Atlanta Intl Atlanta GA USA 33.640444 -84.426944 55434.0 176.386318 28.428889 -81.316028 ATL_MCO
890 AUS Austin-Bergstrom International Austin TX USA 30.194533 -97.669872 17309.0 193.862941 28.428889 -81.316028 AUS_MCO
933 BDL Bradley International Windsor Locks CT USA 41.938874 -72.683228 34742.0 188.418242 28.428889 -81.316028 BDL_MCO
airport_chart = alt.Chart(key_airports[['name', 'latitude', 'Passengers', 'longitude']]).mark_point(
    fill='steelblue', stroke='white', filled=True
).encode(
    tooltip='name:N',
    size='Passengers:Q',
    latitude='latitude:Q',
    longitude='longitude:Q')
airport_chart

Great, this shows us the airports sized by their traffic to Orlando. Like our US map, it looks a little funny, but we will fix that at the end. We can add one more element: lines representing the routes between the cities. The code below creates columns for the destination (which is Orlando, at a constant latitude/longitude).

key_airports['dest_latitude'] = 28.428889
key_airports['dest_longitude'] = -81.316028

We are also going to give each route a name so that we can instruct the visualization to draw one line per route.

key_airports['route_name'] = merged_airport_data.iata + '_' + 'MCO'

Still, the data is formatted incorrectly for a line chart. Line charts work in two ways: one connects multiple points into a single continuous line; the other draws multiple separate lines, one per series, which is what we want. So we want to arrange the data so that the origin and destination points sit vertically in the same columns, with 'route_name' identifying which line is which. The code below creates that structure by concatenating the origin points with the destination points into the same columns.

line_airport_points = pd.concat((
    key_airports[['route_name', 'latitude', 'longitude', 'Passengers']],
    key_airports[['route_name', 'dest_latitude', 'dest_longitude', 'Passengers']].rename(
        columns={'dest_latitude': 'latitude', 'dest_longitude': 'longitude'})
))
line_airport_points.head()
route_name latitude longitude Passengers
771 ACY_MCO 39.457583 -74.577167 16133.0
825 ALB_MCO 42.748119 -73.802979 20567.0
880 ATL_MCO 33.640444 -84.426944 55434.0
890 AUS_MCO 30.194533 -97.669872 17309.0
933 BDL_MCO 41.938874 -72.683228 34742.0
routes = alt.Chart(line_airport_points).mark_trail(strokeOpacity=.5).encode(
    latitude='latitude', 
    longitude='longitude', 
    color=alt.Color('Passengers', legend=None, scale=alt.Scale(scheme='purples')), 
    size=alt.Size('Passengers:Q', bin=True, scale=alt.Scale(range=(1,2)), legend=None),
    tooltip='route_name')
routes

The use of mark_trail allows us to size each line. Now all that's left is to put everything together and change the projection so that we can see everything clearly in a US-centric map.

usa_map_combined = alt.layer(usa_chart, airport_chart,  routes, 
                            width=600, height=600).project(type='albersUsa')
usa_map_combined

Comparing Multiple Years

There we have the legions of HIT vendors streaming through American airports to descend on Orlando. But how does it compare to non-HIMSS years? For that, we will compare the HIMSS years (2015 and 2017) against the 2014, 2016, and 2018 data.

The Dask pipeline we set up above loads the data for each year and gives the totals arriving in Orlando for each. This is where Dask really shines, as it is essentially chunking through over 5GB of data rows across five files.
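If you want to watch that chunking happen, Dask's local diagnostics can wrap the compute step; an optional sketch, assuming the default local scheduler:

from dask.diagnostics import ProgressBar

# Optional: show a progress bar while Dask works through the five CSVs
# (re-running the same aggregation from above, just with visible progress).
with ProgressBar():
    mco_totals_by_year = mco_totals_by_year_task.compute()
    mco_totals_by_airport = mco_totals_by_airport_task.compute()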

from importlib import reload
# Reload the local altair_theme module so any theme tweaks take effect.
reload(altair_theme)

base = alt.Chart(mco_totals_by_year.reset_index()).encode(
    x='Period:O')

passengers = base.mark_bar(size=20).encode(
    y=alt.Y('Passengers:Q', axis=alt.Axis(format=",.0f", title='Number of Passengers'),
            scale=alt.Scale(domain=(0, 600000))),
    color=alt.Color('HIMSS:N', scale=alt.Scale(range=['#7e78d2', '#fb62f6'], domain=['No HIMSS', 'HIMSS']))
)

fares = base.mark_line().encode(
    y=alt.Y('MktFare:Q', axis=alt.Axis(format='$.0f', title='Average Fare'),
            scale=alt.Scale(domain=(150, 300)))
)

stats_chart = alt.layer(
    passengers,
    fares,
    width=600,
    height=100
).resolve_scale(
    y='independent'
)


alt.vconcat(usa_map_combined, stats_chart, title="Passenger Traffic Headed to Orlando 2014-2018").configure_title(
    fontSize=20, anchor='middle')

In summary, HIMSS' 50,000+ attendees don't even appear as a blip for the greater Orlando area. At least the average fare does seem to be getting a little lower each year.