New York City Taxi Data Analysis using Python

Name : Phan Vinh Phu

TABLE OF CONTENTS¶

Introduction¶

1. Data Loading¶

  • 1.1 Install & Import Dependencies
  • 1.2 Load the data
  • 1.3 Compare data of 3 months

2. Data Exploration and Pre-processing¶

  • 2.1 Handling missing values
  • 2.2 Identify two columns that have “noisy” (erroneous) values
  • 2.3 Identify 2 columns that are highly correlated and explain their correlation

3. Featurization¶

  • 3.1 Create a feature which is a flag indicating if the trip is in rush-hour or not
  • 3.2 Create a feature that encodes the “complexity” of the trip
  • 3.3 Calculate the pickup and drop-off frequency in each taxi zone.

4. Data Analysis¶

  • 4.1 Rank the vendors by popularity
  • 4.2 What are the peak travel hours?
  • 4.3 What is the average distance of the trips on weekdays and weekends?
  • 4.4 What is the average number of passengers in a trip on weekdays and weekends?
  • 4.5 What is the correlation between the fare amount and the tip?
  • 4.6 What is the correlation between the fare amount and the number of passengers?

5. Conclusion¶

Introduction¶

This data analysis assignment aims to explore, clean, analyze, and visualize a dataset of 🚕 Taxi Trip Records in New York City to derive meaningful insights and provide actionable recommendations.

Dataset: Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pickup and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP).

Data Dictionary¶

  • VendorID: A code indicating the TPEP provider that supplied the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc.
  • tpep_pickup_datetime: The date and time when the meter was engaged.
  • tpep_dropoff_datetime: The date and time when the meter was disengaged.
  • Passenger_count: The number of passengers in the vehicle.
  • Trip_distance: The distance of the trip in miles.
  • PULocationID: The pickup location ID, corresponding to the taxi zone where the taximeter was engaged.
  • DOLocationID: The dropoff location ID, corresponding to the taxi zone where the taximeter was disengaged.
  • RatecodeID: The rate code for the trip. 1 = Standard rate, 2 = JFK, 3 = Newark, 4 = Nassau or Westchester, 5 = Negotiated fare, 6 = Group ride.
  • store_and_fwd_flag: Indicates whether the trip record was held in vehicle memory before being sent to the vendor. Y = store and forward; N = not a store-and-forward trip.
  • payment_type: A numeric code indicating the payment method. 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided trip.
  • fare_amount: The time-and-distance fare calculated by the meter.
  • extra: Extra charges. Currently, this only includes the 0.50 and 1.00 dollar rush-hour and overnight charges.
  • MTA_tax: The 0.50 dollar MTA tax that is automatically triggered based on the metered rate in use.
  • improvement_surcharge: The 0.30 dollar improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
  • tip_amount: The tip amount. This field is automatically populated for credit card tips. Cash tips are not included.
  • tolls_amount: The total amount of all tolls paid in the trip.
  • total_amount: The total amount charged to passengers. This field includes the metered fare, extra charges, MTA tax, tip_amount, and tolls_amount plus any improvement_surcharge or ehail_fee.
  • congestion_surcharge: The 2.75 dollar congestion surcharge assessed on trips in yellow and green taxis in Manhattan south of 96th St. The surcharge began being levied in 2019.
  • Airport_fee: A 1.25 dollar fee, applied for pickups only, at LaGuardia and John F. Kennedy Airports.
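
The coded fields above (payment_type, RatecodeID) are easier to read when decoded. As a minimal, self-contained sketch (not part of the original assignment), look-up dictionaries built from the values in this data dictionary can be mapped onto the coded columns later in the notebook:

In [ ]:
import pandas as pd

# Look-up tables for the coded fields, taken from the data dictionary above
payment_labels = {1: 'Credit card', 2: 'Cash', 3: 'No charge',
                  4: 'Dispute', 5: 'Unknown', 6: 'Voided trip'}
rate_labels = {1: 'Standard rate', 2: 'JFK', 3: 'Newark',
               4: 'Nassau or Westchester', 5: 'Negotiated fare', 6: 'Group ride'}

# Example usage on a toy Series; a real column is mapped the same way,
# e.g. df['payment_type'].map(payment_labels)
print(pd.Series([1, 2, 1, 4]).map(payment_labels))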

1. Data Loading ¶

1.1 Install & Import Dependencies ¶

Mount drive¶

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import geopandas as gpd

import warnings
import os
import gc

1.2 Load the dataset into dataframes ¶

In [ ]:
# Define file paths
taxi_zones_PATH = "/content/drive/MyDrive/Assignment3/taxi_zone"
DATA_PATH = "/content/drive/MyDrive/Assignment3/tripdata"

# Read the Parquet dataframes
yellow_jan_df = pd.read_parquet(os.path.join(DATA_PATH, 'yellow_tripdata_2023-01.parquet'))
yellow_mar_df = pd.read_parquet(os.path.join(DATA_PATH, 'yellow_tripdata_2023-03.parquet'))
yellow_jun_df = pd.read_parquet(os.path.join(DATA_PATH, 'yellow_tripdata_2023-06.parquet'))

# Load taxi zone lookup data
zones_lookup_df = pd.read_csv("/content/drive/MyDrive/Assignment3/taxi_zone_lookup.csv")
zone_shp = gpd.read_file(os.path.join(taxi_zones_PATH, 'taxi_zones.shp'))

# Define a list of dataframes
dfs = [yellow_jan_df, yellow_mar_df, yellow_jun_df]

# Define the column names
cols = ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
        'Passenger_count', 'Trip_distance','RateCodeID', 'store_and_fwd_flag',
        'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount',
        'extra', 'MTA_tax','tip_amount' , 'tolls_amount', 'improvement_surcharge',
        'total_amount', 'congestion_surcharge', 'Airport_fee']

# Rename columns in each dataframe
for df in dfs:
    df.columns = cols
In [ ]:
yellow_jan_df.head()
Out[ ]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime Passenger_count Trip_distance RateCodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra MTA_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge Airport_fee
0 2 2023-01-01 00:32:10 2023-01-01 00:40:36 1.0 0.97 1.0 N 161 141 2 9.3 1.00 0.5 0.00 0.0 1.0 14.30 2.5 0.00
1 2 2023-01-01 00:55:08 2023-01-01 01:01:27 1.0 1.10 1.0 N 43 237 1 7.9 1.00 0.5 4.00 0.0 1.0 16.90 2.5 0.00
2 2 2023-01-01 00:25:04 2023-01-01 00:37:49 1.0 2.51 1.0 N 48 238 1 14.9 1.00 0.5 15.00 0.0 1.0 34.90 2.5 0.00
3 1 2023-01-01 00:03:48 2023-01-01 00:13:25 0.0 1.90 1.0 N 138 7 1 12.1 7.25 0.5 0.00 0.0 1.0 20.85 0.0 1.25
4 2 2023-01-01 00:10:29 2023-01-01 00:21:19 1.0 1.43 1.0 N 107 79 1 11.4 1.00 0.5 3.28 0.0 1.0 19.68 2.5 0.00
In [ ]:
yellow_jan_df.info(show_counts = True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3066766 entries, 0 to 3066765
Data columns (total 19 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   VendorID               3066766 non-null  int64         
 1   tpep_pickup_datetime   3066766 non-null  datetime64[ns]
 2   tpep_dropoff_datetime  3066766 non-null  datetime64[ns]
 3   Passenger_count        2995023 non-null  float64       
 4   Trip_distance          3066766 non-null  float64       
 5   RateCodeID             2995023 non-null  float64       
 6   store_and_fwd_flag     2995023 non-null  object        
 7   PULocationID           3066766 non-null  int64         
 8   DOLocationID           3066766 non-null  int64         
 9   payment_type           3066766 non-null  int64         
 10  fare_amount            3066766 non-null  float64       
 11  extra                  3066766 non-null  float64       
 12  MTA_tax                3066766 non-null  float64       
 13  tip_amount             3066766 non-null  float64       
 14  tolls_amount           3066766 non-null  float64       
 15  improvement_surcharge  3066766 non-null  float64       
 16  total_amount           3066766 non-null  float64       
 17  congestion_surcharge   2995023 non-null  float64       
 18  Airport_fee            2995023 non-null  float64       
dtypes: datetime64[ns](2), float64(12), int64(4), object(1)
memory usage: 444.6+ MB

1.3 Compare data of 3 months ¶

The total number of passengers by month¶

The code extracts the total passenger counts for three months (January, March, and June) from the New York City yellow taxi trip data and visualizes the comparison using a horizontal bar chart.

In [ ]:
# Extract passenger counts for each month
passenger_counts_jan = (yellow_jan_df['Passenger_count']).sum()
passenger_counts_mar = (yellow_mar_df['Passenger_count']).sum()
passenger_counts_jun = (yellow_jun_df['Passenger_count']).sum()

# Set the style
sns.set(style="white")
plt.figure(figsize=(10,3))

ax1 = sns.barplot(x=[passenger_counts_jan, passenger_counts_mar, passenger_counts_jun], y= ['January', 'March', 'June'], palette=['#00b894', '#00b894','#00b894'], orient= "h", width = 0.3)
ax1.set_xlim([0,5000000])
for bars_group in ax1.containers:
    ax1.bar_label(bars_group, padding=3, fontsize=15)


# Set labels and title
sns.despine()
plt.xlabel('Passenger Count')
plt.title('Passenger Count Comparison for January, March, and June');
[Figure: Passenger Count Comparison for January, March, and June]

Seasonal Variation: The number of passengers varies seasonally. There are more passengers during the spring and summer months (March and June) than in the winter month (January), with March carrying the largest number of passengers. This could be due to weather conditions and increased tourist activity.

The distribution of passenger_count for each month¶

In [ ]:
# List of dataframes for the three months
dataframes = [yellow_jan_df, yellow_mar_df, yellow_jun_df]
months = ['January', 'March', 'June']

# Create a figure with three subplots in one row and three columns
fig, axes = plt.subplots(1, 3, figsize=(12, 5))

# Main title for the entire set of subplots
fig.suptitle('Distribution of Passenger Count', fontsize=14)

# Iterate over the dataframes and months
for i, df in enumerate(dataframes):
    ax = axes[i]
    sns.histplot(df['Passenger_count'], kde=False, ax=ax, bins=10, color='#00b894')
    ax.set_title(months[i])  # Set the subtitle for each subplot
    ax.set_xlabel('Passenger Count')
    ax.set_ylabel('Frequency')

# Adjust layout
sns.despine()
plt.tight_layout()
plt.subplots_adjust(top=0.85)  # Adjust the position of the main title
plt.show()
[Figure: Distribution of Passenger Count for January, March, and June]

The code creates a set of three subplots to display the distribution of passenger counts for different months (January, March, and June) using histograms.

Here we see that most trips carry only one or two passengers; large groups travelling together are rare. There is no significant difference between the passenger_count distributions of the three months.
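
As a quick check of this claim (a sketch, not part of the original analysis), the share of trips with one or two passengers can be computed per month:

In [ ]:
# Share of trips with 1 or 2 passengers in each month
# (rows with missing Passenger_count count as False in isin)
for name, df in zip(['January', 'March', 'June'],
                    [yellow_jan_df, yellow_mar_df, yellow_jun_df]):
    share = df['Passenger_count'].isin([1.0, 2.0]).mean()
    print(f"{name}: {share:.1%} of trips carry 1-2 passengers")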

Pricing and Fare¶

In [ ]:
# List of columns related to pricing and fare to compare
columns_to_compare = ['total_amount', 'fare_amount', 'extra', 'improvement_surcharge']

# Create a figure with four subplots (2 rows and 2 columns)
fig, axes = plt.subplots(2, 2, figsize=(8, 8))

colors = sns.color_palette(['#00b894','#FFE8D1','#568EA3'])

# Iterate over the columns
for i, column in enumerate(columns_to_compare):
    # Calculate the row and column indices for the subplot
    row_index = i // 2  # Integer division to determine the row
    col_index = i % 2   # Modulus to determine the column

    # Create a dataset with data from all three months for the current column
    data = pd.concat([yellow_jan_df[column], yellow_mar_df[column], yellow_jun_df[column]], axis=1)
    data.columns = ['January', 'March', 'June']

    # Create a boxplot for the current column in the appropriate subplot
    sns.boxplot(data=data, palette= colors, ax=axes[row_index, col_index])
    axes[row_index, col_index].set_title(f'{column.upper()}')
    if column in ['total_amount', 'fare_amount']:
        axes[row_index, col_index].set_ylim([0,100])
    elif column == "improvement_surcharge":
        axes[row_index, col_index].set_ylim([0,15])

# Adjust layout
plt.tight_layout()
plt.show()
[Figure: Box plots of total_amount, fare_amount, extra, and improvement_surcharge for January, March, and June]

The code creates a set of four box plots in a 2x2 grid to compare the distribution of pricing and fare-related columns ('total_amount', 'fare_amount', 'extra', 'improvement_surcharge') across three different months (January, March, and June) using different colors and limited y-axis scales for specific columns.

According to the graphs:

  • Most total_amount values fall between roughly 15 and 30 dollars.
  • Overall, prices and fares in June are slightly higher than in January and March, and those in March are higher than in January; however, the differences between the three months are not significant.
  • The factor that contributes most to the differences in total_amount is fare_amount.
  • There are some outliers that could distort the analysis.

The distribution of pickup and dropoff day of the week¶

In [ ]:
# List of dataframes for the three months
dataframes = [yellow_jan_df, yellow_mar_df, yellow_jun_df]
months = ['January', 'March', 'June']

# Define the order of days of the week
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Create a figure with three rows and two columns
fig, axes = plt.subplots(3, 2, figsize=(15, 10))


# Iterate over the dataframes and months
for i, (df, month) in enumerate(zip(dataframes, months)):
    row_index = i  # Determine the row index

    # Calculate day of the week for pickups and dropoffs
    df['pickup_day'] = df['tpep_pickup_datetime'].dt.day_name()
    df['dropoff_day'] = df['tpep_dropoff_datetime'].dt.day_name()

    # Plot the number of pickups for each day of the week
    sns.countplot(y='pickup_day', data=df, ax=axes[row_index, 0], order=days_order, color="#00b894", width = 0.3)
    axes[row_index, 0].set_title(f'Pickups in {month}')
    axes[row_index, 0].set_xlabel('')
    axes[row_index, 0].set_ylabel('')

    # Plot the number of dropoffs for each day of the week
    sns.countplot(y='dropoff_day', data=df, ax=axes[row_index, 1], order=days_order, color="#fdcb6e", width = 0.3)
    axes[row_index, 1].set_title(f'Dropoffs in {month}')
    axes[row_index, 1].set_xlabel('')
    axes[row_index, 1].set_ylabel('')

# Adjust layout
sns.despine()
plt.tight_layout()
plt.show()
[Figure: Pickups and dropoffs by day of the week for January, March, and June]

This code creates a 3x2 grid of countplots to visualize the number of pickups and dropoffs on different days of the week for three different months (January, March, and June). The distribution of Pickup and Drop Off day of the week can be seen graphically.

In January, taxi travel on Monday and Wednesday is lower than on the other days of the week. In March and June, by contrast, Sunday and Monday have the fewest pick-ups and drop-offs.

2. Data Exploration and Pre-processing ¶

Delete the March and June taxi dataframes to free memory¶

In [ ]:
# Del dataframe
del yellow_mar_df, yellow_jun_df

# Run garbage collection
gc.collect()
Out[ ]:
15753

2.1 Handling missing values ¶

In [ ]:
# Check missing values
yellow_jan_df.isnull().sum()
Out[ ]:
VendorID                     0
tpep_pickup_datetime         0
tpep_dropoff_datetime        0
Passenger_count          71743
Trip_distance                0
RateCodeID               71743
store_and_fwd_flag       71743
PULocationID                 0
DOLocationID                 0
payment_type                 0
fare_amount                  0
extra                        0
MTA_tax                      0
tip_amount                   0
tolls_amount                 0
improvement_surcharge        0
total_amount                 0
congestion_surcharge     71743
Airport_fee              71743
pickup_day                   0
dropoff_day                  0
dtype: int64
In [ ]:
# Distribution of RateCodeID
yellow_jan_df["RateCodeID"].value_counts()
Out[ ]:
1.0     2839305
2.0      114239
5.0       15043
99.0      13106
3.0        8958
4.0        4366
6.0           6
Name: RateCodeID, dtype: int64
In [ ]:
# Distribution of store_and_fwd_flag
yellow_jan_df["store_and_fwd_flag"].value_counts()
Out[ ]:
N    2975020
Y      20003
Name: store_and_fwd_flag, dtype: int64

Here, the data has a significant number of missing values (71,743 in each of five columns), so it is essential to choose an appropriate strategy for handling them.

In this case, I choose to fill the missing values of Passenger_count, congestion_surcharge, and Airport_fee with the median, because the distributions of these numerical features are mostly skewed or non-normal. Besides that, the price and fare plots above revealed some outliers, and the median is robust to outliers, which gives a better representation of the central value.

For RateCodeID and store_and_fwd_flag, I replace the missing values with the most frequent values, which are 1.0 for RateCodeID and N for store_and_fwd_flag.

In [ ]:
# Impute missing values with the median
columns_to_impute = ["Passenger_count", "congestion_surcharge", "Airport_fee"]

for column in columns_to_impute:
    median_value = yellow_jan_df[column].median()
    yellow_jan_df[column].fillna(median_value, inplace=True)

# Replace missing values in RateCodeID and store_and_fwd_flag with the most frequent value
yellow_jan_df['RateCodeID'].fillna(1.0, inplace=True)
yellow_jan_df['store_and_fwd_flag'].fillna("N", inplace=True)

# Verify that missing values have been imputed
yellow_jan_df.isnull().sum()
Out[ ]:
VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
Passenger_count          0
Trip_distance            0
RateCodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
MTA_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
congestion_surcharge     0
Airport_fee              0
pickup_day               0
dropoff_day              0
dtype: int64

2.2 Identify two columns that have “noisy” (erroneous) values. ¶

Trip_distance¶

In [ ]:
sns.boxplot(yellow_jan_df['Trip_distance'])
plt.title("The box plot of Trip_distance")
plt.xlabel("Trip_distance");
plt.ylabel("Miles");
[Figure: Box plot of Trip_distance]

This code generates a box plot to visualize the distribution of trip distances in January and adds labels and a title to the plot.

It can be seen that Trip_distance has numerous outliers which are remarkably large.

In [ ]:
# Filter the DataFrame to get instances where Trip_distance is greater than 1000
filtered_df = yellow_jan_df[yellow_jan_df['Trip_distance'] > 1000]

# Print how many such trips there are
print(filtered_df['Trip_distance'].value_counts().sum())
67

According to the data, there are 45,862 trips with a recorded distance of zero miles. The reasons for a zero distance can be:

  • The dropoff location couldn’t be tracked.
  • The driver deliberately took this ride to complete a target ride number.
  • The passengers canceled the trip.

The total area of the five boroughs that make up New York City is about 320 sq miles; the longest distance between its boundaries, from the northeast to the southwest, is about 35 miles.

The Trip_distance data has 67 trips longer than 1,000 miles. These distances are unrealistic and exceed the geographical limits of the area, so they should be considered "noisy" (erroneous) values. The solution is to filter them out, as sketched below and applied for real in the clean_data function of Section 2.3.
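
A minimal sketch of that check and filter, assuming the January dataframe defined above:

In [ ]:
# Count the suspicious rows
zero_dist = (yellow_jan_df['Trip_distance'] == 0).sum()
very_long = (yellow_jan_df['Trip_distance'] > 1000).sum()
print(f"Zero-distance trips: {zero_dist}, trips over 1,000 miles: {very_long}")

# Keep only plausible distances (illustrative; the real cleaning happens in clean_data)
plausible = yellow_jan_df[(yellow_jan_df['Trip_distance'] > 0) &
                          (yellow_jan_df['Trip_distance'] <= 1000)]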

Passenger Count¶

In [ ]:
yellow_jan_df["Passenger_count"].value_counts()
Out[ ]:
1.0    2333143
2.0     451536
3.0     106353
4.0      53745
0.0      51164
5.0      42681
6.0      28124
8.0         13
7.0          6
9.0          1
Name: Passenger_count, dtype: int64

There are 51,164 trips with a passenger count of 0, and 20 trips with more than 6 passengers. Such counts are unusual: zero passengers should not produce a metered trip, and more than six exceeds the capacity of a standard taxi. These values should be considered "noisy" (erroneous); extreme values in this column could indicate data entry errors or issues with data collection.

2.3 Identify 2 columns that are highly correlated and explain their correlation ¶

Data Preprocessing¶

The clean_data function is designed to prepare a dataset for analysis by applying various data cleaning and preprocessing steps. It:

  1. Filters out rows with non-positive values of Trip_distance, fare_amount, total_amount, and Passenger_count.
  2. Filters out rows with extreme values of Passenger_count (> 6) and Trip_distance (> 1,000 miles).
  3. Removes rows whose pickup and dropoff times are identical.

The function ensures data quality and consistency for meaningful analysis.

In [ ]:
def clean_data(df):
    # Drop negative or zero values of trip_distance, fare_amount, total_amount, passenger_count
    df = df[(df['Trip_distance'] > 0) & (df['fare_amount'] > 0) & (df['total_amount'] > 0) & (df['Passenger_count'] > 0)]

    # Drop row with passenger_count > 6, trip_distance > 1000
    df = df[(df['Passenger_count'] <= 6) & (df['Trip_distance'] <= 1000)]

    # Drop rows with same pickup and dropoff time
    df = df[df['tpep_pickup_datetime'] != df['tpep_dropoff_datetime']]

    print("Finish Cleaning")
    return df
In [ ]:
yellow_jan_df = clean_data(yellow_jan_df)
Finish Cleaning

Examine the correlation¶

In [ ]:
# Select the numerical columns for the scatter plot matrix
numerical_columns = ['Passenger_count', 'Trip_distance', 'fare_amount',
                     'extra', 'MTA_tax', 'improvement_surcharge', 'tip_amount', 'tolls_amount',
                     'total_amount', 'congestion_surcharge', 'Airport_fee']

# Create a DataFrame with only the selected numerical columns
numerical_data = yellow_jan_df[numerical_columns]

#Get correlation data
correlation_matrix = numerical_data.corr()
new_correlation_matrix = correlation_matrix.iloc[1:, :-1]
# ones_like can build a matrix of booleans (True, False) with the same shape as our data
ones_corr = np.ones_like(correlation_matrix, dtype = bool)

# return only upper triangle matrix
mask = np.triu(ones_corr)
new_mask = mask[1:,0:-1]

fig, ax = plt.subplots(figsize=(8,6))

cmap = sns.diverging_palette(0, 230, 90, 60, as_cmap = True)
sns.heatmap(data=new_correlation_matrix, mask = new_mask,
           annot=True, fmt='.2f', vmin=-1, vmax = 1, cmap = cmap,
           linecolor='white', linewidths = 0.5);

yticks = [i.upper() for i in new_correlation_matrix.index]
xticks = [i.upper() for i in new_correlation_matrix.columns]

ax.set_yticklabels(yticks, rotation=0);
ax.set_xticklabels(xticks, rotation = 90);
title = 'CORRELATION MATRIX\n'
ax.set_title(title, loc='left', fontsize = 18);
[Figure: Correlation matrix heatmap of the numerical columns]

This code creates a correlation matrix heatmap to visualize the relationships between the numerical columns in the January DataFrame.

The graph indicates that:

  1. There are strong positive relationships between Trip_distance, total_amount, and fare_amount: as the trip distance (Trip_distance) increases, the total fare (total_amount) and fare amount (fare_amount) tend to increase as well. This is a logical relationship, as longer trips typically result in higher fares.

  2. Additionally, total_amount, tip_amount, and tolls_amount are positively related. This makes sense since total_amount includes the fare amount (fare_amount) and additional charges such as tolls and tips; when any of these components increases, the total amount charged to passengers (total_amount) also increases. The same pairs can be extracted programmatically, as sketched below.
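
A sketch of that programmatic extraction, reusing the correlation_matrix computed above:

In [ ]:
# Rank column pairs by correlation, using the lower triangle to avoid duplicates
lower = correlation_matrix.where(
    np.tril(np.ones(correlation_matrix.shape, dtype=bool), k=-1))
corr_pairs = lower.stack().sort_values(ascending=False)
print(corr_pairs.head(5))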

3. Featurization¶

3.1 Create a feature which is a flag indicating if the trip is in rush-hour or not.¶

Extract and create pickup_hour and dropoff_hour from this datetime¶

In [ ]:
yellow_jan_df['pickup_hour']=yellow_jan_df['tpep_pickup_datetime'].dt.hour
yellow_jan_df['dropoff_hour']=yellow_jan_df['tpep_dropoff_datetime'].dt.hour

Create a feature which is a flag indicating if the trip is in rush-hour or not.¶

In [ ]:
# Define the rush-hour time periods
morning_rush_hours = range(7, 10)
evening_rush_hours = range(16, 19)

# Create a function to check if the pickup hour is during rush hour
def is_rush_hour(pickup_hour):
    if pickup_hour in morning_rush_hours or pickup_hour in evening_rush_hours:
        return 1  # It's rush hour
    else:
        return 0  # It's not rush hour

# Apply the function to create the rush-hour flag column
yellow_jan_df['is_rush_hour'] = yellow_jan_df['pickup_hour'].apply(is_rush_hour)
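
On roughly three million rows, apply with a Python function is slow; an equivalent vectorized version (a sketch using the same hour ranges defined above):

In [ ]:
# Vectorized alternative: isin over the combined rush-hour ranges
rush_hours = list(morning_rush_hours) + list(evening_rush_hours)
yellow_jan_df['is_rush_hour'] = yellow_jan_df['pickup_hour'].isin(rush_hours).astype(int)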
In [ ]:
yellow_jan_df[["pickup_hour", "is_rush_hour"]].value_counts()
Out[ ]:
pickup_hour  is_rush_hour
18           1               208189
17           1               201467
15           0               188477
16           1               188025
19           0               186133
14           0               183876
13           0               171525
12           0               163163
20           0               160027
21           0               156261
11           0               148100
22           0               142542
10           0               137917
9            1               125976
8            1               112273
23           0               110426
7            1                83247
0            0                81716
1            0                57402
6            0                41841
2            0                40156
3            0                26025
5            0                16891
4            0                16634
dtype: int64

3.2 Create a feature that encodes the “complexity” of the trip.¶

Merge the yellow taxi dataframe with taxi lookup csv file¶

Define a function to merge data frame with lookup_table and shape file¶

In [ ]:
def merge_taxi_with_zones(taxi_df, zones_df, zone_shp):
    """
    Merges a New York City taxi dataframe with a zones dataframe and zone shape file based on the pickup or dropoff location ID.
    :param taxi_df: New York City taxi dataframe with pickup and dropoff location IDs.
    :param zones_df: NYC taxi zone lookup table dataframe.
    :param zone_shp: zone shape file for the pickup or dropoff location ID in the taxi dataframe.
    :return: Merged dataframe with the additional zone, borough, service zone, shape area, shape length, geometry columns.
    """
    # Merge the taxi dataframe with the look_up dataframe based on the location ID column
    # For pickup
    merged_df = pd.merge(taxi_df, zones_df, left_on='PULocationID', right_on='LocationID')
    merged_df = merged_df.drop(columns='LocationID')
    merged_df = merged_df.rename(columns={'Borough': 'PU_borough', 'Zone': 'PU_zone', 'service_zone': 'PU_service_zone'})
    # For dropoff
    merged_df = pd.merge(merged_df, zones_df, left_on='DOLocationID', right_on='LocationID')
    merged_df = merged_df.drop(columns='LocationID')
    merged_df = merged_df.rename(columns={'Borough': 'DO_borough', 'Zone': 'DO_zone', 'service_zone': 'DO_service_zone'})

    # Merge the taxi dataframe with the shape dataframe based on the location ID column
    # For pickup
    merged_df = pd.merge(merged_df, zone_shp, left_on='PULocationID', right_on='LocationID')
    merged_df = merged_df.drop(columns=['LocationID','OBJECTID', 'zone', 'borough'])
    merged_df = merged_df.rename(columns={'Shape_Leng': 'PU_Shape_Leng', 'Shape_Area': 'PU_Shape_Area', 'geometry': 'PU_geometry'})

    # For dropoff
    merged_df = pd.merge(merged_df, zone_shp, left_on='DOLocationID', right_on='LocationID')
    merged_df = merged_df.drop(columns=['LocationID','OBJECTID', 'zone', 'borough'])
    merged_df = merged_df.rename(columns={'Shape_Leng': 'DO_Shape_Leng', 'Shape_Area': 'DO_Shape_Area', 'geometry': 'DO_geometry'})

    return merged_df
In [ ]:
merged_df = merge_taxi_with_zones(yellow_jan_df, zones_lookup_df, zone_shp)
In [ ]:
merged_df.head()
Out[ ]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime Passenger_count Trip_distance RateCodeID store_and_fwd_flag PULocationID DOLocationID payment_type ... PU_service_zone DO_borough DO_zone DO_service_zone PU_Shape_Leng PU_Shape_Area PU_geometry DO_Shape_Leng DO_Shape_Area DO_geometry
0 2 2023-01-01 00:32:10 2023-01-01 00:40:36 1.0 0.97 1.0 N 161 141 2 ... Yellow Zone Manhattan Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21...
1 2 2023-01-01 00:21:43 2023-01-01 00:27:52 1.0 2.00 1.0 N 161 141 2 ... Yellow Zone Manhattan Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21...
2 2 2023-01-01 00:11:44 2023-01-01 00:20:38 2.0 1.83 1.0 N 161 141 1 ... Yellow Zone Manhattan Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21...
3 2 2023-01-01 00:35:52 2023-01-01 00:43:38 2.0 1.41 1.0 N 161 141 2 ... Yellow Zone Manhattan Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21...
4 1 2023-01-01 00:51:12 2023-01-01 00:59:38 1.0 1.80 1.0 N 161 141 2 ... Yellow Zone Manhattan Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21...

5 rows × 36 columns

After merging the taxi dataframe with the lookup table and the zone shape file, I have the final dataframe with information about the zones and their geometry. To calculate the straight-line distance of a taxi trip:

  • If the pick-up and drop-off zones are different, I calculate the straight-line distance between the centroids of the pick-up zone and the drop-off zone.

  • If the pick-up and drop-off zones are the same, the zone's shape length (converted to miles) is used as the straight-line distance.

Create a feature for the straight-line distance of the trip.¶

In [ ]:
# Structure of merged_df
merged_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2897101 entries, 0 to 2897100
Data columns (total 36 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   Passenger_count        float64       
 4   Trip_distance          float64       
 5   RateCodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  MTA_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  Airport_fee            float64       
 19  pickup_day             object        
 20  dropoff_day            object        
 21  pickup_hour            int64         
 22  dropoff_hour           int64         
 23  is_rush_hour           int64         
 24  PU_borough             object        
 25  PU_zone                object        
 26  PU_service_zone        object        
 27  DO_borough             object        
 28  DO_zone                object        
 29  DO_service_zone        object        
 30  PU_Shape_Leng          float64       
 31  PU_Shape_Area          float64       
 32  PU_geometry            geometry      
 33  DO_Shape_Leng          float64       
 34  DO_Shape_Area          float64       
 35  DO_geometry            geometry      
dtypes: datetime64[ns](2), float64(16), geometry(2), int64(7), object(9)
memory usage: 817.8+ MB
In [ ]:
# Calculate straight-line distance for the entire DataFrame
merged_df['straight_distance'] = merged_df.apply(
    lambda row: row['PU_Shape_Leng'] * 0.00062137
    if row['PULocationID'] == row['DOLocationID']
    else row['PU_geometry'].centroid.distance(row['DO_geometry'].centroid) * 0.00062137,
    axis=1
)

This code calculates the straight-line distance for each row in the DataFrame. If the pickup and dropoff locations are the same (based on the IDs), it uses the 'PU_Shape_Leng' value converted to miles. Otherwise, it calculates the distance between the centroids of the pickup and dropoff geometries in miles and stores the result in the 'straight_distance' column.
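
On roughly 2.9 million rows, the row-wise apply above is slow. A vectorized GeoPandas sketch of the same logic, assuming both geometry columns share the projected CRS of taxi_zones.shp and the same unit-conversion factor used above:

In [ ]:
# Vectorized centroid-to-centroid distance (the indexes align, so this is element-wise)
pu_cent = gpd.GeoSeries(merged_df['PU_geometry']).centroid
do_cent = gpd.GeoSeries(merged_df['DO_geometry']).centroid
dist = pu_cent.distance(do_cent) * 0.00062137

# Same-zone trips fall back to the zone shape length, as in the apply version
same_zone = merged_df['PULocationID'] == merged_df['DOLocationID']
merged_df['straight_distance'] = dist.where(~same_zone,
                                            merged_df['PU_Shape_Leng'] * 0.00062137)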

In [ ]:
merged_df.head()
Out[ ]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime Passenger_count Trip_distance RateCodeID store_and_fwd_flag PULocationID DOLocationID payment_type ... DO_borough DO_zone DO_service_zone PU_Shape_Leng PU_Shape_Area PU_geometry DO_Shape_Leng DO_Shape_Area DO_geometry straight_distance
0 2 2023-01-01 00:32:10 2023-01-01 00:40:36 1.0 0.97 1.0 N 161 141 2 ... Manhattan Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21... 3.707452
1 2 2023-01-01 00:21:43 2023-01-01 00:27:52 1.0 2.00 1.0 N 161 141 2 ... Manhattan Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21... 3.707452
2 2 2023-01-01 00:11:44 2023-01-01 00:20:38 2.0 1.83 1.0 N 161 141 1 ... Manhattan Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21... 3.707452
3 2 2023-01-01 00:35:52 2023-01-01 00:43:38 2.0 1.41 1.0 N 161 141 2 ... Manhattan Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21... 3.707452
4 1 2023-01-01 00:51:12 2023-01-01 00:59:38 1.0 1.80 1.0 N 161 141 2 ... Manhattan Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21... 3.707452

5 rows × 37 columns

In [ ]:
# Calculating the "distance_complexity"
merged_df["distance_complexity"] = round(abs(merged_df["Trip_distance"] - merged_df["straight_distance"]), 3)

This code computes "distance_complexity" as the absolute difference between the actual trip distance and the straight-line distance, rounded to three decimal places.
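
An alternative encoding worth considering (a sketch, not used in the rest of this notebook) is a detour ratio, which normalizes by trip length instead of taking a raw difference:

In [ ]:
# Detour ratio: values well above 1 indicate a driven route much longer than the
# straight line between zone centroids. straight_distance is positive here, since
# same-zone trips fall back to the zone shape length.
merged_df['detour_ratio'] = merged_df['Trip_distance'] / merged_df['straight_distance']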

In [ ]:
merged_df.head()
Out[ ]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime Passenger_count Trip_distance RateCodeID store_and_fwd_flag PULocationID DOLocationID payment_type ... DO_zone DO_service_zone PU_Shape_Leng PU_Shape_Area PU_geometry DO_Shape_Leng DO_Shape_Area DO_geometry straight_distance distance_complexity
0 2 2023-01-01 00:32:10 2023-01-01 00:40:36 1.0 0.97 1.0 N 161 141 2 ... Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21... 3.707452 2.737
1 2 2023-01-01 00:21:43 2023-01-01 00:27:52 1.0 2.00 1.0 N 161 141 2 ... Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21... 3.707452 1.707
2 2 2023-01-01 00:11:44 2023-01-01 00:20:38 2.0 1.83 1.0 N 161 141 1 ... Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21... 3.707452 1.877
3 2 2023-01-01 00:35:52 2023-01-01 00:43:38 2.0 1.41 1.0 N 161 141 2 ... Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21... 3.707452 2.297
4 1 2023-01-01 00:51:12 2023-01-01 00:59:38 1.0 1.80 1.0 N 161 141 2 ... Lenox Hill West Yellow Zone 0.035804 0.000072 POLYGON ((991081.026 214453.698, 990952.644 21... 0.041514 0.000077 POLYGON ((994839.073 216123.698, 994786.740 21... 3.707452 1.907

5 rows × 38 columns

3.3 Calculate the pickup and drop-off frequency in each taxi zone.¶

In [ ]:
pickups_by_zone = merged_df.groupby(["PU_zone"]).size().reset_index(name='count')
pickups_by_zone = pickups_by_zone.sort_values(['count'], ascending=[False])


dropoffs_by_zone = merged_df.groupby(["DO_zone"]).size().reset_index(name='count')
dropoffs_by_zone = dropoffs_by_zone.sort_values(['count'], ascending=[False])

This code calculates the number of pickups and drop-offs for each zone, creating two DataFrames:

  1. pickups_by_zone: Groups the merged data by the "PU_zone" (pickup zone) and counts the occurrences in each zone, then sorts them in descending order based on the count.

  2. dropoffs_by_zone: Similarly, groups the merged data by the "DO_zone" (drop-off zone), counts the occurrences in each zone, and sorts the result in descending order by count.

In [ ]:
pickups_by_zone.head(10)
Out[ ]:
PU_zone count
117 JFK Airport 147508
225 Upper East Side South 143123
224 Upper East Side North 133630
150 Midtown Center 130550
175 Penn Station/Madison Sq West 105353
151 Midtown East 101806
131 Lincoln Square East 96676
218 Times Sq/Theatre District 95251
127 LaGuardia Airport 85880
159 Murray Hill 85253
In [ ]:
dropoffs_by_zone.head(10)
Out[ ]:
DO_zone count
229 Upper East Side North 141274
230 Upper East Side South 127618
154 Midtown Center 111398
223 Times Sq/Theatre District 85627
163 Murray Hill 85295
232 Upper West Side South 84752
135 Lincoln Square East 84694
134 Lenox Hill West 84414
155 Midtown East 79378
45 Clinton East 73937

Visualise the top 10 Pickup and Dropoff Zones¶

In [ ]:
# Extract the top 10 pickup and dropoff zones
top_10_pickup_zones = pickups_by_zone.head(10)
top_10_dropoff_zones = dropoffs_by_zone.head(10)

# Create subplots for pickup and dropoff zones
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Create bar plots for top 10 pickup zones
sns.barplot(x='count', y='PU_zone', data=top_10_pickup_zones, ax=axes[0], color = "#00b894", width=0.3)
axes[0].set_title('Top 10 Pickup Zones')
axes[0].set_xlabel('')
axes[0].set_ylabel('')

# Create bar plots for top 10 dropoff zones
sns.barplot(x='count', y='DO_zone', data=top_10_dropoff_zones, ax=axes[1], color = "#fdcb6e", width=0.3)
axes[1].set_title('Top 10 Dropoff Zones')
axes[1].set_xlabel('')
axes[1].set_ylabel('')

sns.despine()
plt.tight_layout()
plt.show()
[Figure: Top 10 pickup and dropoff zones]
  • The top pickup zone in New York City is JFK Airport with 147,508 trips, followed by Upper East Side South and Upper East Side North.
  • The top dropoff zone in New York City is Upper East Side North with 141,274 trips, closely followed by Upper East Side South and Midtown Center.

4. Data Analysis¶

4.1 Rank the vendors by popularity¶

In [ ]:
# Count the number of trips for each vendor
vendor_counts = merged_df['VendorID'].value_counts().reset_index()
vendor_counts.columns = ['VendorID', 'Trip Count']

# Sort the vendors by popularity (trip count)
vendor_counts = vendor_counts.sort_values(by='Trip Count', ascending=False)

# Create a bar plot to visualize the rankings
plt.figure(figsize=(10, 3))
ax_vendors = sns.barplot(x='Trip Count', y='VendorID', data=vendor_counts, color = "#00b894", orient = 'h', width = 0.3)
for bars_group in ax_vendors.containers:
    ax_vendors.bar_label(bars_group, padding=3, fontsize=15)
plt.xlabel('Trip Count')
plt.ylabel('VendorID')
plt.title('Vendor Popularity Rankings')
sns.despine()
plt.show()
[Figure: Vendor popularity rankings]

VendorID 2 has a significantly higher trip count compared to VendorID 1, indicating its higher popularity or usage in the dataset.

4.2 What are the peak travel hours?¶

In [ ]:
# Count the number of trips for pickup_hour
pickup_hours_counts = merged_df['pickup_hour'].value_counts().reset_index()
pickup_hours_counts.columns = ['pickup_hour', 'Trip Count']

# Sort the pickup hours by popularity (trip count)
pickup_hours_counts = pickup_hours_counts.sort_values(by='pickup_hour')

# Count the number of trips for dropoff_hour
dropoff_hours_counts = merged_df['dropoff_hour'].value_counts().reset_index()
dropoff_hours_counts.columns = ['dropoff_hour', 'Trip Count']

# Sort the dropoff hours by popularity (trip count)
dropoff_hours_counts = dropoff_hours_counts.sort_values(by='dropoff_hour')

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Plot for pickup hours
sns.barplot(x='pickup_hour', y='Trip Count', data=pickup_hours_counts, ax=axes[0], color="#00b894")
axes[0].set_xlabel('')
axes[0].set_ylabel('Trip Count')
axes[0].set_title('Pickup Hours')

# Plot for dropoff hours
sns.barplot(x='dropoff_hour', y='Trip Count', data=dropoff_hours_counts, ax=axes[1], color="#fdcb6e")
axes[1].set_xlabel('')
axes[1].set_ylabel('')
axes[1].set_title('Dropoff Hours')

# Adjust layout
sns.despine()
plt.tight_layout()
plt.show()
[Figure: Trip counts by pickup hour and dropoff hour]

This code calculates and plots the popularity of pickup and dropoff hours by counting the number of trips for each hour, resulting in two bar plots side by side, where the x-axis represents the hours of the day, and the y-axis represents the trip count.

The graphs show that the peak travel hours are generally between 15:00 and 19:00, the late afternoon and early evening when people are likely traveling to and from work or other activities. On the other hand, the early morning hours from 3:00 to 5:00 have the fewest trips, likely because fewer people travel at those hours.

This information can be valuable for understanding the trends and demand for taxi services during different times of the day.

4.3 What is the average distance of the trips on weekdays and weekends?¶

In [ ]:
# Create new feature weekday_or_weekend
merged_df['weekday_or_weekend'] = merged_df['pickup_day'].apply(lambda x: 1 if x in ["Saturday", "Sunday"] else 0)

# Group by 'weekday_or_weekend' and calculate the average trip distance
average_distance_by_day = merged_df.groupby('weekday_or_weekend')['Trip_distance'].mean().reset_index()

# Create a bar plot
plt.figure(figsize=(10, 3))
ax_trip_distance = sns.barplot(x="Trip_distance", y="weekday_or_weekend", data=average_distance_by_day, palette=['#00b894', '#fdcb6e'], orient = "h", width = 0.3)
ax_trip_distance.set_yticklabels(['Weekday', 'Weekend'])
ax_trip_distance.set_ylabel('Day of the Week')
ax_trip_distance.set_xlabel('Average Trip Distance (miles)')
ax_trip_distance.set_title('Average Trip Distance on Weekdays vs. Weekends')

# Add labels to the bars
for bars_group in ax_trip_distance.containers:
    ax_trip_distance.bar_label(bars_group, padding=3, fontsize=15)

sns.despine()
plt.show()
[Figure: Average trip distance on weekdays vs. weekends]

This code creates a new feature, 'weekday_or_weekend,' indicating whether a day is a weekday (0) or a weekend (1), calculates the average trip distance for both weekdays and weekends, and visualizes the results in a horizontal bar plot with labels showing the difference in average trip distance between weekdays and weekends.

The average trip distance on weekends is 3.4548 miles, slightly longer than the 3.31285 miles on weekdays, indicating a modest increase in trip distance during weekends.

4.4 What is the average number of passengers in a trip on weekdays and weekends?¶

In [ ]:
# Group by 'weekday_or_weekend' and calculate the average trip distance
average_pc_by_day = merged_df.groupby('weekday_or_weekend')['Passenger_count'].mean().reset_index()

# Create a bar plot
plt.figure(figsize=(10, 3))
ax_passenger_count = sns.barplot(x="Passenger_count", y="weekday_or_weekend", data=average_pc_by_day, palette=['#00b894', '#fdcb6e'], orient = "h", width = 0.3)
ax_passenger_count.set_yticklabels(['Weekday', 'Weekend'])
ax_passenger_count.set_ylabel('Day of the Week')
ax_passenger_count.set_xlabel('Average Passenger Count')
ax_passenger_count.set_title('Average Passenger Count on Weekdays vs. Weekends')

# Add labels to the bars
for bars_group in ax_passenger_count.containers:
    ax_passenger_count.bar_label(bars_group, padding=3, fontsize=15)

sns.despine()
plt.show()
[Figure: Average passenger count on weekdays vs. weekends]

This code groups the data by 'weekday_or_weekend,' calculates the average passenger count for both weekdays and weekends, and visualizes the results in a horizontal bar plot that compares the average passenger count between weekdays and weekends.

The average number of passengers in a trip is slightly higher on weekends (1.4647) compared to weekdays (1.3452).

4.5 What is the correlation between the fare amount and the tip?¶

In [ ]:
# Get correlation data for these columns
fare_tip_corr = yellow_jan_df[['fare_amount', 'tip_amount']].corr()

fare_tip_corr
Out[ ]:
fare_amount tip_amount
fare_amount 1.000000 0.610648
tip_amount 0.610648 1.000000

The correlation matrix for the fare_amount and tip_amount columns shows a moderately strong positive correlation of approximately 0.61, indicating that as the fare amount increases, the tip amount tends to increase as well. This positive relationship suggests that passengers tip more when the fare amount is higher.
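
Since the Pearson correlation is sensitive to the fare outliers noted earlier, a rank-based check is a useful complement (a sketch):

In [ ]:
# Spearman rank correlation is robust to outliers and monotone transformations
fare_tip_spearman = yellow_jan_df[['fare_amount', 'tip_amount']].corr(method='spearman')
fare_tip_spearman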

4.6 What is the correlation between the fare amount and the number of passengers?¶

In [ ]:
# Get correlation data for these columns
fare_pass_count_corr = yellow_jan_df[['fare_amount', 'Passenger_count']].corr()

fare_pass_count_corr
Out[ ]:
fare_amount Passenger_count
fare_amount 1.000000 0.027862
Passenger_count 0.027862 1.000000

The correlation between the fare amount and the number of passengers in the provided dataset is quite low, with a correlation coefficient of approximately 0.03. This suggests that there is only a very weak positive relationship between the fare amount and the number of passengers, indicating that the number of passengers has a minimal impact on the fare amount in this dataset.

5. Conclusion¶

5.1 Summary¶

  • The dataset contains 🚕New York City yellow taxi trip data from different months, revealing insights into travel patterns.
  • There's a variation in the average distance, trip duration, and fare amount across the months, potentially reflecting seasonal changes.
  • The passenger count is generally low, with the majority of trips having only one passenger.
  • Pickup and dropoff patterns indicate popular zones, with JFK Airport being the top pickup location (147,508 trips) and Upper East Side North being the top dropoff zone (141,274 trips).
  • Weekday and weekend trips exhibit slight differences in terms of average trip distance and passenger count.
  • Vendor 2 is more popular based on trip count, while Vendor 1 shows higher average fare amounts.
  • There is a moderately strong positive correlation between fare amount and tip amount, which suggests that passengers tend to tip more when the fare amount is higher.
  • There is only a very weak positive relationship between the fare amount and the number of passengers.

5.2 Challenges¶

  • The dataset may contain outliers, missing values, or incorrect data that could affect analysis accuracy.
  • Cleaning the data, such as handling missing values, outliers, and inconsistent data entries, was a necessary but time-consuming task.
  • Limited information about the dataset's specific context can make it challenging to interpret some findings.

5.3 Possible Next Steps¶

  • Further analysis could involve outlier detection and handling to improve data quality.
  • Geospatial analysis could provide insights into trip patterns and the impact of locations on fares.
  • Predictive modeling to forecast fares, trip duration, or passenger count based on different factors could be valuable.
  • Incorporating external data sources (e.g., weather data, special events) for more comprehensive analysis.
  • Conducting more detailed time-series analysis to identify trends over time.
  • Analyzing pricing structures and fare changes across different trip types and vendors.