Analysis on the Potential of Life on Exoplanets

By: Ethan Barr and Tim Freerksen

Introduction

Is there life in space? People have asked this question for many years, yet there is still no real evidence that the answer is yes. Our goal in this project is to take the exoplanets already discovered, estimate how likely each one is to support life, and see how many of them fit the criteria for being able to hold life.

Throughout this tutorial, we will see how many of the exoplanets discovered so far have the right conditions to harbor life, and then estimate the probability that a future exoplanet will be able to hold life.

Required Tools

The following libraries were used for the project:

  1. pandas
  2. re (Python's built-in regex module)
  3. numpy
  4. matplotlib
  5. sklearn
  6. seaborn

If you have issues with Python 3+ or pandas, we recommend the following websites for more information:

  1. https://docs.python.org/3/
  2. https://pandas.pydata.org/pandas-docs/stable/install.html

1. Data Collection

This is the first part of the data life cycle. In this part we will go through various websites to try and find data that both matches our topic at hand as well as gives enough information so that we can perform an analysis later on.

For an exoplanet database, we found that https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=PS gave the most complete information, covering 1989 - 2020. To retrieve this data we first exported the online database to a csv file that we could then read and manipulate.

The following tools were used for the data collection:

  1. pandas
In [1]:
import pandas as pd                     # used to read the csv file and convert it into a DataFrame
import re                               # used to easily gather columns with similar names
import numpy as np                      # used for square roots, powers, and other mathematical operations
import matplotlib.pyplot as plot        # allows for the visualization of data
from sklearn import linear_model        # used for linear regression
import seaborn as sns                   # used to give us a regression plot

Since the exoplanetarchive website allows the database to be downloaded as a csv file, it did not take many steps to access the entire database. It was sufficient to download the database in csv format and add it as one of the files of this project. We could then access this file with the pandas function read_csv, which converts the entire csv file into a flexible and readable DataFrame.

A DataFrame is a table whose rows and columns hold the individual pieces of data. Using a DataFrame gives us access to the pandas functions that make this data much easier to manipulate. If you are interested in learning more about DataFrames, check out the pandas documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

The only issue with the csv when we first read it was that the data already had its own numbering for each row. To combat this we decided to keep the original numbering and use that column, titled loc_rowid, as the row index.
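
As a small illustration of how read_csv turns csv text into a DataFrame, here is a minimal sketch. It does not use the real PS.csv; the three rows are a hand-typed excerpt of values that appear in the table output of this tutorial, and the names csv_text and sample are ours.

```python
import io
import pandas as pd

# Hand-typed excerpt shaped like the archive export (not the real PS.csv file)
csv_text = """loc_rowid,pl_name,hostname,disc_year
1,11 Com b,11 Com,2007
3,11 UMi b,11 UMi,2009
82,55 Cnc e,55 Cnc,2004
"""

# read_csv accepts a path or any file-like object; index_col makes loc_rowid the row label
sample = pd.read_csv(io.StringIO(csv_text), index_col='loc_rowid')

print(sample.shape)               # (number of rows, number of columns)
print(sample.loc[82, 'pl_name'])  # look a value up by row label and column name
```

Passing index_col to read_csv has the same effect as calling set_index afterwards, so either approach can be used.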

In [2]:
data = pd.read_csv('PS.csv')
data.set_index('loc_rowid', inplace=True)
data.head()
Out[2]:
pl_name hostname default_flag sy_snum sy_pnum discoverymethod disc_year disc_facility soltype pl_controv_flag ... sy_vmagerr2 sy_kmag sy_kmagerr1 sy_kmagerr2 sy_gaiamag sy_gaiamagerr1 sy_gaiamagerr2 rowupdate pl_pubdate releasedate
loc_rowid
1 11 Com b 11 Com 0 2 1 Radial Velocity 2007 Xinglong Station Published Confirmed 0 ... -0.023 2.282 0.346 -0.346 4.44038 0.003848 -0.003848 2014-07-23 2011-08 2014-07-23
2 11 Com b 11 Com 1 2 1 Radial Velocity 2007 Xinglong Station Published Confirmed 0 ... -0.023 2.282 0.346 -0.346 4.44038 0.003848 -0.003848 2014-05-14 2008-01 2014-05-14
3 11 UMi b 11 UMi 0 1 1 Radial Velocity 2009 Thueringer Landessternwarte Tautenburg Published Confirmed 0 ... -0.005 1.939 0.270 -0.270 4.56216 0.003903 -0.003903 2018-04-25 14:08:01 2009-10 2014-05-14
4 11 UMi b 11 UMi 0 1 1 Radial Velocity 2009 Thueringer Landessternwarte Tautenburg Published Confirmed 0 ... -0.005 1.939 0.270 -0.270 4.56216 0.003903 -0.003903 2018-04-25 14:08:01 2011-08 2014-07-23
5 11 UMi b 11 UMi 1 1 1 Radial Velocity 2009 Thueringer Landessternwarte Tautenburg Published Confirmed 0 ... -0.005 1.939 0.270 -0.270 4.56216 0.003903 -0.003903 2018-09-04 16:14:36 2017-03 2018-09-06

5 rows × 91 columns

2. Data Processing

Once you have retrieved the data you are looking for and loaded it into a DataFrame, you can move on to this next step. Here we want to tidy up the data that we just read in. This step is important because it makes the data much easier to read and understand. In our case we will alter the structure of the DataFrame through tidying and / or data wrangling.

You can learn more about:

  1. tidying data: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
  2. data wrangling: https://www.elderresearch.com/blog/what-is-data-wrangling-and-why-does-it-take-so-long/#:~:text=Data%20wrangling%20is%20the%20process,20%25%20for%20exploration%20and%20modeling.

We will now go through the steps of tidying up our DataFrame so that any extra columns are discarded, since we will not need them to perform our analysis. We also want to add a few columns to the DataFrame so that when we perform calculations we can easily convert values out of their comparison units (example: 10 Earth Masses, which means that the planet has 10 times the mass of Earth). Another issue with this DataFrame is that, with so many columns, the names may not be very intuitive, so we want to change the column names to make them understandable without a cheat sheet of what each column represents.

Because distances and sizes in space are so hard to measure (we can't just take out a ruler), astronomers typically record what they believe a value is together with the upper and lower limits of what that value could be. We will only take into account the value itself and ignore the upper and lower limits, which makes it much easier to calculate certain properties of the exoplanets and host stars.
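
Renaming columns can be done with the pandas rename method. The sketch below is only an illustration: the readable names on the right-hand side are hypothetical choices of ours, and the one-row DataFrame is typed in by hand from values shown in this tutorial.

```python
import pandas as pd

# Hypothetical readable names (our own choice, not part of the archive)
readable_names = {
    'pl_name': 'planet_name',
    'hostname': 'host_star',
    'disc_year': 'discovery_year',
    'pl_orbsmax': 'semi_major_axis_au',
    'st_teff': 'star_temperature_k',
}

sample = pd.DataFrame({
    'pl_name': ['55 Cnc e'], 'hostname': ['55 Cnc'], 'disc_year': [2004],
    'pl_orbsmax': [0.01544], 'st_teff': [5250.0],
})

# rename returns a new DataFrame; columns missing from the mapping are left unchanged
renamed = sample.rename(columns=readable_names)
print(list(renamed.columns))
```

The code cells in this tutorial keep the archive's original column names, so a mapping like this is optional.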
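
To see how a regular expression can pick out the limit columns, here is a minimal sketch on a plain list of names. The names follow the archive's convention as we understand it (err1 and err2 for the upper and lower uncertainty, lim for the limit flag); the list itself is just an example, not the full set of columns.

```python
import re

# Example column names: a value, its two uncertainty columns, and its limit flag
columns = ['pl_rade', 'pl_radeerr1', 'pl_radeerr2', 'pl_radelim', 'st_teff', 'st_tefferr1']

# Keep a column only if its name contains neither a 1 or 2 nor the substring 'lim'
pattern = re.compile(r'[12]|lim')
kept = [name for name in columns if not pattern.search(name)]

print(kept)  # ['pl_rade', 'st_teff']
```

Note that a pattern this broad would also remove any unrelated column whose name happens to contain a 1 or a 2, so it is worth checking the remaining columns afterwards.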

The following tools were used for Data Processing:

  1. re (regex)
  2. numpy
In [3]:
# This loop systematically removes the Upper and Lower limit values (names containing a 1 or 2)
# as well as the Limit Flags (names containing 'lim') attached to certain values.
# DataFrame.iteritems was removed in pandas 2.0, and columns are dropped inside the loop,
# so we iterate over a copy of the column names instead.
for columnName in list(data.columns):
    if re.search('[12]', columnName) or re.search('lim', columnName):
        data.drop(columnName, axis=1, inplace=True)

# Removing the columns that are not needed but were not captured by the previous for loop
unused_columns = [
    'default_flag', 'discoverymethod', 'disc_facility', 'soltype', 'pl_controv_flag',
    'pl_refname', 'ttv_flag', 'rowupdate', 'pl_pubdate', 'releasedate',
    'sy_refname', 'rastr', 'ra', 'decstr', 'dec',
    'sy_gaiamag', 'sy_kmag', 'sy_vmag', 'st_refname',
    'pl_orbeccen', 'pl_insol', 'sy_snum', 'sy_pnum',
]
data.drop(columns=unused_columns, inplace=True)

data.head()
Out[3]:
pl_name hostname disc_year pl_orbper pl_orbsmax pl_rade pl_radj pl_bmasse pl_bmassj pl_bmassprov pl_eqt st_teff st_rad st_mass st_met st_metratio st_logg sy_dist
loc_rowid
1 11 Com b 11 Com 2007 NaN 1.21 NaN NaN 5434.7000 17.10 Msini NaN NaN NaN 2.60 NaN NaN NaN 93.1846
2 11 Com b 11 Com 2007 326.03000 1.29 NaN NaN 6165.6000 19.40 Msini NaN 4742.0 19.00 2.70 -0.35 [Fe/H] 2.31 93.1846
3 11 UMi b 11 UMi 2009 516.22000 1.54 NaN NaN 3337.0700 10.50 Msini NaN 4340.0 24.08 1.80 0.04 [Fe/H] 1.60 125.3210
4 11 UMi b 11 UMi 2009 NaN 1.51 NaN NaN 3432.4000 10.80 Msini NaN NaN NaN 1.70 NaN NaN NaN 125.3210
5 11 UMi b 11 UMi 2009 516.21997 1.53 NaN NaN 4684.8142 14.74 Msini NaN 4213.0 29.79 2.78 -0.02 [Fe/H] 1.93 125.3210

After removing the columns that won't be necessary for our calculations, we can move on to the next step of tidying up our data: removing the rows that have missing data. We don't have to delete every row that is missing a value, because some columns are interchangeable. An example is the radius or mass of a planet, which is recorded both in Earth Radius/Mass units and in Jupiter Radius/Mass units. We can use either value in a formula, since in the end we can convert it to whatever radius/mass units we need. We also want to keep as many rows as we can so that we get a better analysis.
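
The idea of interchangeable columns can be sketched as follows: take the Earth-unit radius when it exists and fall back to the Jupiter-unit radius otherwise. The three rows are made up for illustration, and the constants are the Earth and Jupiter radii in kilometers used later in this tutorial.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371
JUPITER_RADIUS_KM = 69911

# Made-up rows: the second has only a Jupiter-unit radius, the third has neither
sample = pd.DataFrame({
    'pl_rade': [1.910, np.nan, np.nan],
    'pl_radj': [0.170, 1.440, np.nan],
})

# Prefer the Earth-unit value and fall back to the Jupiter-unit value when it is missing
radius_km = (sample['pl_rade'] * EARTH_RADIUS_KM).fillna(sample['pl_radj'] * JUPITER_RADIUS_KM)
print(radius_km)
```

Only the third row ends up without a radius, which is exactly the kind of row we need to remove.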

To get more information regarding the data that we are using, check out this website:

  1. http://exoplanetarchive.ipac.caltech.edu
In [4]:
# Builds a mask of the rows that lack the minimum data we need and then drops those rows
missing_required = (
    # Neither radius value (Earth or Jupiter units) was recorded for the planet
    (data['pl_rade'].isna() & data['pl_radj'].isna())

    # Neither mass value (Earth or Jupiter units) was recorded for the planet
    | (data['pl_bmasse'].isna() & data['pl_bmassj'].isna())

    # The Stellar Effective Temperature of the host star is unknown
    | data['st_teff'].isna()

    # The radius of the host star is unknown
    | data['st_rad'].isna()

    # The mass of the host star is unknown
    | data['st_mass'].isna()
)
data = data[~missing_required].copy()

data.head()
Out[4]:
pl_name hostname disc_year pl_orbper pl_orbsmax pl_rade pl_radj pl_bmasse pl_bmassj pl_bmassprov pl_eqt st_teff st_rad st_mass st_met st_metratio st_logg sy_dist
loc_rowid
23 1RXS J160929.1-210524 b 1RXS J160929.1-210524 2008 NaN 330.00000 18.647 1.664 2543.000 8.00000 Mass 1800.0 4060.0 1.35 0.85 NaN NaN NaN 139.1350
31 2MASS J02192210-3925225 b 2MASS J02192210-3925225 2015 NaN 156.00000 16.141 1.440 4417.837 13.90000 Mass NaN 3064.0 0.28 0.11 NaN NaN 4.59 NaN
36 2MASS J21402931+1625183 A b 2MASS J21402931+1625183 A 2009 7336.500000 NaN 10.310 0.920 6657.480 20.95000 Mass 2075.0 2300.0 0.12 0.08 NaN NaN NaN NaN
82 55 Cnc e 55 Cnc 2004 0.736539 0.01544 1.910 0.170 8.080 0.02542 Mass NaN 5250.0 0.96 0.90 0.35 [M/H] 4.42 12.5855
83 55 Cnc e 55 Cnc 2004 0.736546 0.01583 2.173 0.194 8.370 0.02600 Mass NaN 5250.0 0.96 0.90 0.35 [M/H] 4.42 12.5855

Now that we have cleaned most of the DataFrame up, we can see that the loc_rowid values no longer count up consecutively, so we should now reset the index using a simple pandas function called .reset_index. By default this function moves loc_rowid into its own column and replaces it with a fresh 0, 1, 2, ... index. Since loc_rowid no longer serves a purpose at that point, we pass drop=True so that it is removed altogether instead of being kept as a column.
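
The difference between the default behavior of reset_index and drop=True can be seen on a tiny made-up DataFrame (the planet names here are placeholders):

```python
import pandas as pd

# Made-up frame whose index has gaps, as happens after rows are dropped
sample = pd.DataFrame({'pl_name': ['a', 'b', 'c']}, index=[23, 31, 82])
sample.index.name = 'loc_rowid'

kept = sample.reset_index()              # the old index becomes a regular column
dropped = sample.reset_index(drop=True)  # the old index is discarded

print(list(kept.columns))     # ['loc_rowid', 'pl_name']
print(list(dropped.columns))  # ['pl_name']
print(list(dropped.index))    # [0, 1, 2]
```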

In [5]:
# This resets the indices to 0, 1, 2, ... and discards the old loc_rowid index (drop=True)
data.reset_index(inplace=True, drop=True)
data.head()
Out[5]:
pl_name hostname disc_year pl_orbper pl_orbsmax pl_rade pl_radj pl_bmasse pl_bmassj pl_bmassprov pl_eqt st_teff st_rad st_mass st_met st_metratio st_logg sy_dist
0 1RXS J160929.1-210524 b 1RXS J160929.1-210524 2008 NaN 330.00000 18.647 1.664 2543.000 8.00000 Mass 1800.0 4060.0 1.35 0.85 NaN NaN NaN 139.1350
1 2MASS J02192210-3925225 b 2MASS J02192210-3925225 2015 NaN 156.00000 16.141 1.440 4417.837 13.90000 Mass NaN 3064.0 0.28 0.11 NaN NaN 4.59 NaN
2 2MASS J21402931+1625183 A b 2MASS J21402931+1625183 A 2009 7336.500000 NaN 10.310 0.920 6657.480 20.95000 Mass 2075.0 2300.0 0.12 0.08 NaN NaN NaN NaN
3 55 Cnc e 55 Cnc 2004 0.736539 0.01544 1.910 0.170 8.080 0.02542 Mass NaN 5250.0 0.96 0.90 0.35 [M/H] 4.42 12.5855
4 55 Cnc e 55 Cnc 2004 0.736546 0.01583 2.173 0.194 8.370 0.02600 Mass NaN 5250.0 0.96 0.90 0.35 [M/H] 4.42 12.5855

Great! Now all our rows are numbered consecutively again and we only have the columns that we need to perform an analysis. The next step is to take the columns whose units are Earth Radius, Earth Mass, Jupiter Radius, and Jupiter Mass and convert them into units that are easier to use in the calculations we will do later. We will convert these values into kilometers (km) and kilograms (kg).

If you would like to learn more about the Astronomical System of Units then we recommend this website for a read:

  1. https://en.wikipedia.org/wiki/Astronomical_system_of_units

Now that we know what we are going to do we need to figure out how we are going to accomplish this. Based on a quick search we know:

  1. Earth Mass = 5.972 * 10 ^ 24 kg
  2. Earth Radius = 6,371 km
  3. Jupiter Mass = 1.898 * 10 ^ 27 kg
  4. Jupiter Radius = 69,911 km
  5. Solar Mass = 1.989 * 10 ^ 30 kg
  6. Solar Radius = 6.9 * 10 ^ 5 km (rounded; the nominal value is about 6.957 * 10 ^ 5 km)

With this information we can go through the columns and multiply them by these values to get the masses in kilograms and the radii in kilometers, for both the planets and their host stars.
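
As a worked example of this conversion, take 55 Cnc e, which the table above lists at 8.08 Earth Masses and 1.91 Earth Radii. The constant names below are ours.

```python
EARTH_MASS_KG = 5.972e24
EARTH_RADIUS_KM = 6371

# 55 Cnc e as listed in the table above: 8.08 Earth Masses and 1.91 Earth Radii
mass_kg = 8.08 * EARTH_MASS_KG
radius_km = 1.91 * EARTH_RADIUS_KM

print(f"{mass_kg:.6e} kg")
print(f"{radius_km:.2f} km")
```

The results, about 4.825 * 10 ^ 25 kg and about 12,169 km, are what we should expect to see in the pl_bmasse and pl_rade columns for this planet once the whole DataFrame has been converted.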

Before we convert all the masses and radii to SI units (kilograms and kilometers), we want to check to see which exoplanets are most similar to either Earth or Jupiter in terms of both mass and radius (this will be used in a later section of the project).

A Super Earth is classified as a planet that is between twice the size of Earth and up to 10 times the mass of Earth. For our classification we decided to slightly widen this range (mass up to 15 Earth Masses and radius up to 7 Earth Radii) so that we could capture exoplanets that are slightly larger than a Super Earth but still have the right proportions to be considered Earth Like. This means that to determine whether a planet is more similar to Earth or Jupiter we must check both the recorded mass and the recorded radius and see whether the exoplanet falls within this widened Super Earth range before we decide that it is more similar to Jupiter.

To learn more about what Super Earths are, check out this website:

  1. https://exoplanets.nasa.gov/what-is-an-exoplanet/planet-types/super-earth/

In the code segment below, we will look at the columns holding the planet's mass and radius in Earth and Jupiter units. If the values are within our Super Earth range we classify the exoplanet as similar to Earth; otherwise we classify it as similar to Jupiter. If the value in either set of columns is missing (NaN), we assume that the exoplanet is most similar to the planet whose units do have a value.

In [6]:
# This is a helper function to determine whether the exoplanet is closer to an Earth
# type planet or a Jupiter type planet. If the mass in Earth Masses is less than or equal to 15
# and the radius in Earth Radii is less than or equal to 7 then the exoplanet is more Earth like.
# jMass and jRad are accepted for symmetry but are not used by the current rule.
def closerTo(eMass, eRad, jMass, jRad):

    if (eMass <= 15) and (eRad <= 7):
        return 'Earth'
    else:
        return 'Jupiter'

# Creates column to hold the most similar planet (Earth or Jupiter)
data["Similar to"] = "Earth"

# Loops through each row in order to determine which planet the exoplanet is more similar to
for i, row in data.iterrows():

    # If exoplanet doesn't have data for Earth mass or radius then we say it's closer to Jupiter.
    # Missing values are NaN floats, so they are tested with pd.isna rather than == "nan".
    if pd.isna(data.at[i, 'pl_bmasse']) or pd.isna(data.at[i, 'pl_rade']):
        data.at[i, 'Similar to'] = "Jupiter"

    # If exoplanet doesn't have data for Jupiter mass or radius then we say it's closer to Earth
    elif pd.isna(data.at[i, 'pl_bmassj']) or pd.isna(data.at[i, 'pl_radj']):
        data.at[i, "Similar to"] = "Earth"

    # If exoplanet has data for both Jupiter and Earth we use the function above to
    # determine which planet it is most similar to and edit the column based on the results
    else:
        eMass = data.at[i, 'pl_bmasse']
        eRad = data.at[i, 'pl_rade']
        jMass = data.at[i, 'pl_bmassj']
        jRad = data.at[i, 'pl_radj']

        data.at[i, "Similar to"] = closerTo(eMass, eRad, jMass, jRad)
    
data.head()
Out[6]:
pl_name hostname disc_year pl_orbper pl_orbsmax pl_rade pl_radj pl_bmasse pl_bmassj pl_bmassprov pl_eqt st_teff st_rad st_mass st_met st_metratio st_logg sy_dist Similar to
0 1RXS J160929.1-210524 b 1RXS J160929.1-210524 2008 NaN 330.00000 18.647 1.664 2543.000 8.00000 Mass 1800.0 4060.0 1.35 0.85 NaN NaN NaN 139.1350 Jupiter
1 2MASS J02192210-3925225 b 2MASS J02192210-3925225 2015 NaN 156.00000 16.141 1.440 4417.837 13.90000 Mass NaN 3064.0 0.28 0.11 NaN NaN 4.59 NaN Jupiter
2 2MASS J21402931+1625183 A b 2MASS J21402931+1625183 A 2009 7336.500000 NaN 10.310 0.920 6657.480 20.95000 Mass 2075.0 2300.0 0.12 0.08 NaN NaN NaN NaN Jupiter
3 55 Cnc e 55 Cnc 2004 0.736539 0.01544 1.910 0.170 8.080 0.02542 Mass NaN 5250.0 0.96 0.90 0.35 [M/H] 4.42 12.5855 Earth
4 55 Cnc e 55 Cnc 2004 0.736546 0.01583 2.173 0.194 8.370 0.02600 Mass NaN 5250.0 0.96 0.90 0.35 [M/H] 4.42 12.5855 Earth
In [7]:
# A quick peek into how many of the exoplanets in our data set are most similar to Earth and how many are more similar to Jupiter
similar_counts = data['Similar to'].value_counts()
earth_counter = similar_counts.get('Earth', 0)
jupiter_counter = similar_counts.get('Jupiter', 0)

print('Number of counted Earth-Like: {}'.format(earth_counter))
print('Number of counted Jupiter-Like: {}'.format(jupiter_counter))
Number of counted Earth-Like: 262
Number of counted Jupiter-Like: 1287

Based on our definition of 'Earth-Like', 262 of the 1549 exoplanets qualify, judging only by their mass and radius compared to Earth's values. That is about 17% of the exoplanets, and since size is only the first requirement, an even smaller share of them will have the potential for life.

In [8]:
eMass = 5.972 * (10**24)   # Earth mass in kg
eRad = 6371                # Earth radius in km
jMass = 1.898 * (10**27)   # Jupiter mass in kg
jRad = 69911               # Jupiter radius in km
sMass = 1.989 * (10**30)   # Solar mass in kg
sRad = 6.9 * (10**5)       # Solar radius in km (rounded)

# Convert Earth, Jupiter, and Solar masses and radii to kilograms and kilometers.
# Missing values are NaN, and NaN times a constant stays NaN, so no explicit null check is needed.
data['pl_bmasse'] = data['pl_bmasse'] * eMass
data['pl_rade'] = data['pl_rade'] * eRad
data['pl_bmassj'] = data['pl_bmassj'] * jMass
data['pl_radj'] = data['pl_radj'] * jRad
data['st_mass'] = data['st_mass'] * sMass
data['st_rad'] = data['st_rad'] * sRad

data.head()
Out[8]:
pl_name hostname disc_year pl_orbper pl_orbsmax pl_rade pl_radj pl_bmasse pl_bmassj pl_bmassprov pl_eqt st_teff st_rad st_mass st_met st_metratio st_logg sy_dist Similar to
0 1RXS J160929.1-210524 b 1RXS J160929.1-210524 2008 NaN 330.00000 118800.037 116331.904 1.518680e+28 1.518400e+28 Mass 1800.0 4060.0 931500.0 1.690650e+30 NaN NaN NaN 139.1350 Jupiter
1 2MASS J02192210-3925225 b 2MASS J02192210-3925225 2015 NaN 156.00000 102834.311 100671.840 2.638332e+28 2.638220e+28 Mass NaN 3064.0 193200.0 2.187900e+29 NaN NaN 4.59 NaN Jupiter
2 2MASS J21402931+1625183 A b 2MASS J21402931+1625183 A 2009 7336.500000 NaN 65685.010 64318.120 3.975847e+28 3.976310e+28 Mass 2075.0 2300.0 82800.0 1.591200e+29 NaN NaN NaN NaN Jupiter
3 55 Cnc e 55 Cnc 2004 0.736539 0.01544 12168.610 11884.870 4.825376e+25 4.824716e+25 Mass NaN 5250.0 662400.0 1.790100e+30 0.35 [M/H] 4.42 12.5855 Earth
4 55 Cnc e 55 Cnc 2004 0.736546 0.01583 13844.183 13562.734 4.998564e+25 4.934800e+25 Mass NaN 5250.0 662400.0 1.790100e+30 0.35 [M/H] 4.42 12.5855 Earth

Awesome! Now that we have converted the columns, we can begin to calculate some values that will help us better determine whether an exoplanet is potentially habitable and could possibly contain life.

The columns that we have left are:

  1. pl_name: Planet Name
  2. hostname: Host Name
  3. disc_year: Discovery Year
  4. pl_orbper: Orbital Period [days]
  5. pl_orbsmax: Orbit Semi-Major Axis [AU] (distance from their host star)
  6. pl_rade: Planet Radius [km], converted from Earth Radius
  7. pl_radj: Planet Radius [km], converted from Jupiter Radius
  8. pl_bmasse: Planet Mass [kg], converted from Earth Mass
  9. pl_bmassj: Planet Mass [kg], converted from Jupiter Mass
  10. pl_bmassprov: Planet Mass Provenance
  11. pl_eqt: Equilibrium Temperature [K]
  12. st_teff: Stellar Effective Temperature [K]
  13. st_mass: Stellar Mass [kg], converted from Solar Mass
  14. st_rad: Stellar Radius [km], converted from Solar Radius
  15. st_met: Stellar Metallicity [dex]
  16. st_metratio: Stellar Metallicity Ratio
  17. st_logg: Stellar Surface Gravity
  18. sy_dist: Distance to the planetary system

3. Exploratory Analysis and Data Visualization

After going through all the data and successfully tidying it up so that we can perform an analysis, we come to the next step: Exploratory Analysis and Data Visualization. In this step we want to show some potential trends in our data set and perform some statistical analyses in order to gain a deeper understanding of those trends.

First we want to start by calculating where the frost line of each host star is. The Frost Line, also known as the snow line or ice line, is the boundary in a solar system beyond which it is cold enough for volatile compounds such as water, ammonia, methane, carbon dioxide, and carbon monoxide to condense into solids. Unless some other source of heat keeps the planet warm, lying beyond this line is not a good sign for life.

To learn more about Frost Line, we recommend to check out this website:

  1. https://en.wikipedia.org/wiki/Frost_line_(astrophysics)#:~:text=In%20astronomy%20or%20planetary%20science,condense%20into%20solid%20ice%20grains.

In order to calculate the Frost Line we are going to use the formula:

\begin{equation*} T(frost)^4 = \frac{R^2 * T^4}{4 * r(frost)^2} \end{equation*}

Now that we have this equation the next step is to rework it in order to solve for r(frost) since this would show us where the frost line is located.

After a quick rework of the original formula we get:

\begin{equation*} r(frost) = \sqrt[2]{\frac{R^2 * T^4}{4 * T(frost)^4}} \end{equation*}

Great! We have the equation we need to solve for the frost line. Now we need to figure out what each element means:

  1. R and T are values belonging to the Host Star (Host Radius and Host Effective Temperature); they appear in the formula as R^2 and T^4
  2. T(frost), this value will be set to 150 K, which represents the boundary between the Inner and Outer System (average between H2O (180 K) and NH3 (130 K))

After calculating this value for each row we will also calculate whether or not the planet in question is inside the frost line defined or outside.
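
To get a feel for the numbers, here is a minimal sketch that evaluates the reworked formula for a single star. The function name frost_line_au is ours, and the solar radius and temperature are assumed round values for the Sun rather than values from the dataset; the second call uses the converted radius and temperature listed for 55 Cnc.

```python
import math

KM_PER_AU = 1.496e8  # approximate length of one astronomical unit in kilometers

def frost_line_au(star_radius_km, star_temp_k, t_frost_k=150.0):
    """Return r_frost = sqrt(R^2 * T^4 / (4 * T_frost^4)), converted from km to AU."""
    r_frost_km = math.sqrt((star_radius_km**2 * star_temp_k**4) / (4 * t_frost_k**4))
    return r_frost_km / KM_PER_AU

# Assumed solar values (not from the dataset): R of about 6.957e5 km, T of about 5772 K
sun_frost = frost_line_au(6.957e5, 5772)
print(f"Sun: {sun_frost:.2f} AU")

# 55 Cnc, using the converted stellar radius (662400 km) and temperature (5250 K) from the table
cnc_frost = frost_line_au(662400, 5250)
print(f"55 Cnc: {cnc_frost:.2f} AU")
```

Under these assumptions the Sun's frost line comes out at roughly 3.4 AU, which falls between the orbits of Mars and Jupiter, so the 150 K boundary gives a plausible split between the inner and outer Solar System.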

The following tools were used for Exploratory Analysis and Data Visualization:

  1. matplotlib
  2. numpy
In [9]:
km_to_AU = 6.68 * (10**-9)   # one kilometer expressed in astronomical units (AU)
t_frost = 150                # frost line temperature in kelvin

# Carries out the frost line equation from the markdown cell above for every host star.
# st_rad is in kilometers, so the result is in kilometers and is then converted to AU
# so that it can be compared with pl_orbsmax.
stellar_radius = data['st_rad']
stellar_temperature = data['st_teff']
r_frost = np.sqrt((stellar_radius**2 * stellar_temperature**4) / (4 * t_frost**4))
data['frost_line'] = r_frost * km_to_AU

# After calculating where the frost line is located we also add a column that indicates whether the
# exoplanet orbits inside the frost line, in order to determine if it is more likely a Terrestrial
# Exoplanet or a Gas / Ice Giant. Planets with no recorded semi-major axis are labelled 'outside'.
data['placement'] = np.where(data['frost_line'] > data['pl_orbsmax'], 'inside', 'outside')

data.head()
Out[9]:
pl_name hostname disc_year pl_orbper pl_orbsmax pl_rade pl_radj pl_bmasse pl_bmassj pl_bmassprov ... st_teff st_rad st_mass st_met st_metratio st_logg sy_dist Similar to frost_line placement
0 1RXS J160929.1-210524 b 1RXS J160929.1-210524 2008 NaN 330.00000 118800.037 116331.904 1.518680e+28 1.518400e+28 Mass ... 4060.0 931500.0 1.690650e+30 NaN NaN NaN 139.1350 Jupiter 2.27929 outside
1 2MASS J02192210-3925225 b 2MASS J02192210-3925225 2015 NaN 156.00000 102834.311 100671.840 2.638332e+28 2.638220e+28 Mass ... 3064.0 193200.0 2.187900e+29 NaN NaN 4.59 NaN Jupiter 0.269246 outside
2 2MASS J21402931+1625183 A b 2MASS J21402931+1625183 A 2009 7336.500000 NaN 65685.010 64318.120 3.975847e+28 3.976310e+28 Mass ... 2300.0 82800.0 1.591200e+29 NaN NaN NaN NaN Jupiter 0.0650204 outside
3 55 Cnc e 55 Cnc 2004 0.736539 0.01544 12168.610 11884.870 4.825376e+25 4.824716e+25 Mass ... 5250.0 662400.0 1.790100e+30 0.35 [M/H] 4.42 12.5855 Earth 2.71021 inside
4 55 Cnc e 55 Cnc 2004 0.736546 0.01583 13844.183 13562.734 4.998564e+25 4.934800e+25 Mass ... 5250.0 662400.0 1.790100e+30 0.35 [M/H] 4.42 12.5855 Earth 2.71021 inside

5 rows × 21 columns

Awesome! With the Frost Line calculated we can now tell whether a planet will most likely be a terrestrial exoplanet or a gas/ice giant. Let's make a quick graph to see how many exoplanets are inside the frost line and how many are outside.

In [10]:
inside_frost_line = 0
outside_frost_line = 0
labels = ['outside the frost line', 'inside the frost line']

# Iterates through the data and counts the number of exoplanets within and beyond the frost line
for index, row in data.iterrows():
    if data.at[index, 'placement'] == 'outside':
        outside_frost_line += 1
    else:
        inside_frost_line += 1

plot.bar(labels, [outside_frost_line, inside_frost_line])
plot.ylabel("Number of Exoplanets")
plot.title("Outside the Frost Line vs Inside the Frost Line")
plot.show()
    

Whoa, this is unexpected: a lot of the exoplanets in our data are within the frost line, which improves the odds of finding conditions suitable for life. Now let's see whether the planets within the frost line are more similar to Earth or to Jupiter.

In [11]:
earthlike_inside = 0
jupiterlike_outside = 0

# Iterates through the data in order to count the exoplanets that are both Earth like and inside the Frost Line;
# every other exoplanet (Jupiter like and/or outside the Frost Line) goes into the second counter
for index, row in data.iterrows():
    if ((data.at[index, 'placement'] == 'inside') & (data.at[index, 'Similar to'] == 'Earth')):
        earthlike_inside += 1
    else:
        jupiterlike_outside += 1

print(('Earth Like and inside Frost Line: {}').format(earthlike_inside))
print(('Jupiter Like and/or outside Frost Line: {}').format(jupiterlike_outside))
Earth Like and inside Frost Line: 156
Jupiter Like and/or outside Frost Line: 1393

That's great! We can now see that, of the exoplanets that lie within the frost line, 156 are also Earth Like. You may be wondering how a planet could be Jupiter like and still be within the frost line. The answer is that they are Hot Jupiters: planets that are like Jupiter but have drifted closer to their host star. It is a truly amazing topic to learn more about (https://en.wikipedia.org/wiki/Hot_Jupiter).

With this new view and calculation, we can see how many planets are likely terrestrial planets versus Gas / Ice Giants, which gives us a better understanding of which and how many exoplanets could harbor life. Before we are quick to judge, though, we should look at the Actual Temperature of the planets, because even if a planet is beyond the frost line, it still has the possibility to harbor some type of life if enough heat is generated to keep liquids on its surface.

There is also a value called Effective Temperature; however, this does not take the Greenhouse Effect into account.
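
All four combinations of planet type and placement, including Hot Jupiters (Jupiter like and inside), can be counted at once with pandas crosstab. The five rows below are made up purely to show the shape of the result; they are not taken from our data.

```python
import pandas as pd

# Made-up labels for five planets, using the same two column names as our DataFrame
sample = pd.DataFrame({
    'Similar to': ['Earth', 'Earth', 'Jupiter', 'Jupiter', 'Jupiter'],
    'placement':  ['inside', 'outside', 'inside', 'inside', 'outside'],
})

# Rows: planet type; columns: position relative to the frost line
table = pd.crosstab(sample['Similar to'], sample['placement'])
print(table)
```

Running the same call on the full DataFrame would give the Earth Like and inside count together with the other three cells in a single table.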

To learn more about the difference between Effective Temperature and Actual Temperature, we recommend this website:

  1. https://www.physicsforums.com/threads/difference-between-effective-temp-and-actual-surface-temp-is-due-to-what.494651/

To give an example of a good Actual Temperature: Earth's Effective Temperature is 252 K and its Actual Temperature is 288 K (taking the Greenhouse Effect into account).

To calculate the Actual Temperature you can use the formula:

\begin{equation*} T(actual) = \sqrt[4]{\frac{(1 + \gamma) * (1 - a)}{4} * \frac{R^2 * T^4}{r(orbit)^2}} \end{equation*}

What does each variable represent?

  1. (1 + gamma) accounts for the fraction of light (gamma) that is sent back to the surface after already being reflected from the planet
  2. (1 - a) represents the amount of heat that is absorbed by the planet (a being the albedo or reflectiveness of a planet)
  3. R^2 and T^4, like before, come from the radius and temperature of the Host Star that the exoplanet revolves around
  4. r(orbit) is the distance between the exoplanet and its Host Star
  5. The division by 4 comes from the fact that exoplanets absorb light like a disk but emit heat like a sphere
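
To make the formula concrete, here is a minimal sketch that evaluates it for Earth. Every input is an assumption of ours rather than a value from the dataset: rough figures for the Sun's radius and temperature, Earth's orbital distance, an albedo of 0.3, and a gamma of 0.63 picked only to illustrate the greenhouse term.

```python
def actual_temperature_k(star_radius_km, star_temp_k, orbit_km, albedo, gamma):
    """Evaluate T = [((1 + gamma) * (1 - a) / 4) * R^2 * T^4 / r^2] ^ (1/4)."""
    factor = (1 + gamma) * (1 - albedo) / 4
    return (factor * star_radius_km**2 * star_temp_k**4 / orbit_km**2) ** 0.25

# Assumed values: Sun R of about 6.957e5 km and T of about 5772 K, Earth orbit of about 1.496e8 km
no_greenhouse = actual_temperature_k(6.957e5, 5772, 1.496e8, albedo=0.3, gamma=0.0)
with_greenhouse = actual_temperature_k(6.957e5, 5772, 1.496e8, albedo=0.3, gamma=0.63)

print(round(no_greenhouse), round(with_greenhouse))
```

With gamma set to 0 the result is close to the Effective Temperature quoted above (the exact figure depends on the albedo assumed), and with the illustrative gamma it lands near 288 K.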

Unfortunately, we do not have the 'gamma' and 'a' values of the exoplanets. However, we do have a column called 'pl_eqt', which stands for Equilibrium Temperature. This is similar to Actual Temperature, but it is a theoretical temperature: the temperature an exoplanet would reach if it were heated only by its host star.

To read more about this topic, visit this site:

  1. https://en.wikipedia.org/wiki/Planetary_equilibrium_temperature

Since we want to see how many of our exoplanets have a decent 'Actual Temperature' but cannot compute it directly, we will use the Equilibrium Temperature as a stand-in and plot the Stellar Effective Temperature against each exoplanet's Equilibrium Temperature.

In [12]:
# Keeps only the rows where both temperatures were recorded. Missing values are NaN floats,
# so they are removed with dropna rather than compared against the string 'nan'.
temps = data[['pl_eqt', 'st_teff']].dropna()
equilibrium_temp = temps['pl_eqt'].tolist()
stellar_temp = temps['st_teff'].tolist()

plot.scatter(equilibrium_temp, stellar_temp)
plot.title('Stellar Effective Temperature vs Planetary Equilibrium Temperature')
plot.xlabel('Equilibrium Temperature (K)')
plot.ylabel('Stellar Effective Temperature (K)')
plot.show()

Wow! With this data we can see that the Equilibrium Temperature of an exoplanet tends to be lower when the Stellar Effective Temperature is lower. This makes sense: a cooler host star radiates less heat, and unless the exoplanet has a good gamma value and a relatively small albedo ('a') value, it cannot hold on to much of the heat it receives. On the other hand, an exoplanet with a low albedo and a good gamma value may be able to 'amplify' the heat that is radiated towards it, which would allow an exoplanet orbiting a cooler star to be warm enough for potential life forms.

Now that we have seen the Equilibrium Temperature of the exoplanets compared to their Host Star's Stellar Effective Temperature, let's break the data down some more. First, let's see how the Equilibrium Temperature of the exoplanets compares to their radius.

In [13]:
equilibrium_temp = []
radius_exo = []

for index, row in data.iterrows():
    # Skip rows that are missing either temperature (missing values are float NaN, not the string 'nan')
    if pd.notna(row['pl_eqt']) and pd.notna(row['st_teff']):
        equilibrium_temp.append(row['pl_eqt'])
        # Earth-like exoplanets use the radius from the Earth column, all others use the Jupiter column
        if row['Similar to'] == 'Earth':
            radius_exo.append(row['pl_rade'])
        else:
            radius_exo.append(row['pl_radj'])

plot.scatter(equilibrium_temp, radius_exo)
plot.title('Exoplanet Radius vs Exoplanet Equilibrium Temperature')
plot.xlabel('Equilibrium Temperature')
plot.ylabel('Exoplanet Radius')
plot.show()

Based on this graph we can see that exoplanets with high temperatures also tend to have a large radius, which could mean that they are similar to Jupiter, where the planet may actually produce some of its own heat internally. We can also see that a lot of exoplanets have relatively high temperatures, which basically rules them out as places for life, since life can only form in moderate conditions and an excessively high planet temperature does not help life progress in the slightest.

Now that we have visualized some of the data, both the values we calculated and the ones already in the data table, we want to calculate where the Habitable Zone is for each solar system. The Habitable Zone, or the Circumstellar Habitable Zone, is the range of orbits around a star where a planet would be able to support liquid water given sufficient atmospheric pressure.

To learn more about Habitable Zones, check out this link:

  1. https://en.wikipedia.org/wiki/Circumstellar_habitable_zone

With the Habitable Zone we can see which exoplanets are capable of having liquid water and in turn may have potential life roaming around on the planet.

To calculate the Habitable Zone we can utilize the formula:

\begin{equation*} T(actual)^4 = \frac{(1 + gamma) * (1 - a)}{4} * \frac{R^2 * T^4}{r(orbit)^2} \end{equation*}

Since we want to solve for r(orbit) we can rearrange the formula to take that into account. We are then left with the final formula:

\begin{equation*} r(orbit) = \sqrt[2]{\frac{(1 + gamma) * (1 - a)}{4} * \frac{R^2 * T^4}{T(actual)^4}} \end{equation*}

For these calculations we can be much more flexible regarding the 'a' and 'gamma' values, so we can actually use the formula to get the Habitable Zone range. Since we know that Earth is in the Habitable Zone, we borrow its albedo and gamma values and assume that every planet within a habitable zone has the same ones. This is a simplification, but it allows for easier calculations.

  1. Earth's albedo = 0.3
  2. Earth's gamma = 0.6

For T(actual) we can substitute in the temperatures at which water boils and freezes. These mark the borders of the Habitable Zone, since we want liquid water on the planet and not steam or ice.

  1. Water Freezes at 273.15 Kelvin
  2. Water Boils at 373.1 Kelvin

Now that we know what we are solving for and which values to plug in, and since the remaining values depend only on the Host Star, we can calculate the Habitable Zone range for each exoplanet's Host Star.
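To get a feel for the numbers, the rearranged formula can be tried on the Sun first. This is a rough sketch rather than the exact code used below: the function name is hypothetical, and the solar radius and temperature are approximate values assumed here.

```python
import math

KM_PER_AU = 1.496e8   # kilometers in one astronomical unit

def habitable_zone_au(star_radius_km, star_temp_k, albedo=0.3, gamma=0.6,
                      t_freeze=273.15, t_boil=373.1):
    """Return the (inner, outer) edges of the habitable zone in AU."""
    factor = (1 + gamma) * (1 - albedo) / 4
    stellar_term = factor * star_radius_km ** 2 * star_temp_k ** 4
    # The boiling point sets the inner edge, the freezing point the outer edge
    inner_km = math.sqrt(stellar_term / t_boil ** 4)
    outer_km = math.sqrt(stellar_term / t_freeze ** 4)
    return inner_km / KM_PER_AU, outer_km / KM_PER_AU

# Approximate solar radius (km) and effective temperature (K), assumed for illustration
inner, outer = habitable_zone_au(6.957e5, 5772)
print(round(inner, 2), round(outer, 2))
```

Under these assumptions the Sun's zone runs from roughly 0.6 AU to 1.1 AU, so Earth at 1 AU falls inside it, which is what we would expect given that Earth's own albedo and gamma values were used.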

In [14]:
# Creates columns for the inner and outer edges of the habitable zone (in AU)
data['Inner_Habitable_Zone'] = 'nan'
data['Outer_Habitable_Zone'] = 'nan'
km_per_AU = 1.496 * (10**8)   # kilometers in one astronomical unit

gamma = 0.6
albedo = 0.3
water_freezes = 273.15   # Kelvin
water_boils = 373.1      # Kelvin

# This factor does not depend on the star, so it is computed once
constant = ((1 + gamma)*(1-albedo)) / 4

# Calculates the boundaries of the habitable zone for each host star and adds the values to the data set
for index, row in data.iterrows():
    # R^2 * T^4 for the host star
    stellar_term = np.power(data.at[index, 'st_rad'], 2) * np.power(data.at[index, 'st_teff'], 4)
    # The boiling point gives the inner edge and the freezing point gives the outer edge
    data.at[index, 'Inner_Habitable_Zone'] = np.power(constant * stellar_term / np.power(water_boils, 4), (1/2)) / km_per_AU
    data.at[index, 'Outer_Habitable_Zone'] = np.power(constant * stellar_term / np.power(water_freezes, 4), (1/2)) / km_per_AU
data.head()
Out[14]:
pl_name hostname disc_year pl_orbper pl_orbsmax pl_rade pl_radj pl_bmasse pl_bmassj pl_bmassprov ... st_mass st_met st_metratio st_logg sy_dist Similar to frost_line placement Inner_Habitable_Zone Outer_Habitable_Zone
0 1RXS J160929.1-210524 b 1RXS J160929.1-210524 2008 NaN 330.00000 118800.037 116331.904 1.518680e+28 1.518400e+28 Mass ... 1.690650e+30 NaN NaN NaN 139.1350 Jupiter 2.27929 outside 0.39015 0.727914
1 2MASS J02192210-3925225 b 2MASS J02192210-3925225 2015 NaN 156.00000 102834.311 100671.840 2.638332e+28 2.638220e+28 Mass ... 2.187900e+29 NaN NaN 4.59 NaN Jupiter 0.269246 outside 0.0460873 0.0859864
2 2MASS J21402931+1625183 A b 2MASS J21402931+1625183 A 2009 7336.500000 NaN 65685.010 64318.120 3.975847e+28 3.976310e+28 Mass ... 1.591200e+29 NaN NaN NaN NaN Jupiter 0.0650204 outside 0.0111297 0.020765
3 55 Cnc e 55 Cnc 2004 0.736539 0.01544 12168.610 11884.870 4.825376e+25 4.824716e+25 Mass ... 1.790100e+30 0.35 [M/H] 4.42 12.5855 Earth 2.71021 inside 0.463912 0.865534
4 55 Cnc e 55 Cnc 2004 0.736546 0.01583 13844.183 13562.734 4.998564e+25 4.934800e+25 Mass ... 1.790100e+30 0.35 [M/H] 4.42 12.5855 Earth 2.71021 inside 0.463912 0.865534

5 rows × 23 columns

With the Habitable Zone calculated, we can now see if any of the exoplanets actually reside within the zone. Of course, we can add some leeway, because even if an exoplanet is not perfectly in the habitable zone but is near it, it can still have the right temperature due to other factors.
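One way to express "inside the zone, with some leeway" is a small helper that widens both edges by the same margin. The helper name and the 0.1 AU default are illustrative assumptions; the 55 Cnc e values are taken from the table above.

```python
def in_habitable_zone(orbit_au, inner_au, outer_au, leeway_au=0.1):
    """True if the orbit lies between the two edges, each widened by the leeway."""
    return (inner_au - leeway_au) <= orbit_au <= (outer_au + leeway_au)

# 55 Cnc e from the table above: orbit 0.01544 AU, zone 0.463912 to 0.865534 AU
too_close = in_habitable_zone(0.01544, 0.463912, 0.865534)

# Hypothetical orbits around a star whose zone spans 0.59 to 1.10 AU
inside = in_habitable_zone(1.00, 0.59, 1.10)
near_outer_edge = in_habitable_zone(1.15, 0.59, 1.10)   # accepted thanks to the leeway
too_far = in_habitable_zone(2.00, 0.59, 1.10)

print(too_close, inside, near_outer_edge, too_far)
```

A missing orbit (NaN) fails both comparisons, so such a planet is counted as outside the zone.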

In [15]:
inside = 0
outside = 0

# Counts the exoplanets whose orbit lies within the habitable zone, allowing 0.1 AU of leeway on both edges
for index, row in data.iterrows():
    if ((data.at[index, 'pl_orbsmax'] >= data.at[index, 'Inner_Habitable_Zone'] - 0.1) & (data.at[index, 'pl_orbsmax'] <= data.at[index, 'Outer_Habitable_Zone'] + 0.1)):
        inside += 1
    else:
        outside += 1
        
plot.bar(['Inside Habitable Zone', 'Outside Habitable Zone'], [inside, outside])
plot.ylabel("Number of Exoplanets")
plot.title("Inside the Habitable Zone vs Outside the Habitable Zone")
plot.show()

4. Hypothesis Testing and Machine Learning

In the following section, we will use machine learning models such as linear regression and polynomial features to create a predictive model of the data. This predictive model can then be used to predict values outside our current data set.

With our predictive model, we are going to predict the number of exoplanets discovered in years outside of our data range and then predict how many of those exoplanets could be hosting life, based on our assumptions and the data set we are using.

In order to determine how many exoplanets will be discovered in a future year, such as 2021 in this example, we will train the model on the number of exoplanets discovered in each year included in our dataset.

We will start this predictive process by finding the number of exoplanets discovered in each year within our dataset. This is done by looping through our data and counting the number of rows for each discovery year. We store the counts in a dictionary, a key/value pairing with the key being the year and the value being the number of exoplanets found in that year.
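The counting step can be sketched on its own with the standard library. The list of discovery years here is made up purely for illustration; on the real DataFrame, `data['disc_year'].value_counts()` would give the same kind of tally.

```python
from collections import Counter

# Hypothetical discovery years, one entry per exoplanet
disc_years = [1995, 2014, 2014, 2016, 2016, 2016, 2020]

counts = Counter(disc_years)

# Fill in every year of the range so that years without discoveries show up as 0
disc_per_year = {year: counts.get(year, 0) for year in range(1989, 2021)}

print(disc_per_year[2016], disc_per_year[1989])
```

Using `counts.get(year, 0)` also avoids a KeyError for years that never appear in the data.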

The following tools were used for Hypothesis Testing and Machine Learning:

  1. matplotlib
  2. sklearn
  3. seaborn
In [16]:
# List of all the years in our dataset
years = []
# Key/Value pairing
disc_per_year = {}
for i in range(1989, 2021):
    disc_per_year[i] = 0

# Adding to each year in the dictionary according to how many planets were observed in that given year
for i, row in data.iterrows():
    disc_per_year[data.at[i,"disc_year"]] += 1

# Plotting each year against its number of exoplanets and annotating each point with its year for easy comprehension
for year in disc_per_year.keys():
    plot.scatter(year, disc_per_year[year])
    plot.annotate(year, (year, disc_per_year[year]))
# Enlarging the plot area for an easier read
plot.subplots_adjust(top=1, bottom=-1, right=1, left=-1)
# Adding title and axis labels
plot.title("Number of Exoplanets Discovered per Year")
plot.xlabel("Year")
plot.ylabel("Number of Exoplanets Discovered")
plot.show()

As you can see, the number of discovered exoplanets has been increasing over the years; however, it is not an obvious trend. Below, we will explore a possible fit for the data in order to predict future data points as accurately as we can.

The first type of fit we will try is a linear regression, using the sklearn linear_model module. The model will be trained on the number of exoplanets found per year. The linear_model module lets us create a line of best fit through all the data so that we can see whether a linear model is a good overall fit.
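For a single feature such as the year, the line of best fit found by ordinary least squares (the method behind sklearn's LinearRegression) has a simple closed form. The sketch below works it out by hand on made-up counts; the helper name and the data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical discovery counts per year
years = [2016, 2017, 2018, 2019, 2020]
counts = [10, 12, 15, 19, 24]

slope, intercept = fit_line(years, counts)
prediction_2021 = slope * 2021 + intercept
print(slope, prediction_2021)
```

The slope is the average change in discoveries per year, and extending the line one year past the data gives the prediction.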

In [17]:
years = []
num_disc = []

# Adds each year with at least one discovery to the years list and its count to the num_disc list
for y in disc_per_year:
    if disc_per_year[y] != 0:
        years.append(y)
        num_disc.append(disc_per_year[y])

# Plots the observed counts as points
plot.plot(years, num_disc, 'o')


# sklearn expects a 2D list of features, so each year is wrapped in its own list
years2 = [[y] for y in years]


# Creates the linear model and fits it to the years and the number of discoveries
model = linear_model.LinearRegression()
model.fit(years2, num_disc)
predicted = model.predict(years2) 

plot.plot(years, predicted)
plot.title("Number of Exoplanets Discovered per Year")
plot.xlabel("Year")
plot.ylabel("Number of Exoplanets Discovered")
plot.show()

The linear regression line that our model predicts does not fit the data trend very well. We believe this is because of the advancements made in the technology used to discover exoplanets. As technology advances, the ability to find exoplanets increases, which is why there is a spike in our data and why we can expect spikes in future data. Those spikes are short-lived, however, since technology advances at a slow rate. Without a new advancement, the number of exoplanets we can discover with current technology becomes more limited each year, which is where you can see the downward trend of discoveries after an initial spike.

With this in mind, we use our model to predict how many exoplanets will be discovered in a future year, which is the first step toward estimating how many of them could possibly host life. In the following cells, you can see the score of our model as well as its prediction for a future year (2021).

In [18]:
print(model.score(years2, num_disc))
0.32910456846199854

The score printed above is the coefficient of determination (R^2) that sklearn reports for a linear regression. The closer the value is to one, the more of the year-to-year variation the model explains and the more accurately we can expect it to predict future data points. With a score of about 0.33, our model isn't the most accurate, for the reasons explained above.
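To make the score less of a black box, the coefficient of determination can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name and the numbers below are made up for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [10, 12, 15, 19, 24]   # hypothetical discovery counts

# Predictions that match the data exactly give a score of 1
perfect = r_squared(actual, actual)
# Always predicting the mean (16 here) gives a score of 0
mean_only = r_squared(actual, [16] * len(actual))

print(perfect, mean_only)
```

A score near 0.3 therefore means the fitted line does only somewhat better than always predicting the average number of discoveries.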

In [19]:
num_2021_planets = model.predict([[2021]])
print(num_2021_planets)
[122.00089366]

This is the number of exoplanets our model predicts will be discovered in 2021. As you can see, the model predicts that about 122 exoplanets will be found; how many of those could possibly host life is estimated further below.

In [20]:
sns.regplot(x=years, y=num_disc, scatter_kws={'color': 'black'}).set_title('Regression Plot of Number of Exoplanets discovered in specific years')
Out[20]:
Text(0.5, 1.0, 'Regression Plot of Number of Exoplanets discovered in specific years')

The above plot is a regression plot of the number of exoplanets discovered each year. It shows a line of best fit as well as a shaded area. The shaded area is the 95% confidence interval that seaborn draws around the regression line by default, so it shows how uncertain the fitted trend is. As you can see, the general trend in the number of exoplanets discovered each year is positive.

In [21]:
years_disc = {}
exoplanets = {}

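# Creating a dictionary of the exoplanets that fit our criteria and a dictionary counting how many potential candidates for life were found in each year
for index, row in data.iterrows():
    # Our conditions for a potential candidate for life
    # Note: both comparisons below are lower bounds, so this keeps every exoplanet orbiting at or beyond (Outer_Habitable_Zone - 0.1) AU.
    # A strict in-zone test would also require pl_orbsmax <= Outer_Habitable_Zone + 0.1; the results printed below come from the condition as written here.
    inside_habitable_zone = ((data.at[index, 'pl_orbsmax'] >= data.at[index, 'Inner_Habitable_Zone'] - 0.1) & (data.at[index, 'pl_orbsmax'] >= data.at[index, 'Outer_Habitable_Zone'] - 0.1))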
    if ((data.at[index, 'placement'] == 'inside') & inside_habitable_zone & (data.at[index, 'pl_eqt'] <= 800)):
        if data.at[index, "disc_year"] not in years_disc:
            years_disc[data.at[index, "disc_year"]] = 0
        years_disc[data.at[index, 'disc_year']] += 1
        exoplanets[data.at[index, 'pl_name']] = data.at[index, 'hostname']

years = []
num_disc = []

for i in years_disc.keys():
    years.append(i)
    num_disc.append(years_disc[i])

years2 = []
for y in years:
    years2.append([y])

plot.plot(years, num_disc, 'o')

model = linear_model.LinearRegression()
model.fit(years2, num_disc)
predicted = model.predict(years2) 

plot.plot(years, predicted)
plot.title("Number of Potential Life Harboring Exoplanets per Year")
plot.xlabel("Year")
plot.ylabel("Number of Potential Life Harboring Exoplanets Discovered")
plot.show()

Wow! We found 22 exoplanets that fit our criteria for having the potential to harbor life. Our criteria were as follows:

  1. It lies within the frost line, because if it didn't then it most likely would not be a terrestrial planet
  2. It lies within the Habitable Zone that is created by its host star
  3. Its Equilibrium Temperature is less than or equal to 800 Kelvin, which may seem high; however, if the exoplanet is still in its formation stage, it can cool down and become a very likely candidate for life to form

Now let's see who the potential candidates are.

In [22]:
for i in exoplanets.keys():
    print(('Exoplanet: {} Host Star: {}').format(i, exoplanets[i]))
Exoplanet: EPIC 248847494 b Host Star: EPIC 248847494
Exoplanet: GJ 1132 b Host Star: GJ 1132
Exoplanet: K2-18 b Host Star: K2-18
Exoplanet: Kepler-1654 b Host Star: Kepler-1654
Exoplanet: Kepler-1661 b Host Star: Kepler-1661
Exoplanet: Kepler-62 e Host Star: Kepler-62
Exoplanet: Kepler-62 f Host Star: Kepler-62
Exoplanet: L 98-59 b Host Star: L 98-59
Exoplanet: L 98-59 c Host Star: L 98-59
Exoplanet: L 98-59 d Host Star: L 98-59
Exoplanet: LHS 1140 b Host Star: LHS 1140
Exoplanet: LHS 1140 c Host Star: LHS 1140
Exoplanet: LTT 1445 A b Host Star: LTT 1445 A
Exoplanet: LTT 3780 c Host Star: LTT 3780
Exoplanet: TOI-1266 c Host Star: TOI-1266
Exoplanet: TRAPPIST-1 b Host Star: TRAPPIST-1
Exoplanet: TRAPPIST-1 c Host Star: TRAPPIST-1
Exoplanet: TRAPPIST-1 d Host Star: TRAPPIST-1
Exoplanet: TRAPPIST-1 e Host Star: TRAPPIST-1
Exoplanet: TRAPPIST-1 f Host Star: TRAPPIST-1
Exoplanet: TRAPPIST-1 g Host Star: TRAPPIST-1
Exoplanet: WD 1856+534 b Host Star: WD 1856+534
In [23]:
print(model.score(years2, num_disc))
0.2878073770491585

The cell above shows the score (the coefficient of determination, R^2) of our model. The closer the value is to 1, the more accurate we expect our model’s predictions to be. At about 0.29, our model isn’t the most accurate, but it still has a bit of validity to its predictions.

In [24]:
potential_life_2021 = model.predict([[2021]])
print(potential_life_2021)
[4.47131148]

The above cell shows our model’s prediction of the number of potentially life hosting planets in the year 2021 based on the data learned from our dataset. As seen by the output of the cell, our model predicts that between 4 and 5 (roughly 4.5) potentially life harboring exoplanets will be discovered in the year 2021.

In [25]:
print((potential_life_2021 / num_2021_planets) * 100)
[3.66498256]

In the cell above, we calculate the predicted number of potentially life harboring planets found in 2021 as a percentage of the predicted number of exoplanets to be found in 2021. As seen in the output, this comes to approximately 3.7%.
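The arithmetic behind that percentage can be reproduced directly from the two predictions printed earlier. This is just a worked check using those printed values, with variable names reused for readability.

```python
# Values copied from the outputs of the two prediction cells above
potential_life_2021 = 4.47131148
num_2021_planets = 122.00089366

share = potential_life_2021 / num_2021_planets * 100
print(f"{share:.1f}%")
```

Because both inputs are themselves predictions from weakly fitting models, the result is best read as a rough estimate rather than a precise figure.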

5. Observations and Analysis

This is the section in which we summarize what we have done and then draw a conclusion based on our analysis.

Summary

  1. When looking at all the exoplanets recorded with enough data to help determine whether or not the exoplanet has the potential for life, we found that about 17% (262 Earth-like out of 1549 total planets) of the exoplanets found were Earth-like or potentially a Super Earth
  2. We then found that many of the exoplanets were within the frost line created by their host star, which meant that many were terrestrial-like planets
  3. Checking the Equilibrium Temperatures of the exoplanets, we found that the predicted temperatures were relatively high. Since they were predictions, we took the values with a grain of salt and said that if a planet's temperature could plausibly fall into a more suitable range, it was a candidate for life
  4. Afterwards we solved for the habitable zone so that we could see whether or not the exoplanets were in it, since being in the habitable zone greatly increases the odds that an exoplanet has the potential for life
  5. We then found the number of planets with enough data discovered per year and used Linear Regression to create and train a model to predict how many exoplanets would be found in 2021
  6. Continuing to use Linear Regression, we also counted how many potentially habitable exoplanets were discovered each year and then predicted how many potentially habitable exoplanets would be found in 2021
  7. Using both of the predicted values, we then determined what share of the exoplanets found in 2021 would have the potential for life, which came to around 3.7%

Observations based on our analysis and modeling:

Based on the massive number of exoplanets discovered in recent years, we can conclude that the chance of finding an exoplanet that has the ability to harbor life is extremely small but not zero. As we found through our analysis, a small percentage of the exoplanets have the potential to host life based on factors such as their distance from their star and their location relative to the calculated frost line and habitable zone of the star.

If we conducted further analysis, we would use more exoplanet data and expand the factors that we use to define a potentially habitable exoplanet. We would also cross-check our calculations against other databases and similar studies in order to assess whether there is a better approach to classifying and identifying an exoplanet with the potential to host life. Lastly, we would try to find a way to use all of the discovered exoplanets instead of deleting many of them because of missing values, which has the potential to skew our predictions.