Climate coding challenge, Part 6¶
Getting your own data
There are more Earth Observation data online than any one person could ever look at¶
NASA’s Earth Observing System Data and Information System (EOSDIS) alone manages over 9PB of data. 1 PB is roughly 100 times the entire Library of Congress (a good approximation of all the books available in the US). It’s all available to you once you learn how to download what you want.
Here we’re using the NOAA National Centers for Environmental Information (NCEI) Access Data Service application progamming interface (API) to request data from their web servers. We will be using data collected as part of the Global Historical Climatology Network daily (GHCNd) from their Climate Data Online library program at NOAA.
For this example we’re requesting daily summary data in Boulder, CO (station ID USC00050848).
** Your task:**
- Research the Global Historical Climatology Network - Daily data source.
- In the cell below, write a 2-3 sentence description of the data source.
- Include a citation of the data (HINT: See the ‘Data Citation’ tab on the GHCNd overview page).
Your description should include:
- who takes the data
- where the data were taken
- what the maximum temperature units are
- how the data are collected
YOUR DATA DESCRIPTION AND CITATION HERE 🛎️
Access NCEI GHCNd Data from the internet using its API 🖥️ 📡 🖥️¶
The cell below contains the URL for the data you will use in this part of the notebook. We created this URL by generating what is called an API endpoint using the NCEI API documentation.
Note
An application programming interface (API) is a way for two or more computer programs or components to communicate with each other. It is a type of software interface, offering a service to other pieces of software (Wikipedia).
First things first – you will need to import the pandas
library to
access NCEI data through its URL:
# Import required packages
import pandas as pd
import numpy as np
Your task:
- Pick an expressive variable name for the URL.
- Reformat the URL so that it adheres to the 79-character PEP-8 line limit. You should see two vertical lines in each cell - don’t let your code go past the second line.
- At the end of the cell where you define your url variable, call your variable (type out its name) so it can be tested.
Boulder_data = ('https://www.ncei.noaa.gov/access/services/data/v1'
'?dataset=daily-summaries'
'&dataTypes=TOBS,PRCP'
'&stations=USC00050848'
'&units=standard'
'&startDate=1901-10-01'
'&endDate=2023-09-30')
Boulder_data
'https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=TOBS,PRCP&stations=USC00050848&units=standard&startDate=1901-10-01&endDate=2023-09-30'
Download and get started working with NCEI data¶
Just like you did with the practice data, go ahead and use pandas to import data from your API URL into Python. If you didn’t do it already, you should import the pandas library at the top of this notebook so that others who want to use your code can find it easily.
# Import data into Python from NCEI API
Boulder_df = pd.read_csv(
Boulder_data,
index_col='DATE',
parse_dates=True,
na_values=['NaN'])
Boulder_df
STATION | PRCP | TOBS | |
---|---|---|---|
DATE | |||
1901-10-01 | USC00050848 | 0.0 | 62.0 |
1901-10-02 | USC00050848 | 0.0 | 60.0 |
1901-10-03 | USC00050848 | 0.0 | 57.0 |
1901-10-04 | USC00050848 | 0.0 | 53.0 |
1901-10-05 | USC00050848 | 0.0 | 49.0 |
... | ... | ... | ... |
2023-09-26 | USC00050848 | 0.0 | 74.0 |
2023-09-27 | USC00050848 | 0.0 | 69.0 |
2023-09-28 | USC00050848 | 0.0 | 73.0 |
2023-09-29 | USC00050848 | 0.0 | 66.0 |
2023-09-30 | USC00050848 | 0.0 | 78.0 |
43862 rows × 3 columns
#Choose a different location
DC_data = ('https://www.ncei.noaa.gov/access/services/data/v1'
'?dataset=daily-summaries'
'&dataTypes=TOBS,PRCP'
'&stations=USC00186350'
'&units=standard'
'&startDate=1950-10-01'
'&endDate=2023-09-30')
DC_data
'https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=TOBS,PRCP&stations=USC00186350&units=standard&startDate=1950-10-01&endDate=2023-09-30'
# Import data into Python from NCEI API
DC_df = pd.read_csv(
DC_data,
index_col='DATE',
parse_dates=True,
na_values=['NaN'])
DC_df
STATION | PRCP | TOBS | |
---|---|---|---|
DATE | |||
1950-10-01 | USC00186350 | 0.00 | 80.0 |
1950-10-02 | USC00186350 | 0.00 | 81.0 |
1950-10-03 | USC00186350 | 0.00 | 81.0 |
1950-10-04 | USC00186350 | 0.27 | 61.0 |
1950-10-05 | USC00186350 | 0.00 | 62.0 |
... | ... | ... | ... |
2023-09-26 | USC00186350 | 0.07 | 58.0 |
2023-09-27 | USC00186350 | 0.04 | 58.0 |
2023-09-28 | USC00186350 | 0.00 | 53.0 |
2023-09-29 | USC00186350 | 0.01 | 61.0 |
2023-09-30 | USC00186350 | 0.00 | 66.0 |
25895 rows × 3 columns
# Plot data
DC_df.plot(
y='TOBS',
title='Washington, DC Temperature',
xlabel='Year',
ylabel='Temp ($^\circ$F)',
legend = False)
<Axes: title={'center': 'Washington, DC Temperature'}, xlabel='Year', ylabel='Temp ($^\\circ$F)'>
# Resample to annual
ann_DC_df = DC_df.resample('YS').agg({'PRCP': np.sum, 'TOBS': np.mean})
ann_DC_df
/tmp/ipykernel_30674/2078979801.py:2: FutureWarning: The provided callable <function sum at 0x73932fd37250> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass 'sum' instead. ann_DC_df = DC_df.resample('YS').agg({'PRCP': np.sum, 'TOBS': np.mean}) /tmp/ipykernel_30674/2078979801.py:2: FutureWarning: The provided callable <function mean at 0x73932fd581f0> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass 'mean' instead. ann_DC_df = DC_df.resample('YS').agg({'PRCP': np.sum, 'TOBS': np.mean})
PRCP | TOBS | |
---|---|---|
DATE | ||
1950-01-01 | 9.05 | 53.363636 |
1951-01-01 | 39.90 | 64.176966 |
1952-01-01 | 48.09 | 61.715190 |
1953-01-01 | 20.73 | 67.351695 |
1954-01-01 | 32.56 | 64.362573 |
... | ... | ... |
2019-01-01 | 43.92 | 54.866667 |
2020-01-01 | 63.66 | 55.389831 |
2021-01-01 | 42.55 | 55.108635 |
2022-01-01 | 43.04 | 53.752941 |
2023-01-01 | 27.35 | 56.803922 |
74 rows × 2 columns
# Plot the annual data
ann_DC_df.plot(
y='PRCP',
title='Washington, DC Average Annual Precip',
xlabel='Year',
ylabel='Precip (in)',
legend = False)
<Axes: title={'center': 'Washington, DC Average Annual Precip'}, xlabel='Year', ylabel='Precip (in)'>