Commit 4c083be

Update chapter5.md
The dataset at the URL appears to have been updated, so the code in the guide is modified to stay relevant to the updated dataset.
1 parent cd86aa3 commit 4c083be

File tree

1 file changed (+8, -23 lines)

content/pandas cookbook/chapter5.md

Lines changed: 8 additions & 23 deletions
@@ -64,7 +64,7 @@ To get the data for March 2013, we need to format it with month=3, year=2012.

```python
url = url_template.format(month=3, year=2012)
-weather_mar2012 = pd.read_csv(url, index_col='Date/Time', parse_dates=True)
+weather_mar2012 = pd.read_csv(url, index_col='Date/Time (LST)', parse_dates=True)
```

This is super great! We can just use the same read_csv function as before, and just give it a URL as a filename. Awesome.
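
For reference, a self-contained sketch of that pattern; the `url_template` below is only a placeholder, since the real bulk-data CSV template is defined earlier in the chapter:

```python
import pandas as pd

# Placeholder template -- the actual CSV endpoint is defined earlier in the chapter.
url_template = "https://example.com/bulk_data_e.html?format=csv&Year={year}&Month={month}"

url = url_template.format(month=3, year=2012)
# read_csv accepts a URL anywhere it accepts a filename, so no manual download is needed.
weather_mar2012 = pd.read_csv(url, index_col='Date/Time (LST)', parse_dates=True)
```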
@@ -1604,7 +1604,7 @@ Output:
Let's plot it!

```python
-weather_mar2012[u"Temp (\xc2\xb0C)"].plot(figsize=(15, 5))
+weather_mar2012[u"Temp (°C)"].plot(figsize=(15, 5))
```

Output:
@@ -1617,18 +1617,6 @@ Notice how it goes up to 25° C in the middle there? That was a big deal. It was

And I was out of town and I missed it. Still sad, humans.

-I had to write '\xb0' for that degree character °. Let's fix up the columns. We're going to just print them out, copy, and fix them up by hand.
-
-```python
-weather_mar2012.columns = [
-    u'Year', u'Month', u'Day', u'Time', u'Data Quality', u'Temp (C)',
-    u'Temp Flag', u'Dew Point Temp (C)', u'Dew Point Temp Flag',
-    u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag',
-    u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag',
-    u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill',
-    u'Wind Chill Flag', u'Weather']
-```
-
You'll notice in the summary above that there are a few columns which are either entirely empty or only have a few values in them. Let's get rid of all of those with dropna.

The argument `axis=1` to `dropna` means "drop columns, not rows", and `how='any'` means "drop the column if any value is null".
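
Put together, that cleanup would look roughly like this (a sketch, assuming the March data is already loaded as `weather_mar2012`):

```python
# Drop every column that contains at least one null value.
weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')
weather_mar2012[:5]
```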
@@ -1758,12 +1746,12 @@ Output:
</div>
</div>

-The Year/Month/Day/Time columns are redundant, though, and the Data Quality column doesn't look too useful. Let's get rid of those.
+The Year/Month/Day/Time columns are redundant, though. Let's get rid of those.

The `axis=1` argument means "Drop columns", like before. The default for operations like `dropna` and `drop` is always to operate on rows.

```python
-weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)
+weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time (LST)'], axis=1)
weather_mar2012[:5]
```

@@ -1857,7 +1845,7 @@ Awesome! We now only have the relevant columns, and it's much more manageable.
This one's just for fun -- we've already done this before, using groupby and aggregate! We will learn whether or not it gets colder at night. Well, obviously. But let's do it anyway.

```python
-temperatures = weather_mar2012[[u'Temp (C)']].copy()
+temperatures = weather_mar2012[[u'Temp (°C)']].copy()
print(temperatures.head())
temperatures.loc[:,'Hour'] = weather_mar2012.index.hour
temperatures.groupby('Hour').aggregate(np.median).plot()
@@ -1948,13 +1936,10 @@ I noticed that there's an irritating bug where when I ask for January, it gives

```python
def download_weather_month(year, month):
-    if month == 1:
-        year += 1
    url = url_template.format(year=year, month=month)
-    weather_data = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, header=True)
+    weather_data = pd.read_csv(url, index_col='Date/Time (LST)', parse_dates=True)
    weather_data = weather_data.dropna(axis=1)
-    weather_data.columns = [col.replace('\xb0', '') for col in weather_data.columns]
-    weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
+    weather_data = weather_data.drop(['Year', 'Month', 'Day', 'Time (LST)'], axis=1)
    return weather_data
```

@@ -2050,7 +2035,7 @@ Output:
Now we can get all the months at once. This will take a little while to run.

```python
-data_by_month = [download_weather_month(2012, i) for i in range(1, 13)]
+data_by_month = [download_weather_month(2012, i) for i in range(1, 12)]
```

Once we have this, it's easy to concatenate all the dataframes together into one big dataframe using [pd.concat](http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.concat.html). And now we have the whole year's data!
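
A sketch of that final step, assuming the list built above is still named `data_by_month` (the `weather_2012` name is just illustrative):

```python
import pandas as pd

# Stack the monthly DataFrames on top of each other into one year-long frame,
# keeping the datetime index from each month.
weather_2012 = pd.concat(data_by_month)
weather_2012.head()
```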
