This is another blog post about using the Pandas package. This time, I’ll show you how to import table data from a web page. For this to work, the page we access must define the data with HTML table tags (table, tr, td). Unfortunately, many web sites no longer use “tables” and prefer “div” tags instead, so if this code doesn’t work, check the HTML source code of the page.
For testing purposes, I’ll try to fetch exchange rates from the CNN Money International web site. There are two tables on the page, one for the exchange rates and one for the world markets.
The Python code is very simple:
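If you are not sure whether a page really uses table tags, a quick check of its HTML source can tell you. The snippet below is only a minimal sketch using Python's standard library; the `TableCounter` class, the `count_tables` helper, and the two sample HTML strings are hypothetical names made up for illustration, not part of Pandas.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "table":
            self.tables += 1


def count_tables(html):
    """Return how many <table> elements start in the given HTML string."""
    counter = TableCounter()
    counter.feed(html)
    return counter.tables


# Made-up sample pages: one built with table tags, one built with divs.
page_with_table = "<html><body><table><tr><td>1.2760</td></tr></table></body></html>"
page_with_divs = "<html><body><div class='row'><div>1.2760</div></div></body></html>"

print(count_tables(page_with_table))  # 1
print(count_tables(page_with_divs))   # 0
```

A count of zero suggests the data is laid out with other tags, in which case a table-based import has nothing to pick up.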
```python
import pandas as pd

# Fetch every <table> on the page; each one becomes a DataFrame.
df_list = pd.read_html("http://money.cnn.com/data/currencies/")
print(df_list)
```

Output:

```
[         Currencies       $1=  Change inU.S. dollars  % Change  \
0  Argentinean Peso   17.6299                 0.0099   +0.056%
1    Brazilian Real    3.2817                -0.0244   -0.738%
2   Canadian Dollar    1.2760                 0.0001   +0.005%
3      Chilean Peso  631.8000                -2.2000   -0.347%
4    Dominican Peso   47.3280                -0.1750   -0.368%
5      Mexican Peso   19.1027                -0.0645   -0.337%

          52-week range
0    14.85Today|||17.83
1      3.04Today|||3.58
2      1.21Today|||1.38
3  604.04Today|||670.90
4    45.29Today|||48.06
5    17.45Today|||22.03
,    Index  1 day change  Level
0  NaN  Nikkei 225 Japan  +0.04%  22548.35
1  NaN  Hang Seng China   -0.02%  28596.80
2  NaN  FTSE 100 England  +0.02%   7561.64
3  NaN  CAC 40 France     -0.23%   5505.17]
```
I examined the HTML code of the page and saw that these tables have different IDs. The ID of the exchange rates table is “wsod_currencyExhangeRatesTable”. I use this ID to fetch only the exchange rates table:
```python
import pandas as pd

# Only fetch the table whose id attribute matches the exchange rates table.
df_list = pd.read_html(
    "http://money.cnn.com/data/currencies/",
    attrs={'id': 'wsod_currencyExhangeRatesTable'},
)
print(df_list)
```

Output:

```
[         Currencies       $1=  Change inU.S. dollars  % Change  \
0  Argentinean Peso   17.6300                 0.0100   +0.057%
1    Brazilian Real    3.2838                -0.0223   -0.675%
2   Canadian Dollar    1.2755                -0.0004   -0.031%
3      Chilean Peso  631.8300                -2.1700   -0.342%
4    Dominican Peso   47.3280                -0.1750   -0.368%
5      Mexican Peso   19.0832                -0.0841   -0.439%

          52-week range
0    14.85Today|||17.83
1      3.04Today|||3.58
2      1.21Today|||1.38
3  604.04Today|||670.90
4    45.29Today|||48.06
5    17.45Today|||22.03  ]
```
The read_html function returns a list of DataFrames even if there’s only one table. We need to use an index (e.g. df_list[0]) to access the first table.
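To see the list behaviour without depending on a live web page, the same calls can be tried on a small inline HTML string. This is just a sketch: the HTML snippet and the `rates` and `markets` IDs are made up for illustration, and read_html still needs an HTML parser such as lxml (or BeautifulSoup with html5lib) to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; the ids are illustrative, not CNN's.
html = """
<table id="rates">
  <thead><tr><th>Currencies</th><th>$1=</th></tr></thead>
  <tbody><tr><td>Canadian Dollar</td><td>1.2760</td></tr></tbody>
</table>
<table id="markets">
  <thead><tr><th>Index</th><th>Level</th></tr></thead>
  <tbody><tr><td>Nikkei 225</td><td>22548.35</td></tr></tbody>
</table>
"""

# Without attrs, every table on the page comes back as its own DataFrame.
all_tables = pd.read_html(StringIO(html))

# With attrs, only the matching table is returned, but still inside a list.
rates_only = pd.read_html(StringIO(html), attrs={"id": "rates"})

# Index into the list to get the DataFrame itself.
rates = rates_only[0]
print(len(all_tables), len(rates_only))
print(rates)
```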
You probably noticed that the last column contains both the min and max values; it would be better to extract them into separate columns. Here’s the script:
```python
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 6 16:01:21 2017

@author: Gokhan Atil
"""

import pandas as pd


def main():
    """ main """
    df_json_raw = pd.read_json('test.json')
    # Pull the name and email fields out of the first item of each row.
    df_json = df_json_raw.apply(
        lambda x: pd.Series([x[0]['name'], x[0]['email']]), axis=1)
    df_json.columns = ['name', 'email']
    print(df_json)


main()
```
and the output:
```
         Currencies       $1=  Change inU.S. dollars  % Change     min     max
0  Argentinean Peso   17.6290                 0.0090   +0.051%   14.85   17.83
1    Brazilian Real    3.2899                -0.0162   -0.490%    3.04    3.58
2   Canadian Dollar    1.2761                 0.0002   +0.016%    1.21    1.38
3      Chilean Peso  632.3000                -1.7000   -0.268%  604.04  670.90
4    Dominican Peso   47.3280                -0.1750   -0.368%   45.29   48.06
5      Mexican Peso   19.1031                -0.0641   -0.334%   17.45   22.03
```
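For the exchange rates table, the split of the “52-week range” column into min and max columns could look roughly like the sketch below. It is a minimal illustration rather than the original script: the `split_range` helper is a hypothetical name, the regular expression assumes every value has the form shown earlier (for example 14.85Today|||17.83), and a small hand-built DataFrame stands in for the fetched table so the snippet runs without network access.

```python
import pandas as pd


def split_range(df, column="52-week range"):
    """Return a copy of df with `column` replaced by numeric min and max columns.

    Hypothetical helper: assumes values look like '14.85Today|||17.83'.
    """
    # Capture the number before 'Today' and the number after the three pipes.
    parts = df[column].str.extract(r"([\d.]+)Today\|\|\|([\d.]+)")
    result = df.drop(columns=[column])
    result["min"] = parts[0].astype(float)
    result["max"] = parts[1].astype(float)
    return result


# Stand-in for df_list[0], using values from the output shown earlier.
sample = pd.DataFrame({
    "Currencies": ["Argentinean Peso", "Brazilian Real", "Chilean Peso"],
    "52-week range": [
        "14.85Today|||17.83",
        "3.04Today|||3.58",
        "604.04Today|||670.90",
    ],
})

result = split_range(sample)
print(result)
```

With the real data, the same helper would be called on the first element of the list returned by read_html.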
So we successfully fetched the table data from a web site and parsed it. Did you see how easy it is to manipulate the columns of Pandas DataFrames? See you in the next blog post!