Channel: Gokhan Atil – Gokhan Atil’s Blog

Python for Data Science – Importing table data from a web page

This is another blog post about using the pandas package. This time, I’ll show you how to import table data from a web page. For this to work, the page must define the data with real table tags (table, tr, td). Unfortunately, many web sites no longer use “table” tags and build their layouts with “div” tags instead, so if this code doesn’t work, check the HTML source code of the page.
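One quick way to check a page's source for table tags is to count them with Python's standard library. This is only a minimal sketch: the `TableCounter` class, the `count_tables` helper and the two sample HTML strings are made up for illustration, and in practice you would feed it the source of the page you downloaded.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts the opening <table> tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case
        if tag == "table":
            self.tables += 1


def count_tables(html):
    """Return the number of <table> elements found in an HTML string."""
    counter = TableCounter()
    counter.feed(html)
    return counter.tables


# Hypothetical page sources, for illustration only
sample_with_table = "<html><body><table><tr><td>1</td></tr></table></body></html>"
sample_with_divs = "<html><body><div class='row'><div>1</div></div></body></html>"

print(count_tables(sample_with_table))
print(count_tables(sample_with_divs))
```

If the count is zero, read_html has nothing to parse and you would need a different approach for that page.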

For testing purposes, I’ll try to fetch exchange rates from the CNN Money International web site. There are two tables on the page, one for the exchange rates and one for the world markets.

The Python code is very simple:

import pandas as pd

# read_html returns one DataFrame per table element found on the page
df_list = pd.read_html("http://money.cnn.com/data/currencies/")

print(df_list)

Output:

[         Currencies       $1=  Change inU.S. dollars % Change  \
0  Argentinean Peso   17.6299                 0.0099  +0.056%   
1    Brazilian Real    3.2817                -0.0244  -0.738%   
2   Canadian Dollar    1.2760                 0.0001  +0.005%   
3      Chilean Peso  631.8000                -2.2000  -0.347%   
4    Dominican Peso   47.3280                -0.1750  -0.368%   
5      Mexican Peso   19.1027                -0.0645  -0.337%   

          52-week range  
0    14.85Today|||17.83  
1      3.04Today|||3.58  
2      1.21Today|||1.38  
3  604.04Today|||670.90  
4    45.29Today|||48.06  
5    17.45Today|||22.03  ,                    Index 1 day change     Level
0 NaN  Nikkei 225  Japan       +0.04%  22548.35
1 NaN   Hang Seng  China       -0.02%  28596.80
2 NaN  FTSE 100  England       +0.02%   7561.64
3 NaN     CAC 40  France       -0.23%   5505.17]

I examined the HTML code of the page and saw that these tables have different IDs. The ID of the exchange rates table is “wsod_currencyExhangeRatesTable”. I use this ID to fetch only the exchange rates table:

import pandas as pd

# attrs restricts the search to tables whose HTML attributes match
df_list = pd.read_html("http://money.cnn.com/data/currencies/", attrs={'id': 'wsod_currencyExhangeRatesTable'})

print(df_list)

Output:

[         Currencies       $1=  Change inU.S. dollars % Change  \
0  Argentinean Peso   17.6300                 0.0100  +0.057%   
1    Brazilian Real    3.2838                -0.0223  -0.675%   
2   Canadian Dollar    1.2755                -0.0004  -0.031%   
3      Chilean Peso  631.8300                -2.1700  -0.342%   
4    Dominican Peso   47.3280                -0.1750  -0.368%   
5      Mexican Peso   19.0832                -0.0841  -0.439%   

          52-week range  
0    14.85Today|||17.83  
1      3.04Today|||3.58  
2      1.21Today|||1.38  
3  604.04Today|||670.90  
4    45.29Today|||48.06  
5    17.45Today|||22.03  ]

The read_html function returns a list of DataFrames even if there’s only one table. We need to use an index (i.e. df_list[0]) to access the first table.
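To try this without depending on a live web page, the same calls can be run against an inline HTML string. This is a sketch under assumptions: the HTML snippet and the IDs "rates" and "markets" are invented for illustration, and read_html needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables, each with its own id
html = """
<table id="rates">
  <tr><th>Currency</th><th>Rate</th></tr>
  <tr><td>Canadian Dollar</td><td>1.2760</td></tr>
  <tr><td>Mexican Peso</td><td>19.1027</td></tr>
</table>
<table id="markets">
  <tr><th>Index</th><th>Level</th></tr>
  <tr><td>Nikkei 225</td><td>22548.35</td></tr>
</table>
"""

# Without attrs, every table on the page is returned
all_tables = pd.read_html(StringIO(html))

# With attrs, only the matching table is returned, still wrapped in a list
rates_only = pd.read_html(StringIO(html), attrs={"id": "rates"})

# Index into the list to get the DataFrame itself
rates = rates_only[0]
print(len(all_tables), len(rates_only))
print(rates)
```

Wrapping the string in StringIO makes read_html treat it as file content rather than as a URL or path.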

You probably noticed that the last column contains both the min and max values; it would be better to extract them into separate columns. Here’s the script:

# -*- coding: utf-8 -*-
"""
Created on Tue Nov  6 16:01:21 2017

@author: Gokhan Atil
"""

import pandas as pd

def main():
    """ main """
    df_json_raw = pd.read_json('test.json')
    df_json = df_json_raw.apply(lambda x: pd.Series([x[0]['name'], x[0]['email']]), axis=1)
    df_json.columns = ['name', 'email']
    print(df_json)

main()

and the output:

         Currencies       $1=  Change inU.S. dollars % Change     min     max
0  Argentinean Peso   17.6290                 0.0090  +0.051%   14.85   17.83
1    Brazilian Real    3.2899                -0.0162  -0.490%    3.04    3.58
2   Canadian Dollar    1.2761                 0.0002  +0.016%    1.21    1.38
3      Chilean Peso  632.3000                -1.7000  -0.268%  604.04  670.90
4    Dominican Peso   47.3280                -0.1750  -0.368%   45.29   48.06
5      Mexican Peso   19.1031                -0.0641  -0.334%   17.45   22.03
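The split of the range column can be sketched as follows. This is a minimal sketch and not necessarily the original script: it assumes the column is named "52-week range" and that its values follow the "14.85Today|||17.83" pattern seen in the earlier output. The three sample rows are copied from that output, and the names `bounds` and `result` are only illustrative.

```python
import pandas as pd

# Sample rows copied from the output shown earlier (not fetched live)
df = pd.DataFrame({
    "Currencies": ["Argentinean Peso", "Brazilian Real", "Canadian Dollar"],
    "52-week range": ["14.85Today|||17.83", "3.04Today|||3.58", "1.21Today|||1.38"],
})

# Named groups become the column names of the extracted DataFrame
pattern = r"(?P<min>[\d.]+)Today\|+(?P<max>[\d.]+)"
bounds = df["52-week range"].str.extract(pattern).astype(float)

# Replace the combined column with the two numeric columns
result = df.drop(columns=["52-week range"]).join(bounds)
print(result)
```

Using str.extract with named groups keeps the parsing in one step and yields numeric columns that can be sorted or compared directly.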

So we successfully fetched the table data from a web site and parsed it. Did you see how easy it is to manipulate the columns of pandas DataFrames? See you in the next blog post!

