March 21, 2013

Scraping a Feature Together

YQL (Yahoo Query Language) allows you to do amazing things, and has many feeds for a developer to play with. In Peak Stocks, we use it for Option Chains and it works beautifully. An additional feature we'd thought of was expanding the social media feeds. We've gotten a lot of requests for Twitter and thought, why not the Yahoo message boards too? This is where YQL comes into play.

YQL has community tables, which aren't actual SQL tables. These "tables" are just XML files that scrape pages or use existing feeds and return a response object formatted to your liking. Within the YQL console, these XML files are passed off as tables, making it seem like you're actually accessing a database. Anyone can create their own custom table using either of those two methods. The finance message boards don't have an open API, so I had to use the page-scraping method. Having not a lick of experience in doing any of this, it took me a couple of hours to get it going, with countless rounds of trial and error (view the commit history, you'll see). Once I finally got some data to work with, it got easier and easier. The final XML file can be found on my GitHub here.

The structure of the JavaScript is very simple: it's a try-catch block and two methods. One method gets the next-page URL, and the other loops through the <tr> elements within <tbody> and appends the rows to a <messages> element.
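
That shape can be sketched in plain JavaScript. This is only a minimal sketch under assumptions, not the actual table script: the real code runs inside YQL and works on the scraped page, while here plain objects stand in for the scraped rows. The names getNextPage, buildMessages, and scrapeBoard, and the nextLink and rows fields, are all hypothetical.

```javascript
// Hypothetical stand-in for the scraped page: plain objects instead of <tr> nodes.

// Method 1: return the next-page URL, or null when the page has none.
function getNextPage(page) {
  return page.nextLink ? page.nextLink : null;
}

// Method 2: loop over the rows and append one entry per row to "messages".
function buildMessages(rows) {
  const messages = [];
  for (const row of rows) {
    messages.push({ title: row.title, link: row.link });
  }
  return messages;
}

// The try-catch wrapper: any failure becomes an error result instead of a crash.
function scrapeBoard(page) {
  try {
    return {
      next_page: getNextPage(page),
      messages: buildMessages(page.rows),
    };
  } catch (err) {
    return { error: String(err.message) };
  }
}
```

Wrapping both calls in one try-catch means a change in the page markup shows up as an error field in the result rather than an unhandled exception.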

{
    "query": {
        "count": 1,
        "created": "2013-03-20T22:48:09Z",
        "lang": "en-US",
        "results": {
            "result": {
                "base_url": "//finance.yahoo.com",
                "next_page": "/mb/forumview/?&bn=12b865d6-8061-30c4-86c1-40adf7076aee&page=3",
                "size": "20",
                "symbol": "GOOG",
                "message_board": {
                    "messages": [
                        {
                            "title": "SHORT THIS MONSTER BUBBLEEEEEEEEEEEEEE INTOOOOO THE GROUNDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD...",
                            "link": "/mbview/threadview/?&bn=12b865d6-8061-30c4-86c1-40adf7076aee&tid=1363805153334-73fc3ece-f9cd-4819-b5b6-d9cc206bc590&t1ls=la%2Cd%2C20%2C3",
                            "published": {
                                "time": "4 hours ago",
                                "user": "pharmaheroouu",
                                "user_link": "/mbview/userview/?&u=pharmaheroouu&bn=12b865d6-8061-30c4-86c1-40adf7076aee"
                            },
                            "popularity": {
                                "up": "0",
                                "down": "0"
                            },
                            "replies": {
                                "count": "1",
                                "last_date": "4 hours ago",
                                "last_user": "pharmaheroouu",
                                "last_user_link": "/mbview/userview/?&u=pharmaheroouu&bn=12b865d6-8061-30c4-86c1-40adf7076aee"
                            }
                        }
                    ]
                }
            }
        }
    }
}
Output for GOOG
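
To use a response like this one, a client has to dig down to the messages list and resolve the relative links against base_url. Here is a minimal sketch, assuming the response has exactly the shape shown above and that "https:" is an acceptable scheme for the protocol-relative base_url. The function name extractMessages and the output field names are hypothetical.

```javascript
// Flatten a response of the shape shown above into a simple list of messages.
// "scheme" is an assumption: base_url is protocol-relative ("//finance.yahoo.com").
function extractMessages(response, scheme = "https:") {
  const result = response.query.results.result;
  const base = scheme + result.base_url;

  // Defensive: accept a single message object as well as an array of them.
  const raw = result.message_board.messages;
  const list = Array.isArray(raw) ? raw : [raw];

  return {
    symbol: result.symbol,
    // Absolute URL of the next page, or null when there is none.
    nextPage: result.next_page ? base + result.next_page : null,
    messages: list.map((m) => ({
      title: m.title,
      url: base + m.link,
      user: m.published.user,
      // Counts arrive as strings in the sample output, so convert them.
      replies: Number(m.replies.count),
    })),
  };
}
```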

Within the YQL console, there are two queries to choose from:

SELECT * FROM messageboard WHERE symbol='GOOG';
The most basic query, giving you page 1 of the message board.
SELECT * FROM messageboard WHERE symbol='GOOG' AND page=$LINK;
This second query lets you pick which page you'd like. The page value is returned by the initial request as a link.
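
The two queries can also be generated from code. Here is a minimal sketch, assuming the table is available under the name messageboard as in the queries above. The helper names quote and buildQuery are hypothetical, and wrapping the page value in single quotes is an assumption, since the original only shows a $LINK placeholder.

```javascript
// Wrap a value in single quotes for use in a query string.
// Rather than guess at an escaping rule, refuse values that contain a quote.
function quote(value) {
  const s = String(value);
  if (s.includes("'")) {
    throw new Error("value must not contain a single quote");
  }
  return "'" + s + "'";
}

// Build the basic query, or the paged one when a page value is given.
function buildQuery(symbol, page) {
  let q = "SELECT * FROM messageboard WHERE symbol=" + quote(symbol);
  if (page !== undefined && page !== null) {
    q += " AND page=" + quote(page);
  }
  return q + ";";
}
```

Calling buildQuery("GOOG") produces the first query as written above. Passing a second argument adds the page condition.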

As it stands, the file only parses the message board for a particular stock, getting all the threads within it. While it does give you the thread link, there is no way to get a feed for that individual thread... yet.

The next step is to create another file that handles an individual thread. This shouldn't be difficult now that I have the experience of creating this one.