Chapter 02

Chapter 2: second draft posted

For your reading pleasure, I just posted the second draft of Chapter 2 ("Uncovering the Mashup Potential of Websites"). It is a thoroughly rewritten version of the first draft and a much deeper piece of work. It took a lot of rethinking to get this 39-page chapter in order, but I think you'll like it. This concentrated effort also accounts for my silence on this blog.

At this point, I have written a first draft of everything in the book except Chapters 10, 11, and 12. (I have a pre-first-draft version of Chapter 10 online that I've uploaded so that my students can work from it.)

Chapter 02
Chapter 10


Screen-scraping references

Even though my book focuses on the use of formal APIs for mashups, I'd like to provide guidance on screen-scraping and other forms of reverse engineering to my readers. There's plenty of mashup work that can be done even if you confine yourself to formal APIs. Sometimes, however, it's handy or even necessary to supplement formal APIs with other ways to get at the data, functionality, or user-interface elements that you want to recombine or mash up.

Some inter-related areas to cover (or at least to make reference to):

Some issues I want to address:

  • In the book, we seek to exploit as much as we can of the structured information designed for consumption by programs before we move on to interpreting output meant primarily for human viewing. Screen-scraping brings up a lot of issues, technical and social, that we can get back to once you learn how to use APIs.
  • Legal issues and terms of use -- see Web scraping - Wikipedia, the free encyclopedia

Some references:

Chapter 02
screen scraping


Work to do for the second draft of Chapter 2

Chapter 2 analyzes Flickr for what makes it a mashup platform par excellence, one through which you can learn how to remix a specific application and exploit the features that make it so remixable. The chapter compares and contrasts Flickr with other remixable platforms: del.icio.us, Google Maps, and amazon.com. On my plate is writing the second draft of Chapter 2. Besides correcting small-scale errors, refining the prose of the chapter, and giving it a jazzier and more accurate title, my focus is on providing more details about mashups that could actually be created from the features I write about. I write that "a goal of this chapter is to train you [readers] to deconstruct applications for their remix and mashup potential." While I do spell out in substantial detail the ways URLs are constructed and organized for Flickr, amazon.com, Google Maps, and del.icio.us, I need to describe how to generalize these ideas to other circumstances and suggest possible mashups that can be built.

Here are some other issues to work out:

URLs as little languages and the connection to REST

I spend a lot of effort in Chapter 2 on the notion of "URLs as little languages to understand and to speak." I think that it's easy for experienced programmers to take these ideas about URLs (e.g., Hacking the URL) for granted. But I want to show the importance of being able to link to specific resources. For instance, LibraryLookup depends on being able to point to a book by constructing a URL based on an ISBN. If you can't easily link to a resource, you are going to be hard-pressed to reuse it, especially if there is no formal API. (Note: Some library catalogues have odd session-dependent cookies that make it difficult to forge such a URL to the book. You can sometimes manage to create a URL that will work (temporarily) through a multi-step screen-scraping process -- in contrast to just dropping an ISBN into a URL.)

Having a simple URL to represent a specific resource means one of the simplest mashup design patterns is possible: you can substitute some parameters and get the corresponding web page. For websites that don't have formal APIs, such URLs are the closest one comes to a programming interface. (Sometimes, even if there is an API, it is simpler to use the human user interface URL and do a bit of screen-scraping. And sometimes, when the API does not cover the functionality that you care about, going through the URL is the only way to go.)
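To make the "substitute some parameters" pattern concrete, here is a minimal Python sketch. The catalog address and the `isbn` parameter name are made up for illustration (every real catalog speaks its own URL language), so treat this as the shape of the idea rather than something that works against a particular library.

```python
from urllib.parse import quote

# Hypothetical catalog URL template; a real catalog will have its own syntax.
CATALOG_TEMPLATE = "http://catalog.example.org/search?isbn={isbn}"


def catalog_url(isbn, template=CATALOG_TEMPLATE):
    """Return a catalog URL for a book by dropping its ISBN into a template."""
    # ISBNs are often printed with hyphens; most URL schemes want them removed.
    cleaned = isbn.replace("-", "").strip()
    return template.format(isbn=quote(cleaned))


print(catalog_url("0-06-051448-5"))
```

Swapping in a different template is all it takes to "speak" another site's URL language, which is why a stable, session-free URL matters so much.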

I have a sense that there are deep connections between RESTful architecture and the importance of little URL languages -- but I can't put my finger on the specific connections. I just ordered a copy of Leonard Richardson and Sam Ruby's RESTful Web Services to help me better understand REST. Some impressions that I have about REST that I believe to be correct:

  • A fundamental idea behind REST is using URLs to represent resources.
  • If the website that you are trying to mash up is truly RESTful, then figuring out the structure of its URLs is akin to figuring out how resources are named in the application -- what the "nouns" are.
  • There would be pretty strong continuity between the structure of the human-facing website and any API in a RESTful site.
  • Coherent, clean URL languages correlate with good REST design.

Identifiers as glue

I want to strengthen my description of how to use identifiers, tags, and search terms to correlate similar or identical things within and across websites and applications. Think about the use of an ISBN in LibraryLookup and of latitude and longitude in Google Maps and Flickr -- how those identifiers and broadly used ways of describing things connect websites together.

How the mashups we studied in Chapter 1 make use of the techniques of Chapter 2

To make the three mashups we studied in Chapter 1, their creators had to understand the functioning of the constituent applications they were recombining. For instance:

  • for LibraryLookup, Jon Udell needed to understand the use of ISBNs as identifiers among library catalogs and other book-oriented websites (such as amazon.com and other bookstores). You can then use the ISBN (and speak the URL languages of various library catalogs) to glue together these websites via JavaScript. (There are some challenges: it was difficult for Udell to craft a totally user-friendly system for easily creating the LibraryLookup bookmarklet just for your library.)
  • for GMiF, a Greasemonkey script (and thus very much about remixing the existing user interface of an application), CK Yuan needed to understand the user interface of Flickr in order to insert the GMap icon among the other icons, and how user tagging can be hacked to hold location data, as others had already done (in a system that Flickr ultimately productized as machine tags). Moreover, on a prosaic level, you have to understand how to form URLs for each of the pictures.
  • housingmaps.com depends on craigslist, which has no formal API. Hence, Paul Rademacher had to parse the HTML and understand the URL structure of craigslist, what cities are covered, and how to make use of the RSS and supplement that data with screen-scraping.
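The LibraryLookup bullet above hinges on one small step: pulling an ISBN out of the URL of the page you are on. The real bookmarklet is JavaScript; what follows is only a Python stand-in for that step, and the pattern (a path segment of nine digits followed by a digit or X) is my own guess at what is good enough, not Udell's code.

```python
import re
from urllib.parse import urlsplit

# A path segment that looks like an ISBN-10: nine digits plus a digit or X.
ISBN10_SEGMENT = re.compile(r"/(\d{9}[\dXx])(?=/|$)")


def extract_isbn(url):
    """Return the first ISBN-10-looking path segment in the URL, or None."""
    match = ISBN10_SEGMENT.search(urlsplit(url).path)
    return match.group(1) if match else None


print(extract_isbn("http://www.amazon.com/gp/product/0060514485"))
```

Once you have the ISBN in hand, it can be dropped into the URL of whatever catalog you care about, which is the whole "identifier as glue" trick.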

What you get by studying the application and not just the API

My point is that developers need to understand apps as end-users too and not just jump into the API. Learn the application first (if you are an experienced developer and user of these types of applications, it won't take that long). It's worth the investment of time. Why not just jump into the API?

  • You're more likely to make a more useful mashup by availing yourself of knowledge as an end-user
  • You can plug the mashup into the context of how users are already using the application
  • You understand what is currently missing from the application and what can be improved
  • You see hooks into the application that are not necessarily obvious from the API alone
  • You can more easily make sense of the API when you know what key data entities are and some of the functionality -- you can ask, how might it be reflected in the APIs.

Looking for signs of mashability; ties to further chapters

Chapter 2 is also a prelude to the chapters that immediately follow it, which cover the elements of a website that make it more remixable. Indeed, the topics are the basis of a checklist of questions to pose in assessing the mashability/remixability/recombinatorial potential of applications:

  • Are tags used to describe resources on the website? (Described in greater detail in Chapter 3)
  • Are RSS and other syndication feeds available? (We will deal with this issue in greater depth in Chapter 4)
  • Do you see functionality for integrating with weblogs? (Chapter 5)
  • Is there an API for the application? (Chapters 6, 7, and 8)

In addition, you would look for the existence of browser toolbars, desktop clients, and mobile interfaces that interact with the websites -- they not only show that the website is remixable but often show how you can do so. (I will have to give specific examples here in the chapter, but I have some already installed in my own browser: del.icio.us Firefox extension and Amazon S3 Firefox Organizer(S3Fox)).

Data formats, nouns, and verbs

"What is the underlying data format?" -- and a related question "What are the core entities or resources in the website" -- are useful questions to pose when studying an application. If we use grammatical analogies, what are the "nouns"? When we look then at what functionality there is around the entities, we are asking what the "verbs" are. If there is an API, it will make a lot more sense if you have a sense of what those entities and their functionality are.

Chapter 02
REST


Amazon URL structures

I spend a substantial part of Chapter 2 on the topic of understanding the syntax and semantics of URLs in web applications. Knowing how URLs are formed lays the foundation for mashing them up later, and it also enables users to recombine content from various sites without much programming.

In the chapter, I look at URLs in Flickr, Google Maps, del.icio.us, and amazon.com. Below is an excerpt of the chapter about amazon.com. One major question I have is whether someone has documented the URL structures for amazon.com in a more comprehensive fashion, akin to what Google Map Parameters - Google Mapki does for Google Maps. I will post that question on the appropriate forums when I figure out what they are. Anyone out there know the answer?

Amazon walkthrough

Amazon.com is another interesting site to look at. Not only is it a popular e-commerce site, it is a pioneering e-commerce platform that is easily remixed and recombined with other content. Although we will study the Amazon APIs later, we focus here on amazon.com from the view of an end-user. Moreover, the goal in this section is not to learn all the features of amazon.com but rather to study the structure of URLs used in amazon.com -- specifically the question of how to link to the site. (While Amazon sells a lot of merchandise other than books, we will look at books to focus our walk-through. Moreover, we focus here on amazon.com, the site geared to the USA, instead of the network of sites aimed at customers outside the USA.)

The strategy we follow here is to discern the key entities of the amazon.com site through a combination of using and experimenting with the site, sifting through documentation, and seeing what other users have done. Note that since some of the conclusions are not supported by official documentation from amazon.com, there is no long-term guarantee behind the URLs.

Amazon items

It doesn't take much use of amazon.com to see that the central entity of the site is an item for sale (akin to a photo in Flickr). By looking at the URL of a given item and looking throughout a page describing it, you will see that Amazon uses ASIN (Amazon Standard Identification Number) as a unique identifier for its products.[1] For books that have an ISBN, the ASIN is the same as the ISBN for the book. According to the Wikipedia article, on amazon.com, you can point to a product with an ASIN with the following URL:

http://www.amazon.com/gp/product/[ASIN]

Take, for instance, Czeslaw Milosz’s New and Collected Poems (paperback edition), which has an ISBN of 0060514485. You can find it on amazon.com at

http://www.amazon.com/gp/product/0060514485

It is important to know that the way to link to amazon.com has changed in the past and will likely continue to change. For instance, you can also link to the book with

http://www.amazon.com/exec/obidos/ASIN/0060514485

or even a shorter form:

http://amazon.com/o/ASIN/0060514485
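The three link forms above differ only in their prefix, so they are easy to generate from an ASIN. Here is a small Python sketch; the templates are copied from the URLs shown here and, as noted, are not backed by official documentation, so any of them may stop working.

```python
# Link forms observed above; none is guaranteed by Amazon to remain valid.
AMAZON_ITEM_TEMPLATES = [
    "http://www.amazon.com/gp/product/{asin}",
    "http://www.amazon.com/exec/obidos/ASIN/{asin}",
    "http://amazon.com/o/ASIN/{asin}",
]


def amazon_item_urls(asin):
    """Return every known link form for the item with the given ASIN."""
    return [template.format(asin=asin) for template in AMAZON_ITEM_TEMPLATES]


for url in amazon_item_urls("0060514485"):
    print(url)
```

Keeping the templates in one list means that when Amazon changes its link syntax again, there is only one place to update.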

The use of this syntax would ideally be founded on official documentation from amazon.com. Where would one find definitive documentation on how to structure a link to a product with a given ASIN? A search through the Amazon developers' site leads to the technical documentation[2], whose latest version at the time of writing is the 2007-04-04 edition[3]. That trail leads ultimately to a page on the use of identifiers, which, alas, does not spell out how to formulate the URL for an item with a given ASIN.[4] The bottom line for now: the Wikipedia article plus experimentation is the best way to discern the URL structures of amazon.com.

Let's apply this approach to other functions of amazon.com. For instance, can we generate a URL for a full-text search? Go to amazon.com and drop in your favorite search term. Take, for example, flower. When you hit submit, you'll get a URL that looks like:

http://amazon.com/s/ref=nb_ss_gw/102-1755462-2944952?url=search-alias%3Daps&field-keywords=flower&Go.x=0&Go.y=0

If you do the search again, say in a different browser, you will get another URL. I got:

http://amazon.com/s/ref=nb_ss_gw/102-8204915-1347316?url=search-alias%3Daps&field-keywords=flower&Go.x=0&Go.y=0&Go=Go

Notice where things are similar and where the URLs are different from one another. Looking for what's common (the http://amazon.com/s prefix and ?url=search-alias%3Daps&field-keywords=flower&Go.x=0&Go.y=0&Go=Go argument), you might try to eliminate the sections which are different:

http://amazon.com/s/?url=search-alias%3Daps&field-keywords=flower&Go.x=0&Go.y=0&Go=Go

which seems to work fine. You can even eliminate &Go.x=0&Go.y=0&Go=Go to boil the request down to

http://amazon.com/s/?url=search-alias%3Daps&field-keywords=flower
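The trimming done by hand above can be sketched in Python with the standard library's urllib.parse. The choice of which parameters to keep (url and field-keywords) comes straight from the experiment above; the assumption that only the first path segment matters is mine and holds only as far as these examples go.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def reduce_search_url(url, keep=("url", "field-keywords")):
    """Strip session-specific path segments and unneeded query parameters."""
    parts = urlsplit(url)
    # Keep only the first path segment ("s"); drop ref=... and the session number.
    first_segment = parts.path.strip("/").split("/")[0]
    # Keep only the query parameters that the search seems to depend on.
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in keep]
    return urlunsplit(
        (parts.scheme, parts.netloc, "/" + first_segment + "/", urlencode(kept), "")
    )


long_url = (
    "http://amazon.com/s/ref=nb_ss_gw/102-8204915-1347316"
    "?url=search-alias%3Daps&field-keywords=flower&Go.x=0&Go.y=0&Go=Go"
)
print(reduce_search_url(long_url))
```

Whether the reduced URL still returns the same results is something you can only confirm by trying it in a browser.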

How do we limit the search to books? If you go to amazon.com, select the book section, and use a flower keyword, you will get a URL similar to

http://amazon.com/s/ref=nb_ss_gw/102-6984159-2338509?url=search-alias%3Dstripbooks&field-keywords=flower&Go.x=12&Go.y=6

Stripping away the same parameters as before gives you:

http://amazon.com/s/?url=search-alias%3Dstripbooks&field-keywords=flower

This trick works for the other departments. For example, to do a search on flowers in Home & Garden:

http://amazon.com/s/?url=search-alias%3Dgarden&field-keywords=flower
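Going the other direction, you can build a search URL from a keyword and a department. In this Python sketch the alias values aps, stripbooks, and garden are the ones seen in the URLs above; the function name and the default are my own choices, and I have not checked aliases for other departments.

```python
from urllib.parse import urlencode


def amazon_search_url(keywords, alias="aps"):
    """Build a search URL; alias picks the department (aps = all, as seen above)."""
    query = urlencode(
        [("url", "search-alias=" + alias), ("field-keywords", keywords)]
    )
    return "http://amazon.com/s/?" + query


print(amazon_search_url("flower", alias="garden"))
```

Note that urlencode takes care of escaping the "=" inside the url parameter as %3D, which is easy to get wrong when pasting strings together by hand.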

Let's run through the syntax of other organizational structures:

Lists

To go to the wishlist section:

http://www.amazon.com/gp/registry/wishlist/

If you are logged in, you will see a list of your lists on the left. Look at the URL of one of them, which will look like

http://www.amazon.com/gp/registry/wishlist/1U5EXVPVS3WP5/ref=cm_wl_rlist_go/102-5889202-4328156

You'll see that the right-hand number (e.g., 102-5889202-4328156) remains the same while the other one (e.g., 1U5EXVPVS3WP5) changes for each list, which suggests that 1U5EXVPVS3WP5 is the identifier for the list. You can point to a list by its list identifier with

http://www.amazon.com/gp/registry/wishlist/1U5EXVPVS3WP5
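Here is a hedged Python sketch of that reduction. The pattern assumes list identifiers are made of uppercase letters and digits, which is only a guess based on the single example above.

```python
import re

# Assumption: list IDs are uppercase letters and digits, as in the one example seen.
WISHLIST_ID = re.compile(r"/gp/registry/wishlist/([A-Z0-9]+)")


def wishlist_url(url):
    """Reduce a wishlist URL to its short form, or return None if no ID is found."""
    match = WISHLIST_ID.search(url)
    if match is None:
        return None
    return "http://www.amazon.com/gp/registry/wishlist/" + match.group(1)


print(wishlist_url(
    "http://www.amazon.com/gp/registry/wishlist/1U5EXVPVS3WP5"
    "/ref=cm_wl_rlist_go/102-5889202-4328156"
))
```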

Tags

Tags are a recent introduction to Amazon.com. You will see links like

http://www.amazon.com/tag/czeslaw%20milosz/ref=tag_dp_ct/102-8204915-1347316

which can be reduced to

http://www.amazon.com/tag/czeslaw%20milosz/
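Building a tag URL is mostly a matter of escaping the tag text. A minimal Python sketch, assuming that percent-encoding the tag (so a space becomes %20, as in the link above) is all Amazon expects:

```python
from urllib.parse import quote


def amazon_tag_url(tag):
    """Build the short tag URL, percent-encoding the tag (space becomes %20)."""
    return "http://www.amazon.com/tag/" + quote(tag) + "/"


print(amazon_tag_url("czeslaw milosz"))
```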

Subject headings

In looking through the Browse-subject section of amazon.com (http://www.amazon.com/Subjects-Books/b/?ie=UTF8&node=1000), you can find a link such as

http://www.amazon.com/b/ref=amb_link_1760642_21/104-0367717-9318361?ie=UTF8&node=5&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-3&pf_rd_r=0J0MADE0YSN1VRBA6XZS&pf_rd_t=101&pf_rd_p=233185601&pf_rd_i=1000

(which refers to the Computers & Internet section) and reduce it to

http://www.amazon.com/b/?ie=UTF8&node=5
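The same reduction for subject headings can be sketched in Python by picking the node parameter out of the query string. The ie=UTF8 parameter is hard-coded here simply because it appears in the short form above; I have not tested whether the link works without it.

```python
from urllib.parse import parse_qs, urlsplit


def browse_node_url(url):
    """Reduce a browse link to its short form, keeping only the node number."""
    params = parse_qs(urlsplit(url).query)
    node = params.get("node", [None])[0]
    if node is None:
        return None
    return "http://www.amazon.com/b/?ie=UTF8&node=" + node


print(browse_node_url(
    "http://www.amazon.com/b/ref=amb_link_1760642_21/104-0367717-9318361"
    "?ie=UTF8&node=5&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-3"
))
```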

(The fact that the node is specified by number rather than any word-based descriptor makes one concerned about the long term stability of the link. Will 5 always refer to computers or if there is another section added that goes before it alphabetically, will the link break?)

There are plenty of other entities whose URL structures can be discerned, including Listmania lists (e.g., http://www.amazon.com/favorite-literary-poles/lm/1FH0E3G892IA/ and http://www.amazon.com/lm/1FH0E3G892IA/), So You'd Like to... guides (e.g., http://www.amazon.com/gp/richpub/syltguides/fullview/3T3I3YDBG889B), and personal profiles (e.g., http://www.amazon.com/gp/pdp/profile/A2D978B87TKMS2/).



 

[1] http://en.wikipedia.org/wiki/Amazon_Standard_Identification_Number

[2] http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=19

[3] http://developer.amazonwebservices.com/connect/entry.jspa?externalID=703&categoryID=19

[4] http://docs.amazonwebservices.com/AWSECommerceService/2007-04-04/DG/ItemIdentifiers.html

Amazon
Chapter 02


Chapter 2: First draft

I have posted the first draft of Chapter 2 (pdf), "Looking at Flickr, Del.icio.us, Google Maps, and Amazon.com as end-user tools". The chapter analyzes Flickr (as our primary extended example) for what makes it the remix platform par excellence for learning how to remix a specific application and exploit the many features that make it so remixable. The chapter compares and contrasts Flickr with other remixable platforms: del.icio.us, Google Maps, and amazon.com.

Amazon
Chapter 02
Google Maps
del.icio.us
drafts
