[API Suggestion] MD5 Checksum API
Kaz.5430:
I’ve been considering making a project using some of the APIs, but I’d cache the output to prevent unnecessary calls to the API.
A lot of the API results contain a lot of data.
Downloading and parsing that data is pointless if I’ve already cached the output and nothing has changed.
It would be nice if all API variations could have a checksum to identify the content output, without actually calling the particular API.
e.g.
I have a script that uses the items API. I see from the build ID that the build has changed, but I don’t know whether it’s just a bug fix or whether new items have been added.
Instead of calling the items api and parsing everything again, I call the checksum api to see if the items output has actually changed from when I last called it.
This would also be useful for map details, dyes, objectives, etc.
DarkSpirit.7046:
Instead of finding out whether it has changed, why don’t you just pass the cached value back along with the age of the cached value?
Some clients may not mind an older value, sometimes, in return for a faster response time. How about creating a meta server that accepts a max_age parameter (the maximum acceptable age of a cached value) along with the web API call? If the meta server can satisfy the request on its own, it returns its own cached value. Otherwise, it has to fetch the value from the other meta servers or from the ArenaNet server.
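A minimal sketch of the max_age idea in Python, assuming a single meta server with an in-memory store. The class name, the `fetch` callback and the injectable `clock` are hypothetical and only illustrate the lookup logic; forwarding to other meta servers is left out.

```python
import time
from typing import Callable, Dict, Tuple


class MaxAgeCache:
    """Serve a cached API response when it is no older than the caller's max_age."""

    def __init__(self, fetch: Callable[[str], str],
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch  # hypothetical function that calls the origin API
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}  # url -> (time stored, body)

    def get(self, url: str, max_age: float) -> Tuple[str, float]:
        """Return (body, age in seconds); go to the origin only if the copy is too old."""
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None:
            stored_at, body = entry
            age = now - stored_at
            if age <= max_age:
                return body, age  # cached copy is fresh enough for this client
        body = self._fetch(url)  # missing or too old: fetch and remember it
        self._store[url] = (now, body)
        return body, 0.0
```

Each client picks its own max_age per call, so one client can accept a minute-old value while another forces a fresh fetch of the same URL.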
famfamfam.9137:
As I currently see it, most of the endpoints are one of the following:
- Would be expected to be different if called even within a few seconds of each other (so API-client caching isn’t helpful other than as a method of rate-limiting)
OR - Expected to change fairly reliably only on a build change. There might be some instances where a new game build does not update this data, but the game doesn’t update often enough that I can see it hurting to hit an essentially static resource and re-parse it.
If this is implemented API-side, it would make much more sense for it to be handled by ETags, If-None-Match and 304 responses. HTTP is a complex yet wonderful beast: http://www.mnot.net/cache_docs/
DarkSpirit.7046:
If this is implemented API-side, it would make much more sense for it to be handled by ETags, If-None-Match and 304 responses. HTTP is a complex yet wonderful beast: http://www.mnot.net/cache_docs/
But as stated in your link, that is not a trivial task. Furthermore only arenanet origin servers would know if the content has changed so clients would have to query it. And if the request is authenticated or secure (i.e. https) it won’t be cached.
Clients may not always need the most up-to-date response, so I prefer to leave it to the clients to decide what they need. As mentioned in your link, ETags are useful in the case of serving static contents (i.e. files). However, the web server may not know enough about the dynamic content to generate them.
Kaz.5430:
I’m talking about caching in the context of storing the parsed json responses in a local relational database so that you can query that database for frequently requested content, instead of asking the server for the same response over and over.
The build updates any time a bug is hotfixed, so a new build doesn’t mean that there have been changes to whatever you’re querying the API for.
As I see it, any efficient script that has anything to do with items or maps should be storing the information locally but checking that the copy it has stored is still valid. There is absolutely no point in calling a 200KB script if the contents have not changed. Sure that might be easier, but I prefer efficiency to ease.
Calling an API and getting a 32 character MD5 checksum as the response, is far more efficient than calling the API itself and getting a 200KB response. Especially if it turns out to be the same 200KB response it gave you last time.
e.g.
https://api.guildwars2.com/md5.json?api=https%3A%2F%2Fapi.guildwars2.com%2Fv1%2Fmaps.json
It would be very simple to implement. You pass a URL-encoded API URL to the MD5 API, it loads the script on the ANet servers, hashes the result and sends you the hash instead of the full JSON response. You compare the hash to the version you’ve cached; if it’s different, you download the full 200KB.
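Seen from the client, that flow might look roughly like the Python sketch below. The md5.json endpoint is only the proposal from this post and does not exist in the real API; `fetch_checksum` and `fetch_body` are hypothetical callbacks standing in for the two HTTP requests.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import quote

# Hypothetical endpoint taken from the example URL above; not part of the real API.
CHECKSUM_ENDPOINT = "https://api.guildwars2.com/md5.json"


def checksum_url(api_url: str) -> str:
    """Build the checksum request by URL-encoding the target API URL."""
    return CHECKSUM_ENDPOINT + "?api=" + quote(api_url, safe="")


def refresh(api_url: str,
            cache: Dict[str, Tuple[str, bytes]],
            fetch_checksum: Callable[[str], str],
            fetch_body: Callable[[str], bytes]) -> Tuple[bytes, bool]:
    """Return (body, downloaded); the full body is fetched only when the hash differs."""
    remote_sum = fetch_checksum(checksum_url(api_url))
    cached = cache.get(api_url)
    if cached is not None and cached[0] == remote_sum:
        return cached[1], False  # hash unchanged: reuse the cached body
    body = fetch_body(api_url)   # hash changed or nothing cached: download it all
    cache[api_url] = (remote_sum, body)
    return body, True
```

Note that a changed response costs two requests instead of one, so this only pays off for large responses that rarely change.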
It could be further optimized if ANet stored timestamp and update information with the hashes it gives out. You could then call an API with your last hash and get only the changes to the response since that hash, rather than parsing the entire response.
e.g.
https://api.guildwars2.com/v1/items.json?hash=32bhgdfyh125abhrfhj67k0dvns15d4g
DarkSpirit.7046:
I agree that a new build doesn’t mean that there have been changes to whatever you’re querying the API for. The term “new build” can be ambiguous because I presume that ArenaNet has server and client builds and that they can patch their servers without needing to patch clients and vice versa. I may be wrong, but this would mean that client and server can have different build numbers and each can have a ‘new build’ at different times.
What you are proposing would be similar to the ETags feature mentioned by famfamfam. I have a few questions about your proposal:
1. What if certain api responses are fewer than 32 characters? Wouldn’t it be more efficient to just return the response than generating the MD5 checksum, and then returning them?
2. What about those APIs whose responses change often? Wouldn’t generating the MD5 checksum be quite redundant in those cases and be a further hit to the server performance?
3. Does this require all clients to be able to generate MD5 checksums from server responses or for the server to always return the MD5 checksum with each of its responses? If it is the former, that would be additional restrictions on the clients. If it is the latter, then the clients would have to remember and manage checksums.
4. What are the advantages/disadvantages of this approach over ETags?
Healix.5819:
They try to avoid processing with the API. Having a checksum option would most likely require them to constantly recalculate it since the API itself probably doesn’t know when it has been changed. It probably just reads the data directly and returns the response. With the amount of requests they receive, it’s not feasible to calculate the sum with every request. That would mean they would have to calculate it every so often, which would open up problems where the data has changed but the sum has not. It wouldn’t really matter though, since you don’t need up to the millisecond data.
A 3rd party site could provide this functionality. Sites like gw2stats.net are already constantly checking some of the APIs, so sites like those could also provide a checksum or a modified-since value.
Checking the build is meaningless when it comes to new data. Remember, the data is only available once it is found. That means items might not show up for days or even months after a new build is introduced. If any items are changed however, they will be changed with a build.
The only advantage to using “md5.json?=” over an ETag or If-Modified-Since header is that they could hope nobody uses it, so they wouldn’t have to calculate the checksum. The obvious downside is that it would take 2 requests to download something, whereas with the headers, the server would stop or continue the response. The problem with additional headers is that they further bloat responses.
1. What if certain api responses are fewer than 32 characters? Wouldn’t it be more efficient to just return the response than generating the MD5 checksum, and then returning them?
When requesting the status of a single event, the response header is around the same size as the data. Furthermore, if you request it to be gzipped, the response is bloated, becoming larger than it would be uncompressed. Cases where efficiency is lost already exist.
DarkSpirit.7046:
A 3rd party site could provide this functionality. Sites like gw2stats.net are already constantly checking some of the APIs, so sites like those could also provide a checksum or a modified-since value.
The problem with gw2stats.net is that it shows the response times with reference from only one particular server. Furthermore, they do not cache and provide API responses back to other clients (only API statuses), so it is somewhat wasted bandwidth.
If we had a number of servers across the world that would cache and redistribute API responses, that would be more useful, since the more popular responses would tend to be cached. With enough caches, even if the ANet origin server goes down, clients can still operate in a limited fashion.
Drakma.1549:
A 3rd party site could provide this functionality. Sites like gw2stats.net are already constantly checking some of the APIs, so sites like those could also provide a checksum or a modified-since value.
The problem with gw2stats.net is that it shows the response times with reference from only one particular server. Furthermore, they do not cache and provide API responses back to other clients (only API statuses), so it is somewhat wasted bandwidth.
If we had a number of servers across the world that would cache and redistribute API responses, that would be more useful, since the more popular responses would tend to be cached. With enough caches, even if the ANet origin server goes down, clients can still operate in a limited fashion.
I’ve actually been thinking of adding something like this to gw2stats.net since I implemented the API status tool.
However, this is no small task. For example, I have been caching the data since May 29th (shortly after the API was released) and the amount of data is staggering. So staggering it’s at the point where I’m seriously having to consider moving hosts.
For example, the WvW data alone is 18GB without indexes in the database. Indexes add another 6GB. Now I realise I wouldn’t have to have that sort of retention for a simple checksum API, but the number of records alone that would have to be checked constantly numbers in the millions.
The events.json alone is 422,000 elements. There are over 26,000 items that would need to be checked constantly in the item_details.json call alone. To compare each item to an old version would take approximately 8 hours with the speed of the API, data storage, and latency added in.
That being said, there are some things that can be done pragmatically that should help you out. For instance, you will almost always have to pull some APIs live each time instead of checking to see if they changed. Some of those off the top of my head are match_details.json, events.json and guild_details.json. Some calls even give you a timestamp on when you should check for an update next (matches.json).
In addition to the above, I also have some tricks that help to keep the data up-to-date. For instance, when a user uses my website to view item information, it will first check the API to see if the data has changed. If it has, it updates it before it presents it to the user. This keeps the frequently accessed data refreshed and the seldom or never accessed data the same. Lower bandwidth, less overhead.
I’ll try to wrap this up here as this post has become longer than I expected, but I will attempt to make a checksum API for less “active” data to see if that is something that will be useful.
Finally, you mentioned that gw2stats.net is only one frame of reference. You are absolutely correct. I would love to be able to provide a service where you can run a script on your side and have it send data to gw2stats.net for a more “worldwide” representation of access to the GW2 API. If there is interest in that, I will be more than happy to provide it.
DarkSpirit.7046:
In addition to the above, I also have some tricks that help to keep the data up-to-date. For instance, when a user uses my website to view item information, it will first check the API to see if the data has changed. If it has, it updates it before it presents it to the user. This keeps the frequently accessed data refreshed and the seldom or never accessed data the same. Lower bandwidth, less overhead.
By “user” you only meant an HTML browser, right? On your website, the only JSON APIs you expose are the ones for API statuses. Are you planning to introduce JSON APIs on your website for clients to access your caches? You probably do not need to cache too far into the past. Furthermore, each server (including yours) does not need to support all of the APIs, as each can pick and choose which APIs it wants to cache and re-propagate to requesting clients. We can share the load. You can also allocate your resources dynamically based on demand.
Finally, you mentioned that gw2stats.net is only one frame of reference. You are absolutely correct. I would love to be able to provide a service where you can run a script on your side and have it send data to gw2stats.net for a more “worldwide” representation of access to the GW2 API. If there is interest in that, I will be more than happy to provide it.
Or you can just share the parts of your code for the caching and status calculations so that every server would be processing them in the same way. There would be less work for you this way.
Drakma.1549:
OK, I think I get it now. You’re basically looking for an API CDN. Honestly, I would worry about data aging as it moved across the ’net. Some of the data literally changes multiple times per second. I really don’t know how that would work out.
But, I’ll try to answer your questions in order.
- Yes, by “user” I mean an HTML browser
- Yes, I am planning on releasing a JSON (and possibly CSV) API
As far as only caching what I need, I do that now. There is not a single API that I don’t use currently so I end up caching them all. But as I said previously, I don’t always cache everything (specifically the lesser used items/recipes).
I’m hesitant to release my code (honestly) only because I am a casual programmer. Back in the day, I was a pretty hard-core Perl, C/C++ programmer, but those days are long gone. I set out to learn PHP and JS with this project so I am quite sure that my code would look horrendous in the eyes of some “real” programmers.
However, I am more than happy to talk about my methodologies and even calculations that I use.
Quite basically, I retrieve what I need as often as I possibly can. With the exception of one call, I always put the daemon program to “sleep” for a short period of time before it just continues on refreshing the data. That one exception is match_details.json. I literally pull that as often as can be pulled for the sheer fact that I maintain a Live Map of WvW objectives.
As far as status calculations are concerned, that’s pretty simple. I already provide that information in the status_codes.json from my site. Viewing http://gw2stats.net/api/status_codes.json should tell you what I do for calculations.
As far as what I do pragmatically, it’s quite simple (This is what I do in PHP):
1) Start a timer.
2) Fetch the latest API using curl.
3) Stop the timer and calculate the difference from Step 1 in milliseconds. This becomes the retrieval time.
4) Using json_decode, I convert the raw JSON to an array.
5) Using count(), I count the total number of elements in the array (this includes nesting).
6) Ping the specified API domain (return -1 if down).
7) This all gets thrown into MySQL.
8) The API is a live pull from the MySQL database.
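The timing, decoding and counting steps above can be sketched as follows. This is a Python illustration rather than the PHP actually used on the site; the `fetch` callback is a hypothetical stand-in for the curl call, the recursive count is assumed to behave like PHP's count() in recursive mode, and the ping and MySQL steps are left out.

```python
import json
import time
from typing import Any, Callable, Dict


def count_elements(value: Any) -> int:
    """Count every element of decoded JSON, nested containers and their children included."""
    if isinstance(value, dict):
        children = value.values()
    elif isinstance(value, list):
        children = value
    else:
        return 0  # scalars have no elements of their own
    return sum(1 + count_elements(child) for child in children)


def sample_api(fetch: Callable[[], str]) -> Dict[str, Any]:
    """Time one fetch, decode the JSON and return the numbers that would be stored."""
    start = time.perf_counter()                              # start the timer
    raw = fetch()                                            # fetch the raw response
    retrieval_ms = (time.perf_counter() - start) * 1000.0    # retrieval time in ms
    decoded = json.loads(raw)                                # raw JSON to Python objects
    return {"retrieval_ms": retrieval_ms, "elements": count_elements(decoded)}
```

The returned dictionary is what a storage step would write to the database as one row per sample.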
I’m really enjoying this thread. It’s making me think about the way I do some things and sometimes I just like to talk “nerd.”
Kaz.5430:
With a header-based solution, I think ANet still has to generate the response and my server still has to download it.
Unless of course I were to implement some sort of Apache-level solution that would intelligently parse the header and then break the connection somehow. I’m not sure about the technical side of that, but I’d assume that the ANet server still needs to generate the content and probably upload it. All I’m doing is using API server resources, and then binning the output because I don’t need it.
If ANet are generating and uploading the json response anyway, and I’m downloading it anyway then I could just take an md5 hash from the download and do the checksum comparison locally. The whole point is to prevent ANet parsing and uploading superfluous data, and then my server downloading it.
On the subject of ‘what if the response was smaller than 32 characters’: it wouldn’t have to be an MD5 hash. A shorter CRC hash would probably work just as well, as would the size of the output in bits, or a Unix timestamp. MD5 was literally the first thing I thought of, and I could imagine how ANet might implement the API. However, if you’re expecting less data than the hash, you wouldn’t bother asking for the hash, because doing so would benefit neither the ANet API server nor your own.
Sending a hash with every request is also over-kill. The API server would need to be constantly using resources to generate a value that a lot of people would ignore anyway. This idea is more for larger apps that ask for a lot of content that doesn’t change all that much. You’d hope it would be the larger apps that have more users that take advantage of the caching options, in order to speed up their app.
If the API services take off, then eventually ANet will need to implement some form of authentication and query limiting solution. Amazon for example limits authenticated users to 2000 queries an hour (plus additional queries based on how much they sell) and after that queries will produce a 503 error until the next hour. That’s basically 1 query every 2 seconds. Scripts using the map API – especially if combined with real-time event status for every map on every server – are likely to bring about the need for rate limiting quicker, unless people implement local caching.
Let’s say you have a map script in which every time a user zooms or moves to generate a new view of the map, it calls https://api.guildwars2.com/v1/map_floor.json?continent_id=1&floor=1. That’s currently a 427KB document that has probably not changed between calls. Likewise calling all the individual tiles adds up to a lot of upload from the API server and – depending on implementation – download to the client, and possibly your server.
Now let’s imagine your app has 500 concurrent users, all requesting the 427KB map response and a selection of tiles simultaneously and repetitively. Caching the map information and images locally would likely lead to a faster experience for your app, and would equate to far less load on the API server.
The information that comes in large responses is pretty much static. Having some way of querying to see if anything in the world map has changed, to establish if the API is returning more information for a map, or to see if a tile jpg has been changed – without having to download the lot – would be significantly faster and reduce load.
To me, whether this functionality is provided as an MD5 api, or something else doesn’t really matter all that much. An MD5 hash was just what came to mind first.
DarkSpirit.7046:
Thanks Drakma. I think you did excellent work on your website and many people (myself included) posted worse code publicly as our learning attempts.
I enjoy this thread too, it gives me the chance to vent what I have always had in my mind, without actually finding time in my busy schedules to do the work.
Healix.5819:
Let’s say you have a map script in which every time a user zooms or moves to generate a new view of the map, it calls https://api.guildwars2.com/v1/map_floor.json?continent_id=1&floor=1.
I hope that’s just an example. Requesting static data should only be done once, when the app initializes.
Map tile images already implement an “md5 checksum” using headers. When using these headers, in your request, you supply your checksum and then the server makes a comparison. If they match, the server returns just a header stating that the content was not modified and if they don’t match, the request occurs like normal.
The actual request headers are “If-Modified-Since: x” or “If-None-Match: y” where x and y are from the response headers “Last-Modified: x” and “ETag: y”.
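A rough Python simulation of that exchange, with no real networking. Deriving the ETag from an MD5 of the body is an assumption (a server may use any stable token), the function names are hypothetical, and the header matching is simplified to a single exact ETag.

```python
import hashlib
from typing import Any, Dict, Tuple

Response = Tuple[int, Dict[str, str], bytes]  # (status code, headers, body)


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the body (MD5 is an assumption made for this sketch)."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


def serve(body: bytes, request_headers: Dict[str, str]) -> Response:
    """Server side: answer 304 with an empty body when If-None-Match equals the ETag."""
    etag = make_etag(body)
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, b""  # not modified: headers only
    return 200, {"ETag": etag}, body     # changed or first request: full response


def client_get(server_body: bytes, cache: Dict[str, Any]) -> Tuple[bytes, int]:
    """Client side: send the remembered ETag and reuse the cached body on a 304."""
    headers = {}
    if "etag" in cache:
        headers["If-None-Match"] = cache["etag"]
    status, response_headers, body = serve(server_body, headers)
    if status == 304:
        return cache["body"], status
    cache["etag"] = response_headers["ETag"]
    cache["body"] = body
    return body, status
```

Unlike a separate checksum call, the comparison and the download happen in one request: an unchanged resource costs only the response headers.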
famfamfam.9137:
Unless of course I was to implement some sort of apache level solution that would intelligently parse the header and then break the connection some how.
This is exactly how HTTP works: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.5 (I’d highly recommend the HTTP spec as some light reading by the way, it was a fantastic eye-opener when I read through it all).
But as stated in your link, that is not a trivial task. Furthermore only arenanet origin servers would know if the content has changed so clients would have to query it. And if the request is authenticated or secure (i.e. https) it won’t be cached.
These are not concerns.
1. Authenticated requests should be cached according to the outgoing Vary.
2. HTTPS is cached client-side by default, provided no Cache-Control/Pragma headers say otherwise.
Clients may not always need the most up-to-date response, so I prefer to leave it to the clients to decide what they need. As mentioned in your link, ETags are useful in the case of serving static contents (i.e. files). However, the web server may not know enough about the dynamic content to generate them.
The API servers are the authority for the data, so they should also be the authority for the allowed staleness of the data.
That said, I don’t think the API needs this at all:
- For fast-moving data (events, WvW objective status) you should treat the API as being realtime. No revision information needed.
- If you’re concerned about re-running your code on the slower-moving data (items/etc), just make a request every so often and compare MD5 hashes of the response body locally; it’s a lot of work to implement document versioning via ETags/Revision APIs when it’s something that you won’t be requesting very often anyway. Or just re-parse it every time anyway, and optimise it only when you need to?
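A minimal Python sketch of the local comparison, assuming the server returns byte-identical JSON when nothing has changed (different key order or whitespace would look like a change). The function name and the `seen` dictionary are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def parse_if_changed(name: str, body: bytes, seen: Dict[str, str]) -> Optional[Any]:
    """Return the parsed JSON only when the body's MD5 differs from the last one seen."""
    digest = hashlib.md5(body).hexdigest()
    if seen.get(name) == digest:
        return None  # identical bytes: skip parsing and any database updates
    seen[name] = digest
    return json.loads(body)
```

This still downloads the body every time; it only saves the parsing and storage work on your side.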
DarkSpirit.7046:
The API servers are the authority for the data, so they should also be the authority for the allowed staleness of the data.
That may make sense with a traditional web server serving static content from files, but it may not make as much sense here. The problem is: does ArenaNet’s web server even know what the allowed staleness of the data should be? Even their items and recipes depend on their discovery by players, so how is their web server supposed to know when the responses are going to change?
This is why I suggested that the clients would be in a better position to know what they would be using the responses for and for some tasks, they really do not need the latest data. Furthermore, they always have the choice to get the latest data if they want to, but perhaps at the cost of a longer response time.