{"id":163,"date":"2011-04-14T19:51:25","date_gmt":"2011-04-15T02:51:25","guid":{"rendered":"http:\/\/themanwhosoldtheweb.com\/blog\/?p=163"},"modified":"2011-04-14T20:05:14","modified_gmt":"2011-04-15T03:05:14","slug":"autoscaling-with-publicly-available-data","status":"publish","type":"post","link":"http:\/\/themanwhosoldtheweb.com\/blog\/2011\/04\/autoscaling-with-publicly-available-data\/","title":{"rendered":"Use publicly available datasets to create a value-added megasite."},"content":{"rendered":"<p>A few days ago, I launched a <a href=\"http:\/\/themanwhosoldtheweb.com\/blog\/2011\/04\/live-case-study-build-300000-page-autoscale-autopilot-site\/\">live case study where I created a 300,000 page site<\/a>. One curious reader emailed me and referred to it as a &#8220;megasite.&#8221;\u00a0 She saw the value in creating these massive sites and was very interested in creating her own megasite.\u00a0 So, let&#8217;s discuss this concept further.<\/p>\n<p>In my recent case study, I was able to create a 300,000-page site (upon launch) by <a href=\"http:\/\/themanwhosoldtheweb.com\/blog\/2011\/04\/bests-apis-to-autoscale\/\">leveraging an API<\/a>.\u00a0 In this article, we will explore another method of creating a value-added megasite.\u00a0 This time, we will leverage publicly available datasets instead.<\/p>\n<p><strong>First things first.\u00a0 What is a dataset? <\/strong>With some help from Wikipedia, a dataset is defined as a collection of data, usually presented in a table. 
Each column represents a particular attribute.\u00a0 Each row corresponds to a given entry of the dataset.\u00a0 For instance, if we have a dataset on cars, the columns can be &#8220;model,&#8221; &#8220;make,&#8221; &#8220;color,&#8221; &#8220;year,&#8221; and &#8220;license.&#8221;\u00a0 A row could then hold the values &#8220;Accord,&#8221; &#8220;Honda,&#8221; &#8220;White,&#8221; &#8220;2009,&#8221; and &#8220;TMWSW23.&#8221;<!--more--><\/p>\n<p>You can create a megasite by converting that dataset into a website.\u00a0 In this situation, each row of your dataset will correspond to a page on your website.\u00a0 In other words, if our dataset of cars had 5,000,000 rows, you could convert it into a site with 5,000,000 pages&#8211;one page for each car.<\/p>\n<p>Now, at a high level, here are the 3 steps to creating a megasite that adds value.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>1. Find a dataset.<\/strong><\/p>\n<p>Non-profit and government organizations are a great source of freely downloadable large datasets.\u00a0 Here, I&#8217;ve compiled just a small list for you to chew on.\u00a0 Google around for more.<\/p>\n<p><a href=\"http:\/\/data.worldbank.org\">http:\/\/data.worldbank.org<\/a><br \/>\n<a href=\"http:\/\/www.bls.gov\/data\/\">http:\/\/www.bls.gov\/data\/<\/a><br \/>\n<a href=\"http:\/\/data.vancouver.ca\">http:\/\/data.vancouver.ca<\/a><br \/>\n<a href=\"http:\/\/stats.oecd.org\/index.aspx\">http:\/\/stats.oecd.org\/index.aspx<\/a><br \/>\n<a href=\"http:\/\/data.un.org\/Explorer.aspx\">http:\/\/data.un.org\/Explorer.aspx<\/a><br \/>\n<a href=\"http:\/\/mdgs.un.org\/unsd\/mdg\/Data.aspx\">http:\/\/mdgs.un.org\/unsd\/mdg\/Data.aspx<\/a><br \/>\n<a href=\"http:\/\/www.ngdc.noaa.gov\/ngdc.html\">http:\/\/www.ngdc.noaa.gov\/ngdc.html<\/a><br \/>\n<a href=\"http:\/\/www.data.gov\">http:\/\/www.data.gov<\/a><br \/>\n<a href=\"http:\/\/www.data.gov.uk\">http:\/\/www.data.gov.uk<\/a><br \/>\n<a 
href=\"http:\/\/www.census.gov\/main\/www\/access.html\">http:\/\/www.census.gov\/main\/www\/access.html<\/a><\/p>\n<p>&nbsp;<\/p>\n<p><strong>2. Narrow down to a subset of the data that people are truly interested in.<\/strong><\/p>\n<p>This is the hard part.\u00a0 The dataset contains a tremendous amount of data.\u00a0 You don&#8217;t care about all of it, and neither will your site&#8217;s visitors.\u00a0 Your job now is to think carefully about this data and figure out what slice of it people actually find interesting.\u00a0 In essence, you are picking the niche focus of your site.<\/p>\n<p>Here is an example to illustrate this step.\u00a0 About a year ago, some company created a dataset of all of Facebook&#8217;s public profiles.\u00a0 They made this available as a free download.\u00a0 I downloaded it, and it was massive.\u00a0 It was several gigabytes!<\/p>\n<p>Anyway, I&#8217;m not interested in all that data.\u00a0 And, I&#8217;m not gonna spend hours scouring that data for what I am interested in.\u00a0 Here are, however, some random subsets of that dataset that I would be interested in:<\/p>\n<ul>\n<li><strong>Single girls in Los Angeles \ud83d\ude00<\/strong><\/li>\n<\/ul>\n<ul>\n<li><strong>People who listen to Foo Fighters<\/strong><\/li>\n<\/ul>\n<ul>\n<li><strong>People who watch the show 24<\/strong><\/li>\n<\/ul>\n<ul>\n<li><strong>People who like sushi<\/strong><\/li>\n<\/ul>\n<ul>\n<li><strong>Models on Facebook<\/strong><\/li>\n<\/ul>\n<ul>\n<li><strong>People who attended Cornell and now live in Los Angeles<\/strong><\/li>\n<\/ul>\n<ul>\n<li><strong>People who have over 5,000 friends<\/strong><\/li>\n<\/ul>\n<ul>\n<li><strong>Distribution of where people from my hometown are living across the US<\/strong><\/li>\n<\/ul>\n<ul>\n<li><strong>Distribution of where people from my alma mater are living across the US<\/strong><\/li>\n<\/ul>\n<ul>\n<li><strong>Starbucks baristas living in Los 
Angeles<\/strong><\/li>\n<\/ul>\n<ul>\n<li><strong>Actresses living in Los Angeles<\/strong><\/li>\n<\/ul>\n<p>But, because I was unwilling to spend the time dissecting that dataset, I won&#8217;t be able to browse the subsets listed above.\u00a0 Here is where you come in.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>3. Create a website based on this subset.<\/strong><\/p>\n<p>Countless people downloaded that Facebook dataset.\u00a0 However, few people actually explored it in any meaningful way.\u00a0 Why?\u00a0 In short, for two main reasons.\u00a0 One, it contained too much information.\u00a0 Two, it wasn&#8217;t presented in a friendly way.\u00a0 The files were text files.\u00a0 When I opened one of the files in Notepad, it froze my laptop.\u00a0 <em>What the&#8230;<\/em><\/p>\n<p>You&#8217;ve narrowed down to a specific subset of that data.\u00a0 For example, you&#8217;ve narrowed it down to only the profiles of single girls living in and around Los Angeles.\u00a0 Also, you are presenting the data in the form of a website&#8211;not as text files!\u00a0 People who visit the site can browse profiles like they would on Facebook.\u00a0 This is intuitive to them.\u00a0 Maybe they can filter by age, filter by district within Los Angeles, and search by interest.<\/p>\n<p>This is where your site creates true value and convenience for your users.\u00a0 You are presenting data in a useful, focused, and digestible way to your visitors.<\/p>\n<p>Best of all, we now have a megasite based on publicly available data.\u00a0 Content creation is always a pain, but we&#8217;ve bypassed it&#8211;all hundreds of thousands of pages&#8217; worth of it!<\/p>\n<p><em><strong>dave<\/strong><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A few days ago, I launched a live case study where I created a 300,000 page site. 
One curious reader emailed me and referred to it as a &#8220;megasite.&#8221;\u00a0 She saw the value in creating these massive sites and was very interested in creating her own megasite.\u00a0 So, let&#8217;s discuss this concept further. In my [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[23],"tags":[73,72,74,75],"class_list":["post-163","post","type-post","status-publish","format-standard","hentry","category-value-add","tag-dataset","tag-megasite","tag-publicly-available-data","tag-publicly-available-dataset"],"_links":{"self":[{"href":"http:\/\/themanwhosoldtheweb.com\/blog\/wp-json\/wp\/v2\/posts\/163","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/themanwhosoldtheweb.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/themanwhosoldtheweb.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/themanwhosoldtheweb.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/themanwhosoldtheweb.com\/blog\/wp-json\/wp\/v2\/comments?post=163"}],"version-history":[{"count":3,"href":"http:\/\/themanwhosoldtheweb.com\/blog\/wp-json\/wp\/v2\/posts\/163\/revisions"}],"predecessor-version":[{"id":165,"href":"http:\/\/themanwhosoldtheweb.com\/blog\/wp-json\/wp\/v2\/posts\/163\/revisions\/165"}],"wp:attachment":[{"href":"http:\/\/themanwhosoldtheweb.com\/blog\/wp-json\/wp\/v2\/media?parent=163"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/themanwhosoldtheweb.com\/blog\/wp-json\/wp\/v2\/categories?post=163"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/themanwhosoldtheweb.com\/blog\/wp-json\/wp\/v2\/tags?post=163"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}