Local Brit started off as a simple idea. Roughly 5 minutes in, it became a contest with myself - could it be done?
With the data in hand, could I create a local search for all of the UK in 24 hours?
Back when I began work on iBegin, a friend of mine in the UK got very excited. One of the first people to try out our Toronto local search, he quickly became a fan. Being the bum he was (now employed), he started to heckle me every waking moment about expanding to the UK quickly. While I told him we would in good time, I did have him start investigating how hard it would be.
Problems immediately cropped up. From data source to geocoding to many other headaches, the UK market was unique (hell, even the address system confused me). So while he kept his eyes open, we decided to wait until we were ready to expand into the UK.
This was half a year ago. About a week ago, by what can only be described as a freak series of coincidences, I had a database of business data for the UK fall into my lap. I was waiting on the redesign to be completed anyway, so I decided to give it a go.
Incidentally, the very day I was going to start, my dog was going nuts and would not shut the hell up. Lumbering from bed at 7:00 am, I was working half an hour later, and by 8:00 am was ready to tackle the local search engine.
And thus began a twisted journey. Racing against the clock, I set out to create a functional search engine that spanned all of the UK, and one good enough to be actually useful to someone who might be searching.
Barring the speed of the search engine (now fixed), I succeeded in what I set out to do. Exactly 24 hours after I had started, there sat a shiny new site. From domain registration to design to programming, it was all done in a span of 24 hours.
The end result? A lot of fun. I really enjoyed sitting down, putting my head down, and just grinding through it. Sure, my fiancee wasn't too thrilled, but doing it once a month sounds do-able.
So I hope you enjoy the idea behind the site. Remember - it was done by one guy in a span of 24 hours. And hopefully you find it useful too.
Attached below is my log while I worked. The only thing changed is that I went in and hyperlinked properly. All typos/ramblings/aggravations have been kept intact:
8:00 am - Started Work. Told Andrew quickly what I need.
8:10 am - wrote data importer. Takes data in text format and throws it into split mysql databases.
8:20 am - first peak of frontpage design. Happy. Need inner page done, and then cut.
8:27 am - received spam email. Good to know the world is still spinning
8:30 am - started to normalize table format
8:37 am - database import stuck, needed to restart.
8:38 am - table format finalized. A bit of confusion on some of the data content, but will figure it out!
8:45 am - researching geocoding, this could be a problem
8:46 am - received innerpage design. Also happy. Now to find someone who can slice this.
8:50 am - dog annoying the crap out of me
8:50 am - cannot locate any decent (both featurewise and pricewise) geocoder. Did find one, but it charges 5 cents per lookup. That will add up too fast. So for now, going to use nearby.org.uk and just do a postal code convertor.
8:52 am - importer still going
9:06 am - half of the first step of the import is done. The importer which synchronizes all the tables is not ready yet.
9:11 am - Synchronization importer is now ready.
9:14 am - Both importers are now working at full steam. Need to find someone to HTMLize the site asap.
9:19 am - Key generated from nearby.org.uk
9:32 am - Geocoding is turning out to be frustrating. Demio did find me someone to HTMLize the site, hopefully she can deliver.
9:42 am - Activated iShareMaps account
9:43 am - Paid for iShareMaps
10:00 am - Hurrah! Geocoding is now working. The second step of the importer is now going. There were problems with the server, but I have contained those. Took away valuble 10-15 minutes.
10:01 am - first 'set' of data is being synchornized. Blazing fast. Goody
10:02 am - Dammit. Artifacts in the db - duplicate db. Have to restart process.
10:07 am - Restarted. Headaches galore! But - design is being cut, data is being cleaned and imported. Need to figure out how I want to do search. A full blown search engine, or simple matching?
10:15 am - Early analysis says I have roughly 600 business types (eg Barristers, Doctors, Accountants) spread across 15 different categories (Shops/Retailres, Businesses at Home, Places of Worship, etc).
10:24 am - Starting work on GMap
10:52 am - GMap is causing me headaches. The move from V1 to V2 is not so easy :)
11:09 am - Excellent! I got the data in an array, and wrote the JS that reads the data and creates the markers/points. I had a problem with three tabs making the popup window screwed up, but using http://www.econym.demon.co.uk/googlemaps/examples/map10a.htm got that fixed. Hurrah!
11:10 am - Synchronization of data is still going. Slow going.
11:27 am - menu based navigation works. You can get a list of links, click on one, and it automatically zooms. Distracted as dog just got a bath and is going nuts.
11:29 am - 50% of data synchronization is done.
11:30 am - Talking to Guida, the HTMLification of the design is almost done. Good timing. May break soon for some lunch.
11:31 am - Cannot advance much. The HTML needs to be done, and the data is still being synchronized. Without it cannot formulate exact plan. Once the data is in, will geocode it while I implement the design.
11:42 am - Analyzing the data more, I will need to create a separate search db. This will push things a bit further back, but still should finish by deadline.
11:47 am - Search db specced out, time for lunch.
12:22 pm - back on the warpath.
12:35 pm - design is here, a few issues. Getting them fixed.
12:40 pm - Creating an index on the data. 5+ minutes of waiting time now.
12:45 pm - Still struggling with creating an index. Not good.
12:55 pm - Still stuck waiting. Not good at all. But in positive light, the design is almost fixed up, can get cracking on that.
1:00 pm - Design is here, and looking good.
1:25 pm - Long time without an update. The data is in, no need for index. Page names for all have been generated. Working on the search database, and also beginning to get geocoded data. I am not sure how the receiving server will like 1.7 million queries.
1:35 pm - Huge problem. The geocoder is only doing one every 2 seconds. For 1.7 million records, that means 3.4 million seconds. At 86,400 seconds per day, that = 40 days. Oh shit.
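A quick way to sanity-check this kind of estimate is to keep the unit conversions explicit. The sketch below is not part of the original tooling; the function name `geocoding_eta` is made up for illustration, and the only inputs taken from the log are the 1.7 million records and the rate of one lookup every 2 seconds.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def geocoding_eta(records: int, seconds_per_lookup: float) -> dict:
    """Return the total run time of a sequential geocoding job in several units."""
    total_seconds = records * seconds_per_lookup
    return {
        "seconds": total_seconds,
        "hours": total_seconds / SECONDS_PER_HOUR,
        "days": total_seconds / SECONDS_PER_DAY,
    }


# Figures from the log: 1.7 million records, one lookup every 2 seconds.
eta = geocoding_eta(1_700_000, 2)
print(f"{eta['seconds']:,.0f} s = {eta['hours']:,.0f} hours = {eta['days']:.1f} days")
```

Either way the job is far outside a 24 hour budget, which is why the rest of the log hunts for a bulk data source instead of a per-lookup service.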
1:45 pm - The geocoder is running, will tackle it when I can re-think. Maybe I will just populate initially with postal codes, and then run more specific geocoding. Also have gotten the search db script finished, time to tweak it.
1:53 pm - Problem with the search script. It is time consuming (as it takes a while for the mysql result set to get back).
1:57 pm - Doh! Stupid indexing error is what was popping up here and there.
2:00 pm - Search db script running fine. I am afraid of how large it will get (I guess 5 million). Now checking for distinct postal codes.
2:05 pm - too much strain on the db server, it crashed. Not good.
2:12 pm - Well well. 640k unique postal codes. At one lookup every 2 seconds, that would mean roughly 15 days. I think I will do two in parallel - one through the earlier mentioned geocoder, and the current with iShareMap
2:23 pm - only 1661 addresses geocoded. Now attempting to use Nearby.org.uk for the postal codes.
2:26 pm - That did not work out. Nearby.org.uk shut me off after roughly 5 queries. They do provide the geolocated points for the general postal codes (last two digits cut off), but an accuracy of 2km is far too off for me. I have to say, they are infinitely faster than the other one. Maybe I am doing something wrong - hrmm ....
2:30 pm - Trying to figure out what is wrong. On the free batch geocoder, they claim the desktop version can do 75 records per second. I must be missing something.
2:43 pm - Still hunting Geocoding. This is a problemo.
2:45 pm - I found one, but they actually want you to print it out and mail it to them. What in the hell? http://www.evoxfacilities.co.uk/evox7.htm
2:47 pm - Another one, and still no ability to order/download. http://www.graticule.com/data/postcode/index.html#unitpoints
2:49 pm - Ahh the pain continues. http://www.postcodeanywhere.co.uk/products/prices.aspx - on a credit level it is very expensive. Im willing to settle on just postal codes now, those are accurate enough (estimated 16 houses per postal code)
2:53 pm - Performed a modified test on iShareMap. Definitely a throttling issue. Decided to stop it, and just run it on the unique postal codes.
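The idea of geocoding each unique postal code once, rather than every listing, can be sketched roughly as follows. This is an illustration only: `lookup` stands in for whatever geocoding call is actually used, and `fake_lookup` plus the sample postcodes are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

LatLng = Tuple[float, float]


def geocode_unique(
    postcodes: Iterable[str],
    lookup: Callable[[str], Optional[LatLng]],
) -> Dict[str, Optional[LatLng]]:
    """Call `lookup` once per distinct postcode and cache the answers."""
    cache: Dict[str, Optional[LatLng]] = {}
    for postcode in postcodes:
        key = postcode.strip().upper()
        if key not in cache:
            cache[key] = lookup(key)
    return cache


# Hypothetical stand-in for a real (throttled) geocoding service.
calls: List[str] = []


def fake_lookup(postcode: str) -> Optional[LatLng]:
    calls.append(postcode)
    return (0.0, 0.0)  # placeholder coordinates


listing_postcodes = ["SW1A 1AA", "sw1a 1aa ", "M1 1AE", "SW1A 1AA"]
coords = geocode_unique(listing_postcodes, fake_lookup)
print(f"{len(listing_postcodes)} listings, {len(calls)} lookups")
```

Each listing then picks up its coordinates from the cache by postcode, so the number of external lookups is bounded by the number of distinct postcodes rather than the number of listings.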
2:58 pm - Decisions time. Basically I will be using Nearby.org.uk's zip that has roughly ~12k post codes.
3:05 pm - Imported 10,339 postal codes. The accruacy is 2km - this does peeve me off a bit, but alas, nothing I can do for now. The searh script was restarted, and on the side I am still pinging iShareMaps for the location of the 640k specific postal codes.
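A coarse lookup of this kind can be sketched as below, assuming the "general" postal codes are what the UK calls postcode sectors (the full postcode with its last two characters dropped). The function name, the `SECTOR_CENTRES` table and its coordinates are illustrative placeholders, not the Nearby.org.uk data.

```python
from typing import Dict, Optional, Tuple


def postcode_sector(postcode: str) -> str:
    """Reduce a full UK postcode to its sector, e.g. 'SW1A 1AA' -> 'SW1A 1'."""
    compact = "".join(postcode.split()).upper()
    if len(compact) < 5:
        raise ValueError(f"too short to be a full postcode: {postcode!r}")
    outward, inward = compact[:-3], compact[-3:]
    return f"{outward} {inward[0]}"


# Hypothetical sector table; the coordinates are rough placeholders.
SECTOR_CENTRES: Dict[str, Tuple[float, float]] = {
    "SW1A 1": (51.50, -0.14),
    "M1 1": (53.48, -2.24),
}


def coarse_location(postcode: str) -> Optional[Tuple[float, float]]:
    """Return the sector centre for a postcode, or None if the sector is unknown."""
    return SECTOR_CENTRES.get(postcode_sector(postcode))


print(postcode_sector("sw1a1aa"))
print(coarse_location("M1 1AE"))
```

Every business in the same sector lands on the same point, which is where the 2km accuracy limit comes from.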
3:14 pm - Planning out structure of site. Starting to tire a bit, but I think still on track. Once the search db is done I should be approaching the home stretch.
3:21 pm - Working on frontend, realized screwup in the search script. Dammit. Had to restart.
3:27 pm - My friend Leigh found a place where I could buy the data. Only problem is its in GIS (Grid) format. Will wait till later to see if have the time to work with it.
3:36 pm - The frontpage is ready. Break time, need to figure out dinner.
3:39 pm - Organic, gluten-free pizza it is. I love my fiancee. Back to work I go.
3:43 pm - Stopped the iShareMaps Geocoding. Leigh is working on the geocoding part, plus it was too slow anyhoo
3:44 pm - I hate mod_rewrite :)
4:00 pm - mod_rewrite done
4:01 pm - Running tests again for exact category and business type number.
4:05 pm - It is taking too long to manipulate the data. Worried.
4:26 pm - The very basics of the search system is working.
4:45 pm - Search results yield fruit
4:58 pm - Pagination working successfully
4:59 pm - Search db is still being compiled! A total of 3.4 million records. Three hours for 3.4 million records. I have been doing other stuff that was taxing the db, so I would say 1.5 million records per hour.
5:04 pm - Stopped search db compilation. Currently at 3.6 million records. Need to restart - need postal code and also some info properly set.
5:09 pm - Restarted search db compilation.
5:10 pm - Search db compiler has connected with db, it is now generating.
5:11 pm - Did a quick timing, it is doing roughly 50k entries per minute. If my estimation of 6 million entries is right - that means 120 minutes. 2 hours. Remains to be seen.
5:12 pm - Leigh says the data is good, and he has the Grid -> LL convertor working.
5:37 pm - Another problem with the search db. Need to restart again. Not good.
5:40 pm - Restarted search db.
5:47 pm - Tested again, doing roughly 4000 listings per minute. With roughly 1.7 million listings, that means ... 425 minutes. 7 hours!
5:59 pm - While Leigh seems to have disappeared, I hooked up the search results with the puny XXXX postal code lookup. With a range of 2km, it is rather poor, but at least it gets it somewhere close :)
6:00 pm - Break time, getting tired out.
6:06 pm - Back. Fixed a little bug in search results.
6:15 pm - I overlooked a component of the listings page. Had to pause search compilation
6:18 pm - Fixed the component, hoping search compilation can be re-started.
6:24 pm - Extracting distinct business types, the search db compilation is running at about 20% speed now.
6:26 pm - Dinner time. She has made some damn yummy looking pizza. Have I mentioned how much I love her?
6:51 pm - Back from dinner - damn delicious pizza. Doing some quick side-business.
6:56 pm - Compiled array of all the business types.
7:10 pm - Good progress on directory navigation
7:18 pm - 225,000 processes, and already 2.5 million search db entries. We could be headed towards some problems.
7:28 pm - Talked to Leigh. No go on the convertor he says. He couldn't get the formula to click, and had to head to bed.
7:38 pm - Looking good. Working on category navigation.
7:45 pm - Other than the the search compilation taking ages, everything else is looking fine and dandy. I hope to have this wrapped up in an hour. 13 hours to go :)
8:13 pm - Finalized frontpage splash. Category pages still take a while, waiting for the search compilation to be over before I index those areas.
8:20 pm - Fixed some bugs with the category pages. 350k listings processed, and already 3.7 million entries. Not looking good at all.
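To see why these numbers look worrying, a rough projection helps. The sketch below simply assumes the ratio of search entries to processed listings stays constant, and takes the 1.7 million total from earlier in the log; the function name is made up.

```python
def project_search_rows(listings_done: int, rows_so_far: int, total_listings: int) -> float:
    """Extrapolate the final search table size from the progress so far."""
    rows_per_listing = rows_so_far / listings_done
    return rows_per_listing * total_listings


# Figures from the log: 350k listings processed, 3.7 million entries so far.
projected = project_search_rows(350_000, 3_700_000, 1_700_000)
print(f"about {3_700_000 / 350_000:.1f} rows per listing, "
      f"roughly {projected / 1e6:.0f} million rows in total")
```

That is about three times the earlier guess of 6 million entries.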
8:23 pm - More tweaking to the frontpage. Going to take a slight break, then regroup to figure out the two problems remaining: 1) Searching (size of table, more efficient way of doing it, etc) and 2) Geocoding.
8:24 pm - Somehow had two indexes of pname in the search db. Deleted it (or told the system to), and now the search compilation is stuck. No good!
8:28 pm - Fixed a display bug with three tabs. Search db compiler still stuck. Ruh roh.
8:29 pm - Fixed bug where searching within a postal code gave you no love.
8:32 pm - Had to restart search compilation. Bugger. Talking to Jason about a potential solution to the db size.
8:35 pm - Search db compiler stuck on fetching data. The damn index kill must have hurt it badly - ugh. It was a duplicate dammit.
8:37 pm - Added JS so that the 'what' input box is automatically active
8:38 pm - Break time. Will tackle above 2 when I can think clearly.
8:48 pm - Came into search db compiler. Stuck. Checked MySQL, the index is gone. Cursed ferociously, restarted it. 350k done, and a crapload more to go.
9:02 pm - Out of the shower, so fresh and clean. Search compiled is back on, 380k done. Me goes afk - still tird out.
11:09 pm - Went out for a break, just got back. Flourless chocolate cake at Matisse restaurant = yummy. The search db compiler is just chugging along, at roughly 680k. A pretty damn large 7.3 million rows (taking up 2.8 gigs of space).
11:13 pm - Just checking up on some stuff. Search seems to be rounding out better. Going to put the missus to bed, and then figure out the damn geolocation headaches. I don't know if the search db will be ready by 8 am tomorrow, but since the site is usable in its current form, I declare myself winnar!
11:22 pm - Some optimizations on search algorithm performed.
12:56 am - Back. Daily Show and Colbert Report are most hilarious. Search DB compiler at 910k done. Search DB size is a pretty massive 9.8 million rows. Battery recharged, all I need to do now is figure out how to do proper geocoding, and I can hit the bed.
1:25 am - Found phpcoord, it may be the answer.
1:40 am - After much crying, cursing, and wonderment - I have it! I can geocode the unique postal codes. Wahoo! Now to import. The search compiler is now upto 980k.
1:49 am - Still uploading data. 25 megs of 300+. Tested the convertor to long/lat, worked perfectly. by 2:00 am I hope to have the geocoding well under way.
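For readers curious what a Grid to lat/long conversion involves, here is a minimal sketch of the standard Ordnance Survey inverse Transverse Mercator formulas. It assumes the purchased data holds OSGB National Grid eastings and northings in metres, which the log does not spell out, and it is not the phpcoord code. The result is on the OSGB36 datum, which can sit roughly a hundred metres away from the WGS84 coordinates Google Maps uses, so a datum shift would still be needed for best accuracy.

```python
import math

# Airy 1830 ellipsoid and National Grid projection constants.
A, B = 6377563.396, 6356256.909        # semi-major / semi-minor axes (m)
F0 = 0.9996012717                      # scale factor on the central meridian
LAT0, LON0 = math.radians(49.0), math.radians(-2.0)  # true origin
N0, E0 = -100000.0, 400000.0           # grid coordinates of the true origin (m)


def grid_to_latlon(easting: float, northing: float) -> tuple:
    """Convert an OSGB easting/northing to OSGB36 latitude/longitude in degrees."""
    e2 = 1 - (B * B) / (A * A)
    n = (A - B) / (A + B)
    n2, n3 = n * n, n ** 3

    # Iterate on latitude until the meridional arc matches the northing.
    lat, m = LAT0, 0.0
    for _ in range(100):
        lat = (northing - N0 - m) / (A * F0) + lat
        ma = (1 + n + 1.25 * n2 + 1.25 * n3) * (lat - LAT0)
        mb = (3 * n + 3 * n2 + 2.625 * n3) * math.sin(lat - LAT0) * math.cos(lat + LAT0)
        mc = (1.875 * n2 + 1.875 * n3) * math.sin(2 * (lat - LAT0)) * math.cos(2 * (lat + LAT0))
        md = (35 / 24) * n3 * math.sin(3 * (lat - LAT0)) * math.cos(3 * (lat + LAT0))
        m = B * F0 * (ma - mb + mc - md)
        if abs(northing - N0 - m) < 1e-5:
            break

    sin2 = math.sin(lat) ** 2
    nu = A * F0 / math.sqrt(1 - e2 * sin2)
    rho = A * F0 * (1 - e2) / (1 - e2 * sin2) ** 1.5
    eta2 = nu / rho - 1
    t = math.tan(lat)
    t2, t4, t6 = t ** 2, t ** 4, t ** 6
    sec = 1 / math.cos(lat)

    vii = t / (2 * rho * nu)
    viii = t / (24 * rho * nu ** 3) * (5 + 3 * t2 + eta2 - 9 * t2 * eta2)
    ix = t / (720 * rho * nu ** 5) * (61 + 90 * t2 + 45 * t4)
    x = sec / nu
    xi = sec / (6 * nu ** 3) * (nu / rho + 2 * t2)
    xii = sec / (120 * nu ** 5) * (5 + 28 * t2 + 24 * t4)
    xiia = sec / (5040 * nu ** 7) * (61 + 662 * t2 + 1320 * t4 + 720 * t6)

    de = easting - E0
    lat_out = lat - vii * de ** 2 + viii * de ** 4 - ix * de ** 6
    lon_out = LON0 + x * de - xi * de ** 3 + xii * de ** 5 - xiia * de ** 7
    return math.degrees(lat_out), math.degrees(lon_out)


# Coordinates of the worked example in the Ordnance Survey guide.
lat, lon = grid_to_latlon(651409.903, 313177.270)
print(f"{lat:.5f}, {lon:.5f}")
```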
1:52 am - Lala, waiting for data to be uploaded. This is gonna take time. Passed 1 million entries processed for the search DB - over 10.6 million records now in the system.
1:56 am - 50 megs of the geocoding data uploaded. I did upload a small sample to test the importer, it worked perfectly. Can't wait to get this show on the road!
1:58 am - Silly me. Made a mistake on the upload. There are benefits of owning your own servers - now going at 100 kb/s. ETA 1 minute says wget
2:04 am - Server is going through some high load. There is a large site on this server (roughly 60k+ uniques/day, 250k+ pageviews/day), and it does its stats processing every 2 am. Oh dear. Anyway, data is uploaded, trying to run importer. Between the 300+ megs of data and the search db compiler and the big site's stat processing ... tis a bit ugly.
2:06 am - Restarted with some debug informations. The hard part is not loading the data. Nope. The hard part is taking that data, and exploding it. PHP array over 1.5 million? Yes sir :)
2:08 am - Side effect of the import script is that the search compiler has greatly slowed. If it doesn't pick up by 2:15 am I will have to do it line by line. Or could try file(). Will see.
2:10 am - Bah. Some error in my processing, invalid SQL queries. Fixing now, trying out using file().
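The general pattern for importing a file this size without building a giant in-memory array is to stream it in batches. The sketch below shows the idea under stated assumptions: `sqlite3` stands in for MySQL, the column layout (postcode, easting, northing) is a guess at what such a file might contain, and the sample rows are made up. Only the `postal_code` table name comes from the log.

```python
import csv
import io
import sqlite3
from itertools import islice
from typing import Iterable


def import_postcodes(lines: Iterable[str], conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Stream CSV rows into the postal_code table, batch_size rows at a time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postal_code "
        "(postcode TEXT PRIMARY KEY, easting INTEGER, northing INTEGER)"
    )
    reader = csv.reader(lines)
    total = 0
    while True:
        # Only one batch is ever held in memory, never the whole file.
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(
            "INSERT OR REPLACE INTO postal_code VALUES (?, ?, ?)",
            ((pc.strip(), int(e), int(n)) for pc, e, n in batch),
        )
        conn.commit()
        total += len(batch)
    return total


# Made-up sample rows; a real run would pass an open file object instead.
sample = io.StringIO("AB10 1AA,394251,806376\nAB10 1AB,394235,806529\nAB10 1AF,394181,806429\n")
conn = sqlite3.connect(":memory:")
imported = import_postcodes(sample, conn, batch_size=2)
print(imported, "rows imported")
```

Because an open file object yields one line at a time, memory use stays flat regardless of file size, and a bad row surfaces within the first batch instead of after a five minute load.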
2:16 am - Argh. Another stupid bug. Problem is it takes 5 minutes to test it out due to the damn filesize/load.
2:20 am - Still tackling. Confident once it is ifgured out we will ZOOM along
2:32 am - Okay should be fixed, running import again.
2:34 am - Importer seems to be running damn well.
2:36 am - Just checked my math. Will come out to roughly 3 million records. Only upto 50k. Gonna go to bed now, will wake up early enough tomorrow to integrate and relish :)
2:58 am - Finally heading to bed. 280k entries already into the postal_code db. Search DB is at 1.05 million. I don't know if it will be done in 5 hours, but it will definitely be usable and live.
7:24 am - Back on!
7:24 am - Problems in Krondor! The postal code lookup db finished at 1.8 million records. The search DB is massive with 11.8 million records (4.5 gigs).
7:28 am - Oops. Made a mistake in the postal code lookup (spaces). Updating now, its flying baby!
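Spacing is a classic source of failed postcode lookups, since "SW1A1AA" and "SW1A 1AA" are the same code but not the same string. The log does not say exactly what the mistake was, so the sketch below only shows one common fix: normalise both the lookup table and the listing data to a single canonical form. The function name is made up.

```python
def normalize_postcode(raw: str) -> str:
    """Return a postcode in canonical 'OUTWARD INWARD' form, e.g. 'SW1A 1AA'."""
    compact = "".join(raw.split()).upper()
    if len(compact) < 5:
        # Too short to be a full postcode; leave partial codes unsplit.
        return compact
    # The inward part of a full UK postcode is always its last three characters.
    return f"{compact[:-3]} {compact[-3:]}"


for raw in ["sw1a1aa", " SW1A  1AA ", "M11AE"]:
    print(repr(raw), "->", normalize_postcode(raw))
```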
7:31 am - 400,000 postal codes updated. Another million to go.
7:39 am - 1.5 million postal codes processed.
7:50 am - Success! 'restaurants in london' now generates properly located results. Checked them with Google, pretty much spot on. The page takes <1.0 seconds to generate, which is fast than enough for me. Hurrah! The local search engine is complete!
7:51 am - Okay, it is 99% complete. The search table generation screwed up somewhere, making it very very slow, but with 1.7 million business listings in there, did something right :)
7:57 am - Ever so slight error in the serach query, fixed. Whew :D
8:01 am - I declare myself winner.
Would I do this again? Definitely.
-Ahmed Farooq
