You can reduce or increase the sensitivity of these relations with the percent threshold.

,IFNULL(ANY_VALUE(IF(tag2='javascript',1,null)),0) Xjavascript
,IFNULL(ANY_VALUE(IF(tag2='python',1,null)),0) Xpython
,IFNULL(ANY_VALUE(IF(tag2='android',1,null)),0) Xandroid
,IFNULL(ANY_VALUE(IF(tag2='jquery',1,null)),0) Xjquery
FROM `deleting.stack_overflow_tag_co_ocurrence`

Now, instead of using this small table, let's use the whole table to compute k-means with BigQuery. With this line, I'm creating a one-hot encoding string that I can use later to define the 4,000+ columns I'll use for k-means:

FORMAT("IFNULL(ANY_VALUE(IF(tag2='%s',1,null)),0)X%s", tag2, REPLACE(REPLACE(REPLACE(REPLACE(tag2,'-','_'),'.','D'),'#','H'),'+','P'))

And training a k-means model in BigQuery is really easy:

CREATE MODEL `deleting.kmeans_tagsubtag_50_big_a_01`

Now we wait, while BigQuery shows us the progress of our training. And when it's done, we even get an evaluation of our model.

Do we really need 4,000 one-hot encoded dimensions to obtain better clusters? It turns out that 500 are enough, and I like the results better. It also reduces the time for training the model in BigQuery from 24 minutes to 3. The same with only 30 dimensions lowers the time to 90 seconds, but I like the results better with 500.

With 500 one-hot encoded dimensions, the time per iteration drops to 30 seconds, and the loss is lower. More on hyper-parameter tuning below.

Get ready for the results: The 50 clusters are…

I'll even look out for some tags I'm interested in: how are google, amazon, and azure represented in each cluster?

These are the 50 groups that k-means clustering found, given the one-hot encoding of related tags we did earlier in this post. Some results make a lot of sense, while others give great insight into which technologies most commonly surround any Stack Overflow tag. Naming each centroid is always a challenge.
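To make the FORMAT expression above easier to follow, here is a minimal Python sketch of the same string-building step. It is only an illustration of what that one line produces for each tag, not the code used in this post: the function names `column_alias` and `one_hot_expression` are made up for this example, and it assumes tags never contain a single quote.

```python
# Hypothetical helpers mirroring the BigQuery FORMAT(...) line from this post.
# The names below are illustrative assumptions, not part of BigQuery.

def column_alias(tag: str) -> str:
    """Turn a tag into a legal column name, like the nested REPLACE calls:
    '-' becomes '_', '.' becomes 'D', '#' becomes 'H', '+' becomes 'P',
    and the result is prefixed with 'X'."""
    for old, new in (("-", "_"), (".", "D"), ("#", "H"), ("+", "P")):
        tag = tag.replace(old, new)
    return "X" + tag


def one_hot_expression(tag: str) -> str:
    """Build the one-hot column definition for a single related tag,
    using the same template string as the FORMAT call."""
    return "IFNULL(ANY_VALUE(IF(tag2='%s',1,null)),0)%s" % (tag, column_alias(tag))


def one_hot_columns(tags: list[str]) -> str:
    """Join one expression per tag into a comma-led column list."""
    return "\n".join("," + one_hot_expression(t) for t in tags)


if __name__ == "__main__":
    print(one_hot_columns(["javascript", "c#", "c++", "asp.net-mvc"]))
```

The character substitutions matter because tags such as c#, c++ or asp.net-mvc are not valid column names as written, so each one needs a sanitized alias before it can become one of the one-hot dimensions.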