<p><em>Brian Austin · Long on data science. The website and portfolio page of Brian Austin.</em></p>
<h2>CFPB enforcement actions spike consumer complaints, but changes aren’t permanent</h2>
<p><em>2017-06-18 · Brian Austin</em></p>
<p>When consumer complaints come to the <a href="https://www.consumerfinance.gov/">Consumer Financial Protection Bureau</a> (CFPB) complaints database, consumers get the chance to vent their spleen about <a href="../cfpb-topic-modeling">whatever happened to them</a>. As part of these complaint records, they also name the banks that caused the problems. With this information, we can see what happens when a big bank gets a big fine.</p>
<p>And in September 2016, Wells Fargo got a big fine.</p>
<p>You may remember this fine. The CFPB laid down the law with a <a href="https://www.consumerfinance.gov/about-us/newsroom/consumer-financial-protection-bureau-fines-wells-fargo-100-million-widespread-illegal-practice-secretly-opening-unauthorized-accounts/" title="The total fine was even more, with an additional $85 million levied between another federal regulator and Los Angeles county government.">$100 million fine</a> after consumers reported that Wells Fargo employees were opening fake credit card accounts in their names in order to meet internal quotas. When consumers heard about this, they became about twice as likely to complain each day.</p>
<p><img src="../images/cfpb/wells_bokeh_plot.png" alt="" /></p>
<p>…for a while.</p>
<p>In the four months prior to the fine, the CFPB registered <strong>3,090 total complaints</strong> about Wells Fargo. In the four months after the fine was levied, complaints jumped 58 percent to <strong>4,896 complaints</strong>. But by the end of the year, as the graph shows, the average complaint frequency dipped down to its prior levels.</p>
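<p>The before-and-after comparison is a simple date-window count. Here is a minimal sketch in pandas, assuming a DataFrame with <code>company</code> and <code>date_received</code> columns (the field names and the September 8, 2016 fine date are assumptions for illustration; the complaint records themselves are invented):</p>

```python
import pandas as pd

# Illustrative complaint records; the real data comes from the CFPB database.
complaints = pd.DataFrame({
    "company": ["Wells Fargo"] * 6,
    "date_received": pd.to_datetime([
        "2016-06-01", "2016-07-15", "2016-08-30",   # before the fine
        "2016-09-20", "2016-11-02", "2017-01-05",   # after the fine
    ]),
})

fine_date = pd.Timestamp("2016-09-08")  # assumed enforcement date
window = pd.Timedelta(days=120)         # roughly four months

wf = complaints[complaints["company"] == "Wells Fargo"]
before = wf[(wf["date_received"] >= fine_date - window) &
            (wf["date_received"] < fine_date)].shape[0]
after = wf[(wf["date_received"] >= fine_date) &
           (wf["date_received"] < fine_date + window)].shape[0]

pct_change = 100 * (after - before) / before
print(before, after, round(pct_change, 1))
```

<p>Run on the full complaint export, the same windowed counts produce the 3,090 vs. 4,896 comparison above.</p>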
<p>Beyond the sheer numbers of complaints registered against Wells Fargo, however, there isn’t a major shift in topics of the complaints that consumers made to the CFPB.</p>
<p>The graph below on the left demonstrates the distribution of <a href="../cfpb-topic-modeling">topics</a> over the entire time the CFPB has been collecting those complaints. The other two graphs demonstrate the most frequent topics of complaints in the <em>four months prior</em> to the enforcement action being taken, and then their distribution in the <em>four months after</em> the enforcement action. As we covered earlier, four months was about the period of time in which the frequency of complaints had regressed to the same level as it was prior to enforcement.</p>
<p><br /></p>
<h4>Wells Fargo complaint topics</h4>
<p><img src="../images/cfpb/wells_topics_4_mo.png" alt="" /></p>
<p>The post-enforcement topics largely mirror the pre-enforcement topics - there isn’t a large shift in conversation to entirely different topics. But three things are different:</p>
<ol>
<li>The topic most frequently represented changed from topic 17 to topic 31.</li>
<li>There are four more topics represented after enforcement than there were prior to enforcement. New topics post-enforcement include: 5, 44, 13, 18, 26, 39, and 1. Topics that were represented before but not after enforcement include: 10, 21, and 41.</li>
<li>Topics that saw increases after enforcement tended to be those that included words about contracts and agreements, which is consistent with more complaints about the types of financial company abuses that prompted the CFPB’s enforcement action.</li>
</ol>
<p><strong>Selected topics from Wells Fargo complaints</strong></p>
<table>
<thead>
<tr>
<th style="text-align: center">Topic No.</th>
<th style="text-align: left">Words in topic</th>
<th style="text-align: center">Count prior</th>
<th style="text-align: center">Count after</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">31</td>
<td style="text-align: left"><em>acct, pmi, remove, block, thru, bayview, dob, child, I’ve, dear</em></td>
<td style="text-align: center">37</td>
<td style="text-align: center">47</td>
</tr>
<tr>
<td style="text-align: center">17</td>
<td style="text-align: left"><em>boa, box, equip, internet, service, phh, return, install, west, cable</em></td>
<td style="text-align: center">55</td>
<td style="text-align: center">38</td>
</tr>
<tr>
<td style="text-align: center">27</td>
<td style="text-align: left"><em>associ, recovery, new, portfolio, old, expire, type, unknown, york, service</em></td>
<td style="text-align: center">33</td>
<td style="text-align: center">23</td>
</tr>
<tr>
<td style="text-align: center">14</td>
<td style="text-align: left"><em>party, third, agreement, term, page, consent, condition, disclose, site, exhibit</em></td>
<td style="text-align: center">11</td>
<td style="text-align: center">16</td>
</tr>
<tr>
<td style="text-align: center">5</td>
<td style="text-align: left"><em>chase, charge, refund, discover, dispute, claim, cancel, merchant, transaction, purchase</em></td>
<td style="text-align: center">0</td>
<td style="text-align: center">3</td>
</tr>
<tr>
<td style="text-align: center">44</td>
<td style="text-align: left"><em>contract, sign, son, signature, cancel, asset, manage, account, consult, never</em></td>
<td style="text-align: center">0</td>
<td style="text-align: center">3</td>
</tr>
</tbody>
</table>
<p><br /> <br />
In some cases, the topics covered include words similar to those that would be common with other bank or credit-card transaction issues.</p>
<p>One significant limitation to this analysis is that only about 10% of the Wells Fargo complaints have corresponding narrative descriptions, which is where the topics arise. Just looking at the total number that included a narrative, there were 238 complaints prior to enforcement, and 218 after. This may partially explain why there was limited change in the topics discussed after the enforcement action. While many more people claimed harm, approximately as many filled in the details of their stories.</p>
<p>A look at the products that drove the complaints shows a huge shift for the “Bank account or service” product category, shown here in dark red, in the four months after enforcement. The lighter red indicates credit card complaints, which more than doubled over the same time period.</p>
<p><img src="../images/cfpb/wells_products_4_mo.png" alt="" /></p>
<h2>Topic modeling complaints to the CFPB</h2>
<p><em>2017-06-01 · Brian Austin</em></p>
<p>The <a href="https://www.consumerfinance.gov/">Consumer Financial Protection Bureau</a> (CFPB) collects consumer complaints against financial companies, as I’ve <a href="https://austinbrian.github.io/blog/cfpb-complaints-inauguration/">discussed in this space</a>. In trying to address the problems identified in these complaints, whether by the CFPB or the company itself, it’s important to understand what type of product is causing the problem. Teams work on products, and the issues across different types of financial products can vary significantly.</p>
<p>But the distinction between “My bank is opening up credit card applications I didn’t ask for” and “my credit card company keeps calling my house, but I haven’t missed a payment” is subtle, and the sort of thing that doesn’t lend itself to easy product classifications.</p>
<p>You can’t just count the number of times the term “bank” is said, or “credit card” is said and get that this is about credit cards. The first is really about the bank’s service, not the credit cards themselves. The second is really about a debt collector calling someone’s home repeatedly.</p>
<p>To better understand what people are talking about in these complaints, I performed a modeling technique called <strong>Latent Dirichlet Allocation</strong>.</p>
<p>Here’s a brief introduction to how it works.</p>
<ol>
<li>The words in each complaint are converted to simplified versions of the words (i.e., both “property” and “properties” become “properti”) and given unique identifier numbers.</li>
<li>Numbers that appear together in the same complaint documents across the entire set of complaints are grouped into “topics”. These topics typically include words from multiple documents, and the words in a single document are likely to appear in multiple topics. Each word in a topic is given a weighting, indicating how “important” that word is to the topic - how many times it is mentioned across documents in conjunction with the other words.</li>
</ol>
<p>The chart below demonstrates how words in documents, on the left, match up with topics across the entire set of documents, on the right.</p>
<p><img src="/images/conceptual_topic_modeling.png" alt="" /></p>
<p>The first topic is primarily about getting money back, and how consumers interact with the customer service gatekeepers who can do that. The second is all about making mortgage or other home payments (e.g., escrow, taxes). And the third is a little more opaque, with words that describe the account, and the way that interactions go. Not all topics are readily interpretable.</p>
<p>I tested out a series of the topic models to see what size might be a good descriptor, and landed on a topic size of 50. The graphic below shows the relationship between topics and the words that comprise them. Bigger circles indicate topics that are more common.</p>
<details><summary>More graph details</summary> The topics are organized by plotting them on the axes of the two linear combinations that best describe their features - a method known as <a href="https://medium.com/towards-data-science/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c" title="Really excellent overview of the technicals behind PCA">Principal Component Analysis</a>. The blue bars on the right-hand side indicate the frequency of a word across all the documents in the entire dataset. The red bar indicates how frequent the terms are within a topic. The relevance metric λ is a representation of the relative exclusivity of a term - higher values are more frequent, less exclusive, and lower values are more exclusive, but may be more idiosyncratic.<br />
<em>The LDA visualization tool used for this project was developed by <a href="http://www.kennyshirley.com/LDAvis/">Carson Sievert and Kenny Shirley</a></em>.
</details>
<p><br /> <br />
<span align="center">
<a href="../images/cfpb/model_50_topics_graphic.html">
<img src="../images/cfpb/lda_50_topic_static_topic10.png" /></a>
</span></p>
<p><em>Click the image to see how all the topics are distributed</em></p>
<p>The topic highlighted here (just called #10) groups together words about debt collectors. The right-hand side includes the words that are common in each topic (the red bars) and the word frequencies across the entire dataset (blue bars).</p>
<p>This type of clustering is especially convenient for large datasets, where it’s impractical to read every one of the entries. Instead, you can look at the topics, and then check the distribution of those topics, especially where they are limited to certain groupings.</p>
<p>In the next iteration, I’ll be looking at the way these topics change depending on the companies and products that the complaints are about.</p>
<h2>Donald Trump’s inauguration prompted the most CFPB complaints ever</h2>
<p><em>2017-05-24 · Brian Austin</em></p>
<p>The <a href="https://www.consumerfinance.gov/">Consumer Financial Protection Bureau</a> (CFPB) has been around since 2011. The organization is charged with policing financial services companies to ensure that they are treating consumers fairly. One of the ways it gathers information on consumers’ experiences is by tallying complaints of wrongdoing against companies.</p>
<p>And for Donald Trump’s Inauguration Day, there were more complaints to the CFPB than there have been on any other day.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/Total_cfpb_complaints_5_24.png" alt="" title="That orange spike there is You-Know-Who" /></p>
<p>While complaints have increased steadily over the CFPB’s existence, the 24 hours before Trump’s inauguration saw more than twice as many reports to the CFPB as any preceding day. There were more than 1,500 individual complaints on both Inauguration Day and the day before it.</p>
<p>The CFPB houses a massive amount of data in its <a href="https://data.consumerfinance.gov/dataset/Consumer-Complaints/s6ew-h6mp">complaints database</a>, which is open (and uses an exceptionally well-documented <a href="https://dev.socrata.com/foundry/data.consumerfinance.gov/jhzv-w97w">API, courtesy of Socrata</a>). One particularly interesting way that this data lines up is around various types of products that people have complained about. For instance, over the course of the CFPB’s existence, mortgages and debt collection have been the two top issues that consumers have complaints about.</p>
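<p>Pulling the data looks roughly like the request below, a sketch against the Socrata SODA endpoint linked above. The dataset id comes from the post’s API link, but treat the filter field names and the company string as assumptions rather than the dataset’s confirmed schema:</p>

```python
import urllib.parse

# SODA endpoint for the CFPB complaints dataset (id from the post's API link).
base = "https://data.consumerfinance.gov/resource/jhzv-w97w.json"

# SoQL query: complaints about one company in a date window.
# Field and company names here are illustrative assumptions.
params = {
    "$select": "date_received,product,company",
    "$where": "company = 'WELLS FARGO & COMPANY' "
              "AND date_received >= '2016-05-08'",
    "$limit": "50000",
}
url = base + "?" + urllib.parse.urlencode(params)
print(url)
# Fetching would then be e.g.:
#   import requests; rows = requests.get(url).json()
```

<p>Socrata’s <code>$select</code>/<code>$where</code>/<code>$limit</code> parameters make it easy to slice by product or date without downloading the full database.</p>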
<p>But on Inauguration Day, more than 60% of complaints were about student loans.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/obama_trump_prod_complaints.png" alt="" /></p>
<p>I’m especially interested in how these complaints demonstrate the ways people talk about their mistreatments, and the way those translate to actions. I’ll be pursuing those questions in the following days as part of a larger project.</p>
<h2>In terms of research dollars, Johns Hopkins stands alone</h2>
<p><em>2017-05-22 · Brian Austin</em></p>
<p>Institutions of higher education are constantly comparing themselves to one another, and one of the ways they do so is by their total amount of research funding.</p>
<p>But total amount of funding is, at best, a blunt instrument. The National Center for Science and Engineering Statistics (part of the National Science Foundation) releases an annual survey of these research dollars, and breaks them down <a href="https://ncsesdata.nsf.gov/herd/2015/">all sorts of ways</a>, including <a href="https://ncsesdata.nsf.gov/herd/2015/html/HERD2015_DST_18.html">by field</a>.
<br /></p>
<h3>Data</h3>
<p>This table is a selection of the 640 universities that responded to the survey for 2015. <em>Rank</em> indicates the university’s relative positioning to other institutions in terms of total research funding. Johns Hopkins has the most funding, and comes in with nearly twice as much as the next-highest institution (the University of Michigan - Ann Arbor).</p>
<p>The other variable names correspond with research funding received by various fields in 2015 (<em>measured in thousands of dollars</em>).</p>
<table>
<thead>
<tr>
<th style="text-align: center">Rank</th>
<th>Institution</th>
<th>Environmental sciences</th>
<th>Life sciences</th>
<th>Math and computer sciences</th>
<th>Physical sciences</th>
<th>Psychology</th>
<th>Social sciences</th>
<th>Sciences, nec</th>
<th>Engineering</th>
<th>All non-S&E fields</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1</td>
<td>Johns Hopkins U.</td>
<td>31854</td>
<td>867715</td>
<td>171205</td>
<td>167009</td>
<td>3663</td>
<td>11034</td>
<td>54640</td>
<td>991937</td>
<td>6622</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td>U. Michigan, Ann Arbor</td>
<td>14609</td>
<td>779922</td>
<td>25434</td>
<td>52449</td>
<td>21989</td>
<td>149805</td>
<td>1627</td>
<td>254505</td>
<td>68938</td>
</tr>
<tr>
<td style="text-align: center">24</td>
<td>Georgia Institute of Technology</td>
<td>19068</td>
<td>19879</td>
<td>113353</td>
<td>47279</td>
<td>7431</td>
<td>9132</td>
<td>7645</td>
<td>533329</td>
<td>8254</td>
</tr>
<tr>
<td style="text-align: center">28</td>
<td>U. Southern California</td>
<td>20051</td>
<td>411987</td>
<td>93765</td>
<td>16924</td>
<td>9935</td>
<td>27941</td>
<td>327</td>
<td>69527</td>
<td>40574</td>
</tr>
<tr>
<td style="text-align: center">32</td>
<td>U. Illinois, Urbana-Champaign</td>
<td>7214</td>
<td>220029</td>
<td>114512</td>
<td>67182</td>
<td>17276</td>
<td>21340</td>
<td>5000</td>
<td>161458</td>
<td>25806</td>
</tr>
<tr>
<td style="text-align: center">89</td>
<td>Carnegie Mellon U.</td>
<td>348</td>
<td>11212</td>
<td>109026</td>
<td>14162</td>
<td>7757</td>
<td>6791</td>
<td>3479</td>
<td>89054</td>
<td>175</td>
</tr>
</tbody>
</table>
<p><br /></p>
<h3>Initial Exploration</h3>
<p>I wanted to look beyond these total figures and see how unsupervised machine learning would classify these schools compared to one another. I started by examining total research dollars for various sets of fields against one another. Scatter plots are a useful way to see that information. Here, darker points indicate greater total research dollars.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/es_engineering_rd.png" alt="" /></p>
<p>Most plots are highly clustered around 0, with bands of institutions fanning out. In some cases, such as engineering, institutions with a greater concentration of students are outliers, but don’t necessarily show a high degree of difference between them.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/psy_lifesci_rd.png" alt="" /></p>
<p>Psychology and Life Sciences, however, have a data distribution that allows us to see some of the spread of the data, which makes them a decent place to start visualizing the way some of these schools might cluster together.
<br /></p>
<h3>Density clustering</h3>
<p>These data points seem relatively tightly grouped, and I’ll start with DBSCAN, a density-based clustering algorithm that accounts for noise as part of its algorithm. <em>For an exceptional visualization of DBSCAN, check out <a href="https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/">Naftali Harris’s blog</a>.</em></p>
<p>I’ll use the same plot of Psychology vs Life Sciences that looked promising before, but colorized by the clusters that DBSCAN assigns.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/psy_lifsci_dbscan.png" alt="" /></p>
<p>Yikes. This shows only 2-3 clusters, and they don’t seem particularly useful in describing the data. But eyeballing it isn’t always a great way to judge an unsupervised learning algorithm, so I’ll check the numbers.</p>
<ul>
<li><strong>Estimated number of clusters:</strong> 3, as we could <em>just</em> make out <br /></li>
<li><strong>Homogeneity:</strong> 0.114<br />
<ul>
<li>Ouch - our clusters heavily overlap with one another; the best score is 1</li>
</ul>
</li>
<li><strong>Completeness:</strong> 1.000 <br />
<ul>
<li>All the members of a given class are in the same cluster - this is to be expected since we have such unbalanced classes</li>
</ul>
</li>
<li><strong>V-measure:</strong> 0.205 <br />
<ul>
<li>The harmonic mean between homogeneity and completeness, higher is better</li>
</ul>
</li>
<li><strong>Silhouette Coefficient:</strong> -0.074 <br />
<ul>
<li>This score shows how distinct clusters are from one another. Scores near zero indicate overlapping clusters (<em>bad</em>) and negative scores indicate incorrect clustering (<em>worse</em>).</li>
</ul>
</li>
</ul>
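<p>The fit and all four scores above come straight out of scikit-learn. Here is a minimal sketch on synthetic two-column data standing in for the Psychology/Life Sciences dollars (the blob shapes and DBSCAN parameters are illustrative assumptions, not the values used in the post):</p>

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score, silhouette_score)

rng = np.random.default_rng(0)
# Stand-in for (psychology $, life sciences $): one dense blob near 0
# plus a sparser band, mimicking the real distribution.
X = np.vstack([rng.normal(0, 1, (80, 2)),
               rng.normal(8, 3, (20, 2))])
true_labels = np.array([0] * 80 + [1] * 20)

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters)
print("homogeneity:", homogeneity_score(true_labels, labels))
print("completeness:", completeness_score(true_labels, labels))
print("v-measure:", v_measure_score(true_labels, labels))
if len(set(labels)) > 1:  # silhouette needs at least two label values
    print("silhouette:", silhouette_score(X, labels))
```

<p>The homogeneity/completeness pair needs reference labels to compare against; the silhouette coefficient does not, which is why it is often the more honest check for purely unsupervised work like this.</p>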
<p>So on the whole, this doesn’t seem like a very good algorithm to use. The data is too dense to create distinct cluster values.
<br /></p>
<h3>Hierarchical clustering</h3>
<p>To improve our approach, it would be useful to be able to see how many clusters we would have at various levels. Fortunately, there is a useful approach called hierarchical clustering that builds agglomerative groupings as data points are more similar to one another.</p>
<p>This results in a tree-type diagram that shows the “distance” (here measured in research dollars) between two data points.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/dendrogram_he_rd.png" alt="" /></p>
<p>On first inspection, there look to be a couple of good places to break up our clusters.</p>
<ul>
<li>If we want a very simple way to break down groups, we could break it at ~$1.5 million, which would separate the data into three total groups, though one would include far more data points than the others. This is basically what the DBSCAN algorithm did.</li>
<li>We could do it at around $1,000,000, which would break up the green data points into two groups, and would break up the red grouping into what appears to be four clusters.</li>
<li>For more granularity, a break point at $500,000 appears to create around 10 clusters, though the distinction is hard to see with this diagram.</li>
</ul>
<p>For the purposes of grouping schools together, I think that middle class of clusters is going to create the most useful separation, so I’ll create clusters at the $1,000,000 mark.</p>
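<p>Building the tree and cutting it at that mark is a two-call job in SciPy. A sketch with made-up funding rows (the Ward linkage method and the random stand-in data are assumptions; the cut threshold mirrors the $1,000,000 mark chosen above):</p>

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Stand-in rows of per-field research dollars, one row per school.
X = rng.integers(0, 900_000, size=(12, 9)).astype(float)

# Agglomerative (Ward) linkage builds the dendrogram bottom-up.
Z = linkage(X, method="ward")

# Cutting the tree at a fixed distance yields flat cluster labels,
# analogous to drawing a horizontal line at the $1,000,000 mark.
labels = fcluster(Z, t=1_000_000, criterion="distance")
print(labels)
```

<p>Lowering <code>t</code> reproduces the finer ~10-cluster breakdown; raising it toward ~$1.5 million collapses back to the three coarse groups.</p>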
<p>And here is what the Psychology vs Life Sciences plot looks like, with different colors indicating cluster groupings.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/psy_lifsci_clustered.png" alt="" /></p>
<p>Hey! It looks like there are distinct bands of colors in there! This looks like something we can use to better understand how universities are different. Let’s look at one more.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/es_life_sci_clustered.png" alt="" /></p>
<p>Again, a useful banding, and one that is more prominent for the Life Sciences dimension than for the Environmental Sciences one.</p>
<p>Not all the comparisons will look as robust, but that’s part of why the clustering itself is good - this can be difficult to eyeball.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/math_cs_clustered.png" alt="" />
<br /></p>
<h3>Final cluster insights</h3>
<p>Below I have included a table of the final cluster groups. The big takeaway is that Johns Hopkins receives so much R&D funding that it stands alone as an institution. It is a cluster unto itself. It is the übercluster.</p>
<p>The next cluster includes a number of state universities, many of which fund a <a href="https://raw.githubusercontent.com/austinbrian/austinbrian.github.io/master/assets/UNCCH_budget.png" title="23% in 2012 at UNC-Chapel Hill">significant portion of their operating budgets</a> from federal grant money.</p>
<p>The next group of institutions includes a dozen largely technical institutions or large medical centers (as well as the University of California - Berkeley). Following that are smaller research-focused institutions, then larger institutions with less research focus, and finally a large group of primarily access-oriented institutions.</p>
<p><em>A detailed walkthrough of this analysis is available on my <a href="https://github.com/austinbrian/DSI-labs/blob/master/Higher%20Ed%20R%26D%20Analysis.ipynb">github</a>, with a more reader-friendly view on <a href="http://nbviewer.jupyter.org/github/austinbrian/DSI-labs/blob/master/Higher%20Ed%20R%26D%20Analysis.ipynb">NBViewer</a>.</em></p>
<p>Here is the full list of how these clusters break out.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Cluster Size</th>
<th style="text-align: left">Institutions</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: left">Johns Hopkins U.</td>
</tr>
<tr>
<td style="text-align: center">28</td>
<td style="text-align: left">U. California, San Diego; U. Minnesota, Twin Cities; U. Wisconsin-Madison; Ohio State U.; Stanford U.; U. California, Los Angeles; Columbia U. in the City of New York; U. Michigan, Ann Arbor; U. North Carolina, Chapel Hill; U. Washington, Seattle; Cornell U.; Duke U.; U. Texas M. D. Anderson Cancer Center; U. Pennsylvania; Harvard U.; U. California, Davis; Yale U.; U. Florida; U. Pittsburgh, Pittsburgh; Washington U., Saint Louis; Northwestern U.; Emory U.; Vanderbilt U.; U. Alabama, Birmingham; U. California, San Francisco; Baylor C. of Medicine; Icahn School of Medicine at Mt. Sinai; U. Texas Southwestern Medical Center</td>
</tr>
<tr>
<td style="text-align: center">12</td>
<td style="text-align: left">U. Illinois, Urbana-Champaign; Georgia Institute of Technology; Massachusetts Institute of Technology; Pennsylvania State U., University Park and Hershey Medical Center; North Carolina State U.; Virginia Polytechnic Institute and State U.; Purdue U., West Lafayette; Texas A&M U., College Station and Health Science Center; Michigan State U.; U. California, Berkeley; U. Arizona; SUNY, Polytechnic Institute</td>
</tr>
<tr>
<td style="text-align: center">28</td>
<td style="text-align: left">U. Southern California; U. Utah; Indiana U., Bloomington; U. Chicago; Rutgers, State U. New Jersey, New Brunswick; New York U.; U. Illinois, Chicago; SUNY, U. Buffalo; U. Georgia; Rockefeller U.; U. South Florida, Tampa; Boston U.; U. Virginia, Charlottesville; U. Rochester; U. Kentucky; U. Iowa; U. Miami; U. Cincinnati; Case Western Reserve U.; U. Colorado Denver and Anschutz Medical Campus; U. Maryland, Baltimore; Scripps Research Institute; Uniformed Services U. of the Health Sciences; Oregon Health and Science U.; Yeshiva U.; U. Massachusetts, Medical School; Medical U. South Carolina; U. Texas Health Science Center, Houston</td>
</tr>
<tr>
<td style="text-align: center">75</td>
<td style="text-align: left">Carnegie Mellon U.; U. Texas, Austin; U. Maryland, College Park; Brown U.; U. Central Florida; U. Tennessee, Knoxville; U. Hawaii, Manoa; U. Massachusetts, Amherst; Arizona State U.; Princeton U.; Iowa State U.; SUNY, Stony Brook U.; U. California, Irvine; Rice U.; Florida International U.; U. Nebraska, Lincoln; Rensselaer Polytechnic Institute; U. Notre Dame; U. California, Santa Barbara; Mississippi State U.; Florida State U.; New Jersey Institute of Technology; California Institute of Technology; U. Colorado Boulder; Oregon State U.; Clemson U.; U. Houston; George Washington U.; SUNY, U. Albany; Texas Tech U.; Colorado State U., Fort Collins; U. Delaware; Washington State U.; Drexel U.; Kansas State U.; Tufts U.; U. New Mexico; Dartmouth C.; U. California, Riverside; Louisiana State U., Baton Rouge; U. Connecticut; Temple U.; North Dakota State U.; U. Missouri, Columbia; Oklahoma State U., Stillwater; Wayne State U.; U. California, Santa Cruz; U. South Carolina, Columbia; U. Kansas; Utah State U.; U. Idaho; Georgetown U.; U. Dayton; New Mexico State U.; Tulane U.; U. Oklahoma, Norman and Health Science Center; Virginia Commonwealth U.; West Virginia U.; U. Vermont; U. Louisville; U. Mississippi; Auburn U., Auburn; Wake Forest U.; U. Arkansas, Fayetteville; Woods Hole Oceanographic Institution; Medical C. Wisconsin; U. Texas Medical Branch; U. Texas Health Science Center, San Antonio; U. Nebraska, Medical Center; U. Arkansas for Medical Sciences; Thomas Jefferson U.; Cold Spring Harbor Laboratory; Rush U.; Georgia Regents U.; U. Tennessee, Health Science Center</td>
</tr>
<tr>
<td style="text-align: center">496</td>
<td style="text-align: left">U. Alabama, Huntsville; George Mason U.; U. Alabama, Tuscaloosa; Northeastern U.; U. Louisiana, Lafayette; U. Texas, Dallas; U. Maryland, Baltimore County; Naval Postgraduate School; Air Force Institute of Technology; SUNY, Binghamton U.; U. Texas, El Paso; San Diego State U.; U. Texas, Arlington; Worcester Polytechnic Institute; U. North Texas, Denton; U. North Carolina, Charlotte; DePaul U.; Wright State U.; Syracuse U.; U. Oregon; U. Texas, San Antonio; Rochester Institute of Technology; U. Massachusetts, Lowell; U.S. Air Force Academy; Louisiana Tech U.; Michigan Technological U.; Indiana U.-Purdue U., Indianapolis; U. Tulsa; U. Memphis; Brandeis U.; Georgia State U.; Boise State U.; Illinois Institute of Technology; U. Massachusetts, Boston; North Carolina Agricultural and Technical State U.; Brigham Young U., Provo; U. Nebraska, Omaha; Toyota Technological Institute, Chicago; Kent State U.; Stevens Institute of Technology; Texas State U.; California State U., San Bernardino; Portland State U.; C. of William and Mary and Virginia Institute of Marine Science; CUNY, City C.; Missouri U. of Science and Technology; Old Dominion U.; Jackson State U.; Southern Methodist U.; U. Nevada, Reno; U. Alaska, Fairbanks; San Francisco State U.; U. Wisconsin-Milwaukee; U. New Hampshire; U. Missouri, Kansas City; U. Wyoming; U.S. Naval Academy; Texas A&M U.-Corpus Christi; U. North Carolina, general administration; U. Arkansas, Little Rock; Harvey Mudd C.; U. California, Office of the President; U. Colorado Colorado Springs; Lehigh U.; U.S. Military Academy; U. New Orleans; American U.; Delaware State U.; Montana State U., Bozeman; Baylor U.; Villanova U.; Bryn Mawr C.; U. Massachusetts, Dartmouth; Alabama A&M U.; Northern Arizona U.; CUNY, Queens C.; Florida Institute of Technology; U. South Alabama; U. Nevada, Las Vegas; Gallaudet U.; Bowie State U.; Marquette U.; U. 
Tennessee, Chattanooga; California Polytechnic State U., San Luis Obispo; Boston C.; Tennessee Technological U.; Tuskegee U.; Desert Research Institute; Clarkson U.; Ball State U.; Florida Atlantic U.; Creighton U.; Rutgers, State U. New Jersey, Newark; U. Maine; Howard U.; U. Puerto Rico, Mayaguez; California State U., Northridge; Dakota State U.; Elizabeth City State U.; U. California, Merced; Southern Illinois U., Carbondale; Illinois State U.; Norfolk State U.; CUNY, system office; Smith C.; Western Washington U.; U. Puerto Rico, Rio Piedras; Williams C.; South Dakota State U.; Texas Southern U.; Western Michigan U. and Homer Stryker M.D. School of Medicine; Morgan State U.; U. Montana, Missoula; U. Akron; Hampton U.; Florida A&M U.; Miami U.; California State U., Bakersfield; U. South Dakota; U. Texas, Brownsville; Sam Houston State U.; C. Charleston; CUNY, Hunter C.; U. Texas Pan American; Oakland U.; Stephen F. Austin State U.; Loyola U., Chicago; Willamette U.; Northern Illinois U.; Wellesley C.; Fordham U.; Towson U.; U. Central Arkansas; U. Minnesota, Duluth; Calvin C.; California State U., San Marcos; U. Houston-Downtown; Tennessee State U.; Pennsylvania State U., Harrisburg; Pace U.; U. Rhode Island; Columbia U., Teachers C.; California State U., Monterey Bay; U. Southern Mississippi; U. Denver; Trinity C., Hartford; Colorado School of Mines; U. Metropolitana; Idaho State U.; U. Southern Maine; U. Northern Colorado; CUNY, John Jay C. of Criminal Justice; Montclair State U.; Prairie View A&M U.; Pomona C.; Carleton C.; Duquesne U.; California State U., Fresno; Vassar C.; Xavier U. Louisiana; St. Olaf C.; Cleveland State U.; U. Arkansas, Pine Bluff; Marist C.; San Jose State U.; CUNY, Brooklyn C.; California State U., Sacramento; U. North Carolina, Wilmington; Fayetteville State U.; New School; Fairfield U.; Mount Holyoke C.; Alcorn State U.; Wesleyan U.; Lewis and Clark C.; Rutgers, State U. New Jersey, Camden; U. 
North Carolina, Greensboro; Arkansas State U., Jonesboro; Saint Louis U.; U. North Florida; Central Connecticut State U.; Central Michigan U.; CUNY, C. Staten Island; Trinity U.; Embry-Riddle Aeronautical U.; Southern Illinois U., Edwardsville; U. Washington, Bothell; Southern U. and A&M C., Baton Rouge; Amherst C.; California State U., Channel Islands; U. North Dakota; Lafayette C.; Purdue U., Calumet; Loyola Marymount U.; Clark Atlanta U.; Spelman C.; CUNY, Graduate Center; Kean U.; Southern Connecticut State U.; Texas A&M International U.; New Mexico Institute of Mining and Technology; James Madison U.; Virginia State U.; U. Missouri, Saint Louis; Salisbury U.; Colgate U.; Ohio U.; West Chester U. Pennsylvania; U. Houston-Clear Lake; Texas A&M U.-Commerce; Appalachian State U.; U. Washington, Tacoma; Pennsylvania State U., Behrend; Georgia Southern U.; U. Hawaii, Hilo; East Tennessee State U.; CUNY, Lehman C.; Lamar U.; Reed C.; U. Wisconsin-Stevens Point; New York Institute of Technology; Bowdoin C.; Barnard C.; Claremont Graduate U.; Macalester C.; Bowling Green State U.; Seattle U.; Oberlin C.; U. Central Oklahoma; West Virginia State U.; Kennesaw State U.; Elon U.; U. South Carolina, Aiken; Wichita State U.; Azusa Pacific U.; Bates C.; SUNY, C. of Environmental Science and Forestry; West Texas A&M U.; Benedict C.; Morehouse C.; U. Detroit Mercy; Middle Tennessee State U.; Valparaiso U.; Grinnell C.; Swarthmore C.; Winthrop U.; Grand Valley State U.; C. of Saint Benedict; California State U., Dominguez Hills; Texas Christian U.; C. Wooster; U. West Florida; Bradley U.; Rowan U.; Norwich U.; U. Hartford; La Salle U.; U. del Turabo; Siena C.; Lincoln U., Jefferson City; Saint John’s U., Collegeville; Bucknell U.; Shaw U.; Sonoma State U.; Indiana U.-Purdue U., Fort Wayne; U. Wisconsin-La Crosse; U. San Francisco; CUNY, Baruch C.; U. Wisconsin-Oshkosh; Kettering U.; California State U., Long Beach; Middlebury C.; U. 
South Carolina, Beaufort; Davidson C.; Minnesota State U., Mankato; Wiley C.; East Central U.; U. Baltimore; Ithaca C.; South Dakota School of Mines and Technology; Western Kentucky U.; South Carolina State U.; Lawrence Technological U.; Eastern Michigan U.; Union C., Schenectady; Saint Michael’s C.; U. Nebraska, Kearney; U. Alaska, Anchorage; Colorado C.; U. West Georgia; Florida Gulf Coast U.; St. Cloud State U.; U. Toledo; Fort Valley State U.; St. John’s U., Manhattan; Haverford C.; U. Wisconsin-Green Bay; Fisk U.; U. of the Pacific; U. Minnesota, Morris; Missouri State U.; Pepperdine U.; CUNY, Medgar Evers C.; Quinnipiac U.; Hamilton C.; East Carolina U.; Hofstra U.; U. Texas, Tyler; Furman U.; Colby C.; SUNY, Geneseo; Gonzaga U.; U. Wisconsin-Platteville; Hope C.; California State U., Chico; Claflin U.; CUNY, York C.; Suffolk U.; Kentucky State U.; California State U., Fullerton; Skidmore C.; Western Illinois U.; Murray State U.; Northern Kentucky U.; McNeese State U.; U. San Diego; Savannah State U.; Rider U.; California State Polytechnic U., Pomona; Indiana U., South Bend; U. Richmond; Eastern Connecticut State U.; U. of the District of Columbia; Marshall U.; U. Northern Iowa; Niagara U.; Nova Southeastern U.; Rhode Island School of Design; U. of Mary Washington; Chapman U.; U. Michigan, Dearborn; Roger Williams U.; Hawaii Pacific U.; Jacksonville State U.; Texas Woman’s U.; Purdue U., North Central; Central State U.; Albany C. of Pharmacy and Health Sciences; Franklin and Marshall C.; Pacific U.; Washington and Lee U.; Saginaw Valley State U.; Western Carolina U.; Dickinson C.; Saint Joseph’s U.; Coastal Carolina U.; Wheaton C., Wheaton; U. North Carolina, Asheville; Youngstown State U.; SUNY, C. Brockport; Sewanee: U. of the South; Santa Clara U.; Lake Superior State U.; U. Tennessee, Knoxville, Institute of Agriculture; U. Maryland, Center for Environmental Science; Louisiana State U., Health Sciences Center – New Orleans; U. 
North Texas, Health Science Center; U. Puerto Rico, Medical Sciences Campus; Eastern Virginia Medical School; Van Andel Institute; Texas Tech U., Health Sciences Center; Morehouse School of Medicine; Mercer U.; SUNY, Upstate Medical U.; SUNY, Downstate Medical Center; Loma Linda U.; Catholic U. of America; Louisiana State U., Health Sciences Center - Shreveport; Texas A&M U.-Kingsville; U. of the Virgin Islands; Texas Tech U., Health Sciences Center, El Paso; Albany Medical C.; New York Medical C.; Northeast Ohio Medical U.; Rosalind Franklin U. of Medicine and Science; Charles R. Drew U. of Medicine and Science; Meharry Medical C.; North Carolina Central U.; U. Texas Health Science Center, Tyler; Montana Tech of U. Montana; Humboldt State U.; Rhode Island C.; Tarleton State U.; U. Maryland, Eastern Shore; St. Edward’s U.; Dillard U.; Midwestern U.; Langston U.; Ponce Health Sciences U.; U. New England; U. South Florida, Saint Petersburg; Seton Hall U.; Clark U.; U. Central del Caribe; U. Guam; National Defense U.; Alfred U.; U. Massachusetts, central office; California State U., Los Angeles; Western U. of Health Sciences; Southern U. and A&M C., Agricultural Research and Extension Center; A. T. Still U.; Milwaukee School of Engineering; U. Oklahoma, Tulsa; U. of the Sciences Philadelphia; MGH Institute of Health Professions; SUNY, C. of Optometry; Roseman U. of Health Sciences; Mills C.; Touro U., Vallejo; Memorial Sloan Kettering Cancer Center, Louis V. Gerstner Jr. Graduate S. of Biomedical Sciences; Fuller Theological Seminary; Eastern Washington U.; Plymouth State U.; Tougaloo C.; Southeastern Louisiana U.; Naval War C.; Central Washington U.; Philadelphia C. of Osteopathic Medicine; Northwest Indian C.; Black Hills State U.; Erikson Institute; SUNY, Buffalo State; National U.; Alabama State U.; Edward Via C. of Osteopathic Medicine; Pittsburg State U.; Sul Ross State U.; Commonwealth Medical C.; Franklin W. Olin C. of Engineering; Austin Peay State U.; U. 
Illinois, Springfield; Oregon Institute of Technology; Mercyhurst U.; Connecticut C.; Grambling State U.; Morehead State U.; U. New Haven; U. Texas, Permian Basin; Keck Graduate Institute; Wheeling Jesuit U.; Palmer C. of Chiropractic, Davenport; Eastern Kentucky U.; U. Louisiana, Monroe; Oklahoma State U., Center for Health Sciences; U. South Florida, Sarasota-Manatee; Nicholls State U.; U. Alaska, Southeast; Occidental C.; California Maritime Academy; Indiana State U.; U. Puerto Rico, Cayey; Maine Maritime Academy; Marshall B. Ketchum U.; Stockton U.; U. Tampa; Bastyr U.; Albany State U.; American Samoa Community C.; Providence C.; Emerson C.; Salus U.; Alaska Pacific U.; Christopher Newport U.; Augsburg C.; Salish Kootenai C.; U. del Este; La Sierra U.; SUNY, C. Plattsburgh; Heidelberg U.; SUNY, Farmingdale State C.; Seattle Pacific U.; California State U., Stanislaus; New England C. of Optometry; U. Western States; U. Houston system administration; Keene State C.; Hobart and William Smith Colleges; U. of the Incarnate Word; U. Puerto Rico, Humacao; Augustana C., Sioux Falls; Barry U.; CUNY, Advanced Science Research Center; U. Redlands; Doane C.; Florida Polytechnic U.</td>
</tr>
</tbody>
</table>Brian AustinWhitehouse.gov is much less popular after 100 days of Trump2017-05-12T00:00:00-05:00https://austinbrian.github.io//trump-100-days<p>In 2017, the primary way most citizens interact with their government is through the internet. Before they go to an office in person, before they write a letter or make a call, they visit a website.
As such, the traffic across those webpages presents a useful window into the way people interact with their government, and into the rhythms of American life.</p>
<p>To understand how traffic preferences changed during the first 100 days of Donald Trump’s presidency, I pulled data on a month’s worth of traffic across all US government websites from <a href="https://analytics.usa.gov/">analytics.usa.gov</a>. The website compiles statistics for the prior 30 days (or the prior 5 minutes, if you’re interested in that). To see how activity changed, I compared the 30-day window that included Trump’s inauguration (the most readily available data ran through Jan. 29th, nine days after the inauguration) with the 30-day window ending on his 100th day in office, April 29th.</p>
<p>This chart demonstrates the way views of government domains were different in April from the way they were in January. It uses a “ranking” of the number of visits a domain received.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/pagerank.png" alt="" /></p>
<p>Beyond the number of visits (which determines ranking), I considered the number of pageviews a domain received over that time, as well as the number of unique users. For some pages, such as <a href="https://www.usajobs.gov/">USAjobs.gov</a>, the number of pageviews is relatively high, so the rankings differ somewhat under that measure. I’ve summarized some of the bigger differences below.</p>
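The period-over-period comparison described above can be sketched with pandas. The domain names and visit counts below are hypothetical stand-ins for the two 30-day analytics.usa.gov snapshots, not the actual figures:

```python
import pandas as pd

# Hypothetical 30-day snapshots: one row per domain with total visits.
jan = pd.DataFrame({"domain": ["irs.gov", "whitehouse.gov", "usajobs.gov"],
                    "visits": [40_000_000, 28_000_000, 25_000_000]})
apr = pd.DataFrame({"domain": ["irs.gov", "whitehouse.gov", "usajobs.gov"],
                    "visits": [55_000_000, 15_000_000, 26_000_000]})

# Rank each period by visits (1 = most visited), then line the periods up.
for df in (jan, apr):
    df["rank"] = df["visits"].rank(ascending=False).astype(int)

merged = jan.merge(apr, on="domain", suffixes=("_jan", "_apr"))
merged["rank_change"] = merged["rank_jan"] - merged["rank_apr"]
merged["visit_change"] = merged["visits_apr"] - merged["visits_jan"]
```

Negative `visit_change` and `rank_change` values flag the "big losers"; positives flag the gainers.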
<p><strong>Big Gainers</strong></p>
<ul>
<li>IRS websites are big gainers, since the first 100 days always<a href="http://archive.fortune.com/magazines/fortune/fortune_archive/2002/04/15/321414/index.htm" title="Well, since 1921. Tax Day began with passage of the 16th Amendment in 1913, but was collected March 1st, before Woodrow Wilson's inauguration on March 4th of 1913. It moved to March 15th in 1918, where it remained until 1955, when it moved again to the day we know, April 15th.">*</a> coincides with <a href="https://austinbrian.github.io/2017/04/18/pres_counties/">Tax Day</a> in America. The main IRS web domain saw the greatest increase in total visits among all executive branch domains.</li>
<li>NASA’s images domain saw the biggest proportional growth, with more than 100x as many visitors in April as in January. During that time the probe Cassini crossed through the rings of Saturn and sent back stunning photos.</li>
<li>The <a href="https://sec.gov">Securities and Exchange Commission</a> includes databases that house information on publicly-traded company quarterly filings. Since the period included the end of the first fiscal quarter for many companies, it may be unsurprising that visits to the website increased 83% between the two periods.</li>
<li>Visits to the White House’s <a href="https://www.stopbullying.gov/" title="the cyber">anti-cyberbullying initiative</a> jumped by 184% between January and April.</li>
<li>The 100-day marker is right before many deadlines for students applying to college to receive student aid packages, which may be why pageviews of the National Postsecondary Student Aid Survey domain increased by a greater percentage than any other government domain over the period.</li>
</ul>
<p><strong>Big Losah’s</strong></p>
<ul>
<li>The White House’s main website, <a href="https://www.whitehouse.gov/">whitehouse.gov</a>, plummeted in popularity between January and April. It tumbled from being the 9th most popular federal web domain to the 22nd, with 13 million fewer visitors over the 30-day period. Corresponding visits to <a href="https://search.whitehouse.gov">search.whitehouse.gov</a> fell similarly.</li>
<li><a href="https://petitions.whitehouse.gov/">Petitions to</a> the White House likewise fell precipitously, dropping 13 rank positions and drawing 3 million fewer visits in April than in January.</li>
<li>The State Department’s <a href="https://ceac.state.gov">web domain</a> that details a foreign traveler’s passport status saw a drop of more than 2 million visitors between the two months.</li>
<li>The <a href="https://opm.gov">Office of Personnel Management</a> - which DC-based folks know as the people who tell you whether the government is shut down due to snow - became much less popular in April than it was in January.</li>
</ul>
<p><strong>Top Government Websites</strong></p>
<ul>
<li>People always want their packages and they want to know the weather, so the domains for the <a href="https://tools.usps.com">post office</a> and the <a href="https://forecast.weather.gov">weather forecast</a> are consistently some of the top websites. These domains are also common API targets for mobile apps that involve weather and delivery.</li>
<li><a href="https://usajobs.gov">USAjobs.gov</a> is the main federal government portal for job applicants, and is one of the top ten websites in both months.</li>
<li>The Centers for Disease Control was popular in both January and April, as was the Social Security Administration.</li>
</ul>
<p>To summarize, Americans are busy living their lives, but they are quantifiably less interested in interacting with the public-facing part of the executive branch than they were at the beginning of the Trump Era.</p>
<p><em>See the full analysis of this question <a href="https://github.com/austinbrian/austinbrian.github.io/blob/master/analyses/Gov%20website%20analytics.ipynb">on Github</a>.</em></p>Brian AustinThe most important factor in the 2016 election? Student loans.2017-05-04T00:00:00-05:002017-05-04T00:00:00-05:00https://austinbrian.github.io//clinton_importance_feat<p>If you took all 3,113 counties in the United States, and wanted to try and guess whether a county would cast more votes for Hillary Clinton or Donald Trump, the most important piece of information to have would be the number of people who claim an education-related deduction or credit.</p>
<p>This finding on student loans is part of my ongoing analysis of the intersection of 2016 voting behavior and tax return information.</p>
<details><summary>More methodology</summary>
This analysis was performed using a random forest classifier, which used 124 variables drawn from county-level 2014 IRS data. Variables included aggregate totals and participant counts for each income type and deduction. The dataset was matched by county with the New York Times report of election night votes. The average cross-validated accuracy score of the model's classification was approximately 0.89 (good!), with an F1 score of 0.88 (not bad!).
</details>
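A setup like the one described above can be sketched with scikit-learn. The synthetic feature matrix below is a stand-in for the 124 county-level IRS variables, so the scores it produces are illustrative only:

```python
# Sketch of a random-forest classification with cross-validation; the
# synthetic data stands in for the 124 county-level IRS variables.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for 3,113 counties x 124 tax features; y = 1 means the county
# cast more votes for Clinton than Trump.
X, y = make_classification(n_samples=3113, n_features=124, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
acc = cross_val_score(model, X, y, cv=5)                 # accuracy per fold
f1 = cross_val_score(model, X, y, cv=5, scoring="f1")    # F1 per fold

# Feature importances are what rank variables like education credits first.
model.fit(X, y)
top_features = model.feature_importances_.argsort()[::-1][:5]
```

With the real IRS data, inspecting `top_features` against the column names is what surfaces the education credit/deduction counts as the leading predictors.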
<p><br />
To be sure, these groups of voters are qualitatively different in other measurable ways. For instance, the number of farm returns in a county is the next-most impactful feature in the income dataset behind education credits and deductions. Counties with the highest numbers of these returns were the most likely in the data to cast more votes for Trump than Clinton.</p>
<p>But these findings largely conform to the behaviors we might expect of the electorate. Those with higher levels of education (especially those at high-cost liberal arts colleges) are most likely to inhabit counties that vote for Clinton. Farmers were more likely to vote Trump.</p>
<p>Perhaps more interesting are the distinctions that weren’t found in the dataset. Most importantly, total taxable income did not emerge as an especially valuable predictor.
<br /></p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/agi_pp_vs_clinton.png" alt="" />
<br /></p>
<p>The figure above demonstrates the distribution of income, as separated by Clinton’s margin over Trump (or vice versa). Each blue dot represents a county in which more people voted for Clinton than Trump, and each red dot shows a county with more Trump votes.</p>
<p>While the sheer numbers for the Trump voters are significantly greater - Trump won about 84 percent of US counties - the distribution of incomes is relatively similar between them.</p>Brian AustinHillary Clinton won the taxpayer vote and other Tax Day findings2017-04-18T00:00:00-05:002017-04-18T00:00:00-05:00https://austinbrian.github.io//pres_counties<p>Happy (?) Tax Day, everyone.</p>
<p>In honor of Tax Day, I wanted to do a small piece of analysis on the way taxpayers voted in the 2016 election. Here are a couple of brief headlines:</p>
<ul>
<li>Hillary Clinton won among those who filed tax returns in 2015.</li>
<li>Donald Trump won the vast majority of counties, winning 2,623 counties to Clinton’s 490, or about 84%.</li>
<li>Of the counties that had the highest proportion of votes for independent candidates, 12 of the top 15 are in Utah (the other three are in Idaho), likely a function of the independent candidacy of Evan McMullin, a member of the Mormon church in Utah.</li>
<li>Counties with the highest total gross income per capita include a small town of 499 voters in Texas called McMullen, where Trump won 454 votes. Manhattan, NY also cracks the top 3, and voted for Clinton with 87 percent of the vote.</li>
<li>The poorest counties also voted heavily for Clinton. Three counties on South Dakota reservations and one county in Mississippi all voted for her by margins of at least 3-to-1.</li>
</ul>
<p>Here is a graphic of the overall margin by county, according to data on county-level vote from the New York <em>Times</em>.</p>
<iframe width="1000" height="1000" src="https://public.tableau.com/views/Clinton-TrumpMarginbyCounty/Story1?:embed=y&:display_count=yes" frameborder="0" allowfullscreen=""></iframe>
<p>A short note on methodology and assumptions:</p>
<ul>
<li>Income calculations are based on IRS <a href="https://www.irs.gov/uac/soi-tax-stats-county-data">county-level data</a> from taxes filed for the year 2014 (those typically due by April 15, 2015).</li>
<li>Taxpayer-level calculations were conducted by assuming that within a county, taxpayers and non-taxpayers voted at the same rate. Thus, to find the number of taxpayers voting for Clinton within a county, this analysis multiplied Clinton’s margin in that county by the total number of tax returns for the county.</li>
</ul>Brian Austin#mood2017-04-04T00:00:00-05:002017-04-04T00:00:00-05:00https://austinbrian.github.io//mood<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/theo_confetti_angel.jpg" alt="Theo" /></p>Brian AustinOverfitting the Sweet 162017-04-03T00:00:00-05:002017-04-03T00:00:00-05:00https://austinbrian.github.io//ncaa_overfitting<p>I have a very simple model for about a quarter of my NCAA bracket predictions: if UNC, move to next round.</p>
<p>This year, that model has worked pretty well.</p>
<p>But unfortunately, it only helps you out in a few circumstances. Every pick is UNC, which is useful when UNC plays Butler, but not so much when Michigan plays Oregon. So since it gives you the same output every time, no matter who is playing, statisticians and data scientists would say this model has <strong>low variance</strong>.</p>
<p>This isn’t very useful for the broader world of NCAA tournament basketball, so we need a more complex model, one that describes more teams. But here’s the problem a lot of people have in building models: <em>they use everything they know</em>. Let me explain.</p>
<p>Here is the bracket of the Sweet 16 games:</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/ncaa_ss_bracket_8blank.png" alt="where's Dook?" /></p>
<p>Since we currently live in a world where we know who won the Sweet 16 games, we can make a <em>super</em> accurate model to “predict” the winners of those games. It might look like this:</p>
<ol>
<li>If UNC is playing, pick UNC (this is my model, after all)</li>
<li>If a school has multiple NCAA championships over the last 20 years, pick it if it has more than the other school</li>
<li>In a contest between schools in states near large bodies of saltwater and freshwater, pick saltwater</li>
<li>If teams are both from the Confederacy, pick the eastern-most team</li>
<li>If a school has an X in its name, pick it (<em>Xs are cool</em>)</li>
<li>If all else fails, chalk (i.e., pick the team with the lower seed)</li>
</ol>
<p>And we’ll say that these are applied in this order, so that the 1st element of the model trumps the 2nd and so on.</p>
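The rule cascade above can be written as an ordered list of checks, where the first rule that fires decides the game. This is a toy sketch: the championship and saltwater lookups are stubbed with just the teams in this bracket, and rule 4 (easternmost Confederate) is omitted for brevity:

```python
# Toy implementation of the six-rule cascade above. Lookups are stubbed
# for just the teams discussed in this post; rule 4 is omitted.
CHAMPS = {"Florida": 2, "Kentucky": 2, "UNC": 1}   # titles in last 20 years
SALTWATER = {"Florida", "Oregon", "S. Carolina", "UNC"}

def pick(team_a, team_b, seeds):
    if "UNC" in (team_a, team_b):                       # 1. always pick UNC
        return "UNC"
    a_titles, b_titles = CHAMPS.get(team_a, 0), CHAMPS.get(team_b, 0)
    if a_titles != b_titles:                            # 2. more championships
        return team_a if a_titles > b_titles else team_b
    salty = [t for t in (team_a, team_b) if t in SALTWATER]
    if len(salty) == 1:                                 # 3. saltwater wins
        return salty[0]
    has_x = [t for t in (team_a, team_b) if "x" in t.lower()]
    if len(has_x) == 1:                                 # 5. Xs are cool
        return has_x[0]
    return team_a if seeds[team_a] < seeds[team_b] else team_b  # 6. chalk
```

Every rule here was chosen because it happens to separate the Sweet 16 winners from the losers, which is exactly the problem.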
<p>Here is how our picks look for the Sweet 16. It’s a weird model, but I’ve got a good feeling about it.</p>
<table>
<thead>
<tr>
<th>Factor</th>
<th>Factor Description</th>
<th>Team Picked</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.</td>
<td>Pick UNC</td>
<td>UNC</td>
</tr>
<tr>
<td>2.</td>
<td>NCAA champions</td>
<td>Florida, Kentucky</td>
</tr>
<tr>
<td>3.</td>
<td>Saltwater over freshwater</td>
<td>Florida, Oregon</td>
</tr>
<tr>
<td>4.</td>
<td>Easternmost Confederate</td>
<td>S. Carolina</td>
</tr>
<tr>
<td>5.</td>
<td>Has an X</td>
<td>Xavier</td>
</tr>
<tr>
<td>6.</td>
<td>Chalk</td>
<td>Kansas, Gonzaga</td>
</tr>
</tbody>
</table>
<p>How’d we do?</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/NCAA_ss_circles.png" alt="wow such prediction" /></p>
<p>Crushed it.</p>
<p>We got them all right! We are 100% accurate. This is the best model of all time. So we confidently deploy it to predict the winners of the Elite Eight who will move on to the Final Four.</p>
<table>
<thead>
<tr>
<th>Factor</th>
<th>Factor Description</th>
<th>Team Picked</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.</td>
<td>Pick UNC</td>
<td>UNC</td>
</tr>
<tr>
<td>2.</td>
<td>NCAA champions</td>
<td>Florida</td>
</tr>
<tr>
<td>3.</td>
<td>Saltwater over freshwater</td>
<td><em>N/A, freshwaters already eliminated</em></td>
</tr>
<tr>
<td>4.</td>
<td>Easternmost Confederate</td>
<td><em>N/A, Florida already picked</em></td>
</tr>
<tr>
<td>5.</td>
<td>Has an X</td>
<td>Xavier</td>
</tr>
<tr>
<td>6.</td>
<td>Chalk</td>
<td>Kansas</td>
</tr>
</tbody>
</table>
<p>And just as we suspected…</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/NCAA_ss_xxxs.png" alt="not as good" /></p>
<p>Wait what?</p>
<p>The only school we got right here was UNC, giving our model an accuracy rate of 25%. Dang.</p>
<p>What went wrong?</p>
<p>When we went to create a model, we focused on hitting every point we knew we needed to hit, which statisticians and data scientists refer to as having <strong>high variance</strong>: the model bends to match every quirk of the data it was built on.</p>
<p>This is a big problem in data science, known as “overfitting.” All that means is that the model fits the original dataset too closely, and isn’t flexible enough to be applied in the wider world.</p>
<p>So in summary: fitting your data closely comes at a tradeoff, called <strong>variance</strong>. Simpler models minimize this variance, but run into limits on accuracy due to high <strong>bias</strong>. Good models are those that minimize both.</p>
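A toy numeric version of the same tradeoff (not from the bracket, just an illustration): fit a straight line and a degree-9 polynomial to ten noisy points from a sine curve. The wiggly model nails the points it has seen and does worse on the points it hasn't:

```python
import numpy as np

# Ten noisy training points from a sine curve, plus clean test points.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = np.linspace(0.03, 0.97, 50)
y_test = np.sin(2 * np.pi * x_test)

def fit_and_score(degree):
    """Mean squared error on training and test points for a polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

simple_train, simple_test = fit_and_score(1)    # high bias, low variance
overfit_train, overfit_test = fit_and_score(9)  # low bias, high variance
```

The degree-9 fit threads through all ten training points almost exactly, so its training error collapses; its test error does not.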
<p>And Go Heels.</p>Brian AustinWho’s that Lady?2017-03-27T00:00:00-05:002017-03-27T00:00:00-05:00https://austinbrian.github.io//dc-dogs<p>Let’s say you’re at a local dog park in DC, and you see a dog chasing a ball, doing dog things, and you want to call out for her to come play with you. What should you yell?</p>
<p>Try “Lady” or “Blue.”</p>
<p><img src="https://pbs.twimg.com/media/C77GgWCW0AAAHqi.jpg:large" alt="Peanut! from @darth" /></p>
<p>According to a list of 1,533 dog registrations in the District in 2016, those are the most likely names for female and male dogs, respectively. Here are other common pup names.</p>
<table>
<thead>
<tr>
<th>Good Boys</th>
<th>#</th>
<th> </th>
<th>Good Girls</th>
<th>#</th>
</tr>
</thead>
<tbody>
<tr>
<td>Blue</td>
<td>10</td>
<td> </td>
<td>Lady</td>
<td>14</td>
</tr>
<tr>
<td>King</td>
<td>8</td>
<td> </td>
<td>Chloe</td>
<td>11</td>
</tr>
<tr>
<td>Rocky</td>
<td>7</td>
<td> </td>
<td>Bella</td>
<td>10</td>
</tr>
<tr>
<td>Charlie</td>
<td>6</td>
<td> </td>
<td>Sheba</td>
<td>7</td>
</tr>
<tr>
<td>Nino</td>
<td>6</td>
<td> </td>
<td>Princess</td>
<td>6</td>
</tr>
<tr>
<td>Max</td>
<td>6</td>
<td> </td>
<td>Lola</td>
<td>6</td>
</tr>
<tr>
<td>Zeus</td>
<td>6</td>
<td> </td>
<td>Sasha</td>
<td>6</td>
</tr>
</tbody>
</table>
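A tally like the table above can be computed with `collections.Counter`. The records below are hypothetical stand-ins for rows of the FOIA'd registration dataset:

```python
from collections import Counter

# Hypothetical registration records: (name, sex) pairs standing in for
# rows of the 2016 DC dog registration dataset.
records = [("Lady", "F"), ("Lady", "F"), ("Blue", "M"),
           ("Blue", "M"), ("Blue", "M"), ("Chloe", "F")]

# One counter per sex, so "Oreo" the male and "Oreo" the female tally apart.
by_sex = {"M": Counter(), "F": Counter()}
for name, sex in records:
    by_sex[sex][name] += 1

top_girl = by_sex["F"].most_common(1)[0]   # most frequent female name
top_boy = by_sex["M"].most_common(1)[0]    # most frequent male name
```

Running this over the full 1,533-dog dataset is what produces the "Lady" and "Blue" counts above.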
<p>Rambo, my favorite name on this list, is shared by four lucky dogs. Those who enjoy trademark infringement may consider “Oreo,” which is shared by five males and one female.</p>
<p>So what else do we know about these dogs?</p>
<p>For starters, they are slightly more likely to be male, and both males and females (that are registered with DC) are very likely to be fixed.</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/dogs_registered_mf.png" alt="" /></p>
<p>They also have a great chance of being some breed of terrier. More than a third of registered dogs are some kind of terrier or terrier mix. The most common breed variety in DC is the Terrier / Pit Bull mix, with 233 registrants.</p>
<p>Dogs in DC are also on the younger side, as shown in this chart (measured in people years).</p>
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/dog_years_dc.jpg" alt="" /></p>
<p>The data on who the dogs are in DC comes from a <a href="https://github.com/katerabinowitz/FOIA-Requests/tree/master/Registered%20Dogs">dataset</a> FOIA’d and compiled by the great <a href="http://www.datalensdc.com/">Kate Rabinowitz</a>.</p>
<p>Hopefully, this has been a good excuse for you to go out and pet a pup!</p>Brian AustinThe wild predictive ride of the NCAA first round2017-03-20T00:00:00-05:002017-03-20T00:00:00-05:00https://austinbrian.github.io//ncaa-rd1<p>Ah, March.</p>
<p>It’s a time of bright <a href="http://www.accuweather.com/en/weather-news/noreaster-shuts-down-travel-threatens-to-unleash-blizzard-conditions-in-at-least-8-states/70001093">spring sunshine</a>, cherry blossoms, and people everywhere searching their memories for where exactly <a href="https://en.wikipedia.org/wiki/Xavier_University" title="Cincinnati">Xavier University</a> is.</p>
<p>It’s tournament season! Or at least it is every year in my household until UNC loses. But beyond my general interest in watching a round ball go through a hoop, it’s also a great time to test out predictions. Since everyone and their brother makes predictions about who is likely to win the games each round, I thought I’d check and see how they are doing.</p>
<p>To do that, I simulated the projections from data scientists at <a href="http://www.cbssports.com/college-basketball/bracketology/">CBS Sports</a>, <a href="https://sports.yahoo.com/m/66607537-5012-36a0-8694-65a0522cf6c1/ss_2017-ncaa-tournament-bracket.html">Yahoo!</a>, data <a href="https://projects.fivethirtyeight.com/2017-march-madness-predictions/">blog 538</a>, <a href="http://www.espn.com/mens-college-basketball/bracketology">ESPN</a>, college basketball stat-tracker <a href="http://kenpom.com/">Ken Pomeroy</a>, and sports handicapper <a href="http://sagarin.com/sports/cbsend.htm">Jeff Sagarin</a>.</p>
<p>I collected the percentage likelihoods for each round as they were compiled at the <a href="https://www.nytimes.com/interactive/2017/03/13/upshot/ncaa-bracket-super-table.html">NY Times Upshot</a>.
<br /></p>
<details><summary>More methodology</summary>
I then built a simulator that, for each school, ran 5,000 iterations drawing on the win probabilities the forecasters gave. Compiling the results of these simulations yields an overall projection that differs very slightly (and randomly) from the simple average of that school's probability projections.</details>
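The simulation step above can be sketched in Python. This is a minimal sketch, not the original code; the function name and the forecaster probabilities in the example are illustrative:

```python
import random

def simulate_advancement(projections, n_sims=5000, seed=None):
    """Estimate a team's chance of winning via Monte Carlo.

    projections: win probabilities (0-1) for one team from the
    different forecasters (CBS, Yahoo!, 538, ESPN, Pomeroy, Sagarin).
    Each iteration picks one forecaster's number at random and draws
    a win/loss outcome from it, so the simulated rate differs very
    slightly (and randomly) from the plain average of the projections.
    """
    rng = random.Random(seed)
    wins = sum(
        rng.random() < rng.choice(projections)
        for _ in range(n_sims)
    )
    return wins / n_sims
```

For example, a 1-seed that every forecaster gives roughly a 95 percent chance in round one will simulate to about 0.95, give or take the random noise from the 5,000 draws.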
<p><br />
Since the first round of the tournament is 64 hours of basketball shoved into two days, it’s often the most exciting stretch, and the best for analyzing predictions. So I compiled the results of the first round and compared each team’s result to its prior prediction.</p>
<p>As the graph below demonstrates, these results split out into four groups:</p>
<ul>
<li><strong>Favorites</strong>: teams that are performing as advertised; they were supposed to win, and they did win</li>
<li><strong>Upsets</strong>: this is the reason we watch March Madness; these are the teams that weren’t supposed to win, but did. The farther to the left they appear, the more surprising it was.</li>
<li><strong>Shockers</strong>: this is where the <a href="http://screengrabber.deadspin.com/piccolo-tears-are-the-saddest-tears-1692898303">crying</a> happens; these teams were supposed to win, but lost. Brutal.</li>
<li><strong>Made-the-Dancers</strong>: where it was an honor just to be nominated, but nobody expected them to win, and they didn’t.</li>
</ul>
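A minimal way to express this four-way split, assuming each team carries its pre-tournament win probability and its first-round result (the function name and the even-odds threshold are my own framing, not from the original analysis):

```python
def classify_result(win_prob, won):
    """Sort a first-round team into one of the four groups above.

    win_prob: probability (0-1) the forecasters gave the team to win.
    won: whether the team actually won its first-round game.
    Teams favored at better than even odds were "supposed to win."
    """
    if won:
        return "Favorite" if win_prob >= 0.5 else "Upset"
    return "Shocker" if win_prob >= 0.5 else "Made-the-Dancer"
```

On the graph, these correspond to the four quadrants: predicted probability on one axis, actual outcome on the other.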
<p><img src="https://raw.githubusercontent.com/austinbrian/blog/master/images/NCAA_rd1_labeled.png" alt="" /></p>
<p>As we move into the second weekend of the tournament, I’ll keep taking a look at how these predictions held up. There have already been some early upsets: last year’s champ Villanova has lost this year. I felt bad for them for exactly <a href="https://www.youtube.com/watch?v=EMHoGRp1QrE">4.7 seconds</a>.</p>Brian Austin