How Social Media Data Mining Can Save Research Time and Money


Antoine Nouvet, a researcher with the SecDev Foundation, has been tracking the increase of social media usage by drug cartels in Mexico. His research has been covered by Vice[1] and the CBC[2], and has spawned many related articles.

Dr. Nouvet’s findings were that Mexican cartel members were increasingly using social media to recruit members, flaunt their wealth, and fight public relations battles. His research methodology has been primarily ethnographic: he collected data through interviews and other primary sources, supplemented by specific examples from social media such as YouTube videos, Instagram photos, and Twitter accounts.

What I find interesting about this research is its potential for linking data analysis techniques with traditional ethnographic studies. If Dr. Nouvet’s findings are accurate, then the observed increase in cartel mentions should be preserved within the Twitter record.

Data Collection Issues

There are several challenges for collecting this data. First, the Twitter accounts analyzed must come from Mexico. Tracking global mentions of cartels may miss the hoped-for increase in mentions. After all, the ethnographic study was focused on Mexico, and Mexico forms a small fraction of the global Twitter population. A small but statistically significant increase in cartel mentions within Mexico may be lost in the global context.

Typical collection tools using Twitter streams filter location based on a simple latitude/longitude box. Defining such a box for Mexico will include much of southern Arizona, New Mexico, Texas, and Louisiana, geographies that will skew the data significantly. These methods will also pick up any geotagged tweet sent from a cell phone within Mexico, and so will sample a large segment of foreigners vacationing there.
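A minimal sketch makes the bounding-box problem concrete. The box corners below are rough approximations of Mexico's extent chosen for illustration (they are assumptions, not values taken from any collection tool), and the city coordinates are approximate city centers.

```python
# Approximate bounding box around Mexico (assumed values, for illustration only).
MEXICO_BOX = {"south": 14.5, "north": 32.7, "west": -118.4, "east": -86.7}

# Approximate city-center coordinates as (latitude, longitude).
CITIES = {
    "Mexico City": (19.43, -99.13),
    "Tucson, AZ": (32.22, -110.97),
    "Las Cruces, NM": (32.32, -106.76),
    "Houston, TX": (29.76, -95.37),
    "New Orleans, LA": (29.95, -90.07),
    "Phoenix, AZ": (33.45, -112.07),
}


def in_box(lat, lon, box=MEXICO_BOX):
    """True when the point falls inside the rectangular latitude/longitude box."""
    return box["south"] <= lat <= box["north"] and box["west"] <= lon <= box["east"]


# Every city here except Phoenix passes the filter, although only one is in Mexico.
inside = sorted(name for name, (lat, lon) in CITIES.items() if in_box(lat, lon))
print(inside)
```

A rectangular filter cannot follow the border, so any box wide enough to cover Baja California and the Yucatán necessarily takes in a large strip of the southern United States.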

Second, the study period needs to be long enough to see a change in cartel mentions. According to Dr. Nouvet’s study, the increased usage of social media among cartel members has been occurring gradually over many years. Collecting data from the Twitter Firehose (a paid service that streams all Tweets in real time) for a few months is unlikely to show a statistically significant change. To see an effect, the data will need to be collected over many years, preferably overlapping with Dr. Nouvet’s study period.

Finally, since the study is over many years, the actual mention counts will need to be normalized as a percentage (Share of Voice) rather than absolute counts. The reason for this is that Twitter’s user base and activity have grown significantly over the past few years[3]. Let’s suppose the percentage of people discussing Mexican cartels has remained constant. As more and more people start using Twitter over the study period, the absolute number of tweets discussing Mexican cartels will rise, leading to what looks like an increase in cartel mentions, when in fact the percentage of people discussing the cartels has remained constant. The only thing that the absolute counts are measuring over a multi-year period is Twitter’s amazing growth.
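The normalization argument can be illustrated with a small sketch. All of the numbers below are hypothetical and chosen only to show the effect; they are not real Twitter figures.

```python
# Hypothetical inputs: a constant share of users discussing cartels,
# and an active user base that doubles each year.
SHARE_DISCUSSING = 0.001
ACTIVE_USERS_BY_YEAR = {2011: 1_000_000, 2012: 2_000_000, 2013: 4_000_000}


def summarize(active_users_by_year, share):
    """Return (year, absolute mention count, authors per 1,000) for each year."""
    rows = []
    for year, active in sorted(active_users_by_year.items()):
        mentioning = round(active * share)      # grows with the user base
        per_thousand = 1000 * mentioning / active  # stays flat
        rows.append((year, mentioning, per_thousand))
    return rows


for row in summarize(ACTIVE_USERS_BY_YEAR, SHARE_DISCUSSING):
    print(row)
```

The absolute count quadruples over the three years while the normalized share does not move, which is why only the normalized figure says anything about interest in the cartels.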

The problem with using Twitter streams or the Firehose is that it is difficult to get an accurate count of the total number of tweets against which to normalize. What Twitter provides through its stream API is a small sample of the actual tweets, and so can’t be used for normalization. Counting the absolute number of tweets in a day, week, month or year from the Firehose is extremely expensive, both in terms of computational requirements and the cost of licensing the service.


Many of the aforementioned challenges can be avoided by working with samples from the graph. I created samples using the Conditional Independence Coupling method (CIC)[4]. This method crawls the network of friends and followers, and is restricted to individuals living in Mexico. The algorithm creates a statistically proper sample of people, the equivalent of pulling names from a hat. Using CIC, I created a sample of 60,854 unique Twitter accounts that had posted at least one Tweet between Jan. 1, 2011 and Dec. 31, 2013, roughly 0.5% of all Twitter users from Mexico. For each of these accounts, I retrieved the Tweets they made between Jan. 1, 2011 and Dec. 31, 2013. This resulted in 47,676,620 Tweets to analyze. The sample and Tweets were collected in January and February 2014.

From the collected Tweets, I found those that mentioned one of the three main Mexican cartels referenced in Dr. Nouvet’s study: Los Zetas, Los Caballeros Templarios, and Cártel del Golfo. The Tweets that mentioned cartels were further subdivided by week. Then, I counted the unique authors that mentioned one of the cartels at least once. The number of authors mentioning the cartels was normalized by the total number of unique accounts that had posted at least one Tweet during that week. The resulting Share of Voice (SoV) is expressed in parts per mille (abbreviated ppm here), which is the number of authors mentioning at least one of the cartels out of every 1,000 authors posting that week.
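The weekly calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the search terms, the `(author, date, text)` record shape, and the use of ISO weeks are all hypothetical choices, since the post does not say how mentions were matched or how weeks were delimited.

```python
from collections import defaultdict
from datetime import date

# Hypothetical search terms; the actual matching rules are not given in the text.
CARTEL_TERMS = ("los zetas", "caballeros templarios",
                "cártel del golfo", "cartel del golfo")


def weekly_share_of_voice(tweets):
    """tweets: iterable of (author_id, date, text).

    Returns {(iso_year, iso_week): authors mentioning a cartel per 1,000 active authors}.
    """
    active = defaultdict(set)      # week -> authors who posted anything
    mentioning = defaultdict(set)  # week -> authors who mentioned a cartel
    for author, day, text in tweets:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        active[week].add(author)
        lowered = text.lower()
        if any(term in lowered for term in CARTEL_TERMS):
            mentioning[week].add(author)
    return {
        week: 1000 * len(mentioning.get(week, ())) / len(authors)
        for week, authors in active.items()
    }


# Toy data: four active authors in one week, one of whom mentions a cartel twice.
sample = [
    ("a", date(2013, 9, 16), "Los Zetas en las noticias"),
    ("a", date(2013, 9, 17), "otra vez Los Zetas"),
    ("b", date(2013, 9, 18), "buenos días"),
    ("c", date(2013, 9, 19), "tráfico terrible"),
    ("d", date(2013, 9, 20), "café"),
]
print(weekly_share_of_voice(sample))
```

Counting authors rather than Tweets means one very vocal account cannot inflate the measure: author "a" posts twice about Los Zetas but is counted once.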

Figure 1 shows the SoV plotted against time for the three-year period under study. The time series is quite spiky, with large peaks corresponding to key events that occurred during this period. Such notable events include when Anonymous was rumored to be waging cyber war on the Los Zetas cartel, the capture of “Lucky” Hernandez in a bloody gun battle, the rumored arrest (later retracted) of Los Zetas’s second-in-command “Z-40”, the killing of Los Zetas boss Heriberto Lazcano, the eventual capture of “Z-40”, and Cártel del Golfo’s aid following Hurricane Ingrid (an event prominently featured in Dr. Nouvet’s work).


Figure 1: Share of Voice for Mexican Cartels


It can be difficult to identify trends when analyzing spiky time series data. When averaging, the large peaks artificially increase the baseline value. The best approach is to remove the largest spikes and then calculate an average value between the most active periods. This is called change point analysis[5]. Change point analysis considers the fluctuations in the mean value and variance of the data, and identifies regions (such as large spikes) that are statistically different. These change points are marked and then the mean value is calculated between them.

I used the Binary Segmentation method[6] for detecting changes in the time series data, and found 13 change points. The mean values between change points are plotted in red on Figure 1. There are three that I consider to represent a baseline: Jan. 2011-June 2011, Dec. 2011-June 2012, and Sept. 2013-Dec. 2013. Across these regions, I see a statistically significant increase of SoV (1.2, 1.3, 1.7 ppm respectively). This step-wise increase in SoV between major events is a typical pattern where the SoV is steadily increasing in time. This analysis bears out Dr. Nouvet’s ethnographic conclusions.
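To give a feel for how binary segmentation works, here is a simplified sketch. It is not the analysis used for Figure 1: it detects shifts in the mean only (not the variance), uses a plain sum-of-squares cost, and relies on a `penalty` threshold and `min_size` that are hypothetical tuning parameters introduced for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (cost of fitting one flat level)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(values, min_size):
    """Return (gain, index) for the single split that most reduces the cost."""
    total = sse(values)
    best = (0.0, None)
    for i in range(min_size, len(values) - min_size + 1):
        gain = total - sse(values[:i]) - sse(values[i:])
        if gain > best[0]:
            best = (gain, i)
    return best


def binary_segmentation(values, penalty, min_size=2):
    """Keep splitting segments while the cost reduction exceeds the penalty."""
    change_points = []
    stack = [(0, len(values))]
    while stack:
        start, end = stack.pop()
        gain, i = best_split(values[start:end], min_size)
        if i is not None and gain > penalty:
            change_points.append(start + i)
            stack.extend([(start, start + i), (start + i, end)])
    return sorted(change_points)


def segment_means(values, change_points):
    """Mean value of each region between consecutive change points."""
    bounds = [0, *change_points, len(values)]
    return [sum(values[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


# Toy series: a quiet baseline, a three-week spike, then a slightly higher baseline.
series = [1.2] * 10 + [5.0] * 3 + [1.3] * 10
points = binary_segmentation(series, penalty=1.0)
print(points, segment_means(series, points))
```

Because the spike is isolated into its own segment, the means on either side reflect the baseline rather than being pulled upward by the peak.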

Dr. Nouvet’s work pays a great deal of attention to the social media efforts of Cártel del Golfo following Hurricane Ingrid in Sept. 2013. In an attempt to counter their violent and negative image in the media, Cártel del Golfo provided humanitarian aid in many regions devastated by the storm. They turned to social media to spread the word. The baseline reading following this event is much higher than previous baselines (a difference of 0.4 ppm), indicating that the public relations campaign was successful.

Perhaps the most interesting feature is the muzzling of cartel mentions following the death of Los Zetas boss Heriberto Lazcano. I have seen previously[7] that following prolonged periods of elevated social activity there is a period of decreased SoV, indicating audience fatigue. It is unusual to see fatigue after a very localized event. But the period following Lazcano’s death (Oct. 2012-May 2013) has the lowest SoV of any period (0.7 ppm or a 50% reduction). This could perhaps indicate a cooling-off following such a high profile event. People who would ordinarily post about the cartels might have thought it unwise during this period.


This study of Mexican Twitter accounts for mentions of cartels reaches the same conclusions as a traditional ethnographic study. Over the three-year period, there is clear evidence that cartel mentions are on the rise. There is further evidence of some key points warranting further study: the period following Hurricane Ingrid, which has been analyzed ethnographically by Dr. Nouvet; and the period following Lazcano’s death, which does not seem to have been studied.

These results are very encouraging. Properly formulated studies (those that create accurate samples from correct populations over a significant time span) can provide insights similar to those from traditional ethnographic methods. It goes without saying that the time spent collecting the data (about two months in this example) is significantly less than the years that traditional techniques require. It is intriguing to consider the impacts this might have on business areas that use traditional ethnographic methods, such as market research surveys and focus groups.

Featured Image “Gunplay” Image courtesy of Rob View CC License

Footnotes

  4. White, K., Li, G., & Japkowicz, N. (2012, December). Sampling Online Social Networks Using Coupling from the Past. In Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on (pp. 266-272). IEEE.
  5. Chen, J., & Gupta, A. K. (2000). Parametric Statistical Change Point Analysis. Birkhäuser.
  6. Scott, A. J., & Knott, M. (1974). A cluster analysis method for grouping means in the analysis of variance. Biometrics, 30, 507-512.
