International tourism in Europe has more than doubled in the last 30 years. Especially within the EU, open borders have made it hard to judge where tourists are coming from and where they are going. This makes it hard to analyse movement, the number of visitors and their interests in a country. In this blog post, let’s take to Twitter and find out how tourists arrive in, move around and finally depart from the Netherlands!
How can we identify tourists on Twitter?
Many people tend to tweet when they are abroad in a new country. Since in this post we are examining the Netherlands, we restrict ourselves to English tweets, which should still leave us with a large enough sample to work with. Of course, some tourists might tweet in their native language (other than English), and we will miss those. English is the dominant language on Twitter, making up around 33.9% of all tweets, so we can assume that, disregarding Dutch, English will be the most common language for tweets sent in the Netherlands.
This, of course, leads to the question of how to pick out the tourists among all the English tweets. Luckily, a large number of tweets are geo-tagged. This means that we are able to find out where a tweet was sent from, at least at the level of municipalities. We can use these locations to see how long a unique user spends in the Netherlands, which gives us a baseline for who counts as a tourist.
So, our task is clear: we need English tweets sent from the territory of the Netherlands, and we then need to compute the time elapsed between the first and the last tweet of each user! Looking at the year 2016, we are left with 730 601 tweets... oof.
So, 730 601 English tweets, from the territory of the Netherlands, in 2016. These are individual tweets though, not Twitter users, and not all of them will come from tourists. As we mentioned, let’s group these tweets by user and find out when each user’s first and last tweet was made. From this, we can get the number of days elapsed between the two tweets and plot it on a histogram!
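The grouping step could look roughly like the sketch below. It is not the code used in the project: the sample tweets, the timestamp format and the function name `days_between_first_and_last` are all made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical sample: (user_id, timestamp) pairs for English tweets
# sent from the Netherlands. The real data set is far larger.
tweets = [
    ("alice", "2016-03-01 09:15"),
    ("alice", "2016-03-04 18:40"),
    ("bob", "2016-01-10 12:00"),
    ("bob", "2016-11-20 08:30"),
    ("carol", "2016-07-07 14:05"),
]

def days_between_first_and_last(tweets):
    """Return {user_id: (tweet_count, whole days between first and last tweet)}."""
    times_by_user = defaultdict(list)
    for user_id, stamp in tweets:
        times_by_user[user_id].append(datetime.strptime(stamp, "%Y-%m-%d %H:%M"))
    return {
        user_id: (len(times), (max(times) - min(times)).days)
        for user_id, times in times_by_user.items()
    }

activity = days_between_first_and_last(tweets)

# Histogram data: how many users fall on each "days elapsed" value.
histogram = Counter(days for _, days in activity.values())
```

The `histogram` counter holds the values that a plotting library would draw as bars.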
From this, we can see that many people spend between 1 and 25 days in the Netherlands, or at least tweet within this interval. We can discount everyone who tweeted only once, as a single tweet won’t help us find out how they move through the country. To stay as certain as possible, we will also discount everyone after the “bend of the elbow” in the graph, where the decrease becomes less abrupt.
This leaves us with a total of 67 029 tweets: English tweets, from the Netherlands, from people who have tweeted at least twice with less than 15 days between their first and last tweet. This will be our baseline for identifying tourists, and it corresponds to approximately 16 000 tourists.
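As a minimal sketch of this baseline rule (at least two tweets, and less than 15 days between the first and the last), assuming we already have a tweet count and a day span per user. The sample values and the name `select_baseline_tourists` are hypothetical.

```python
# Hypothetical per-user summary: user_id -> (tweet_count, days between first and last tweet).
activity = {
    "alice": (2, 3),    # short stay, several tweets: kept
    "bob": (40, 314),   # tweets all year round: probably a resident
    "carol": (1, 0),    # a single tweet tells us nothing about movement
    "dave": (5, 14),    # just under the cut-off: kept
    "erin": (3, 15),    # exactly at the cut-off: dropped
}

def select_baseline_tourists(activity, max_days=15):
    """Keep users with at least two tweets and fewer than max_days between first and last."""
    return {
        user_id
        for user_id, (tweet_count, days) in activity.items()
        if tweet_count >= 2 and days < max_days
    }

baseline_tourists = select_baseline_tourists(activity)
```

The cut-off of 15 days is the one read off the histogram in this post; for another country or year it would have to be chosen again.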
Machine learning algorithms to detect tourists
We have found our baseline purely through reasoning and subsetting the data, but can we design a machine-learning algorithm to find the users that can be considered tourists?
We can use a technique called co-training to find tourists in our data in a more reliable way. Co-training works with two machine learning models and can be extremely powerful when we only have a small labelled set, that is, a dataset that was previously annotated by a human with the relevant labels. In our case, the label is whether a Twitter user is a tourist or not.
We manually label a number of users: 1736 tourists and 5896 non-tourists. This can be done relatively fast with multiple people. Some of the non-tourists display a pattern: for example, a Twitter account from Eindhoven airport shares weather data daily, and is thus easy to identify as a non-tourist at first glance. To speed up computation, we will only use 60 percent of our data for training. This way our unlabelled data will contain 62 061 users.
For co-training to succeed, we need to create two different models that work on different data and/or use a different technique, but try to classify for the same property. In our case, we will use support-vector machines in both of our models, but train them on different data: Model A will be trained on the tweets and Model B will be trained on the profile info attached to each user.
The two models will take the initially labelled ~7500 users as a training set and classify the remaining users as tourists or non-tourists. Once the classification is done, we take the 10 percent most confidently classified users and reinsert them into the training set with their new labels. This 10 percent is of course rounded up, and we always begin with the users that both models agree on with the highest confidence. We repeat the training and classification process until there are no users left in the unlabelled set.
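The loop described above can be sketched as follows. To keep the example free of dependencies, it uses a tiny nearest-centroid classifier as a stand-in for the support-vector machines; this is not the model used in the project. The feature vectors, the way confidence is measured and the handling of disagreements between the two models are also assumptions made for illustration.

```python
import math

def fit_centroid(X, y):
    """Stand-in classifier (NOT an SVM): returns predict(x) -> (label, confidence)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        dist = {lab: math.dist(x, c) for lab, c in centroids.items()}
        ranked = sorted(dist, key=dist.get)
        # Confidence = margin between the nearest and second-nearest centroid.
        margin = dist[ranked[1]] - dist[ranked[0]] if len(ranked) > 1 else 0.0
        return ranked[0], margin

    return predict

def co_train(view_a, view_b, seed_labels, percent=10):
    """view_a / view_b: {user: feature vector}; seed_labels: {user: label}."""
    labels = dict(seed_labels)
    unlabelled = [u for u in view_a if u not in labels]
    while unlabelled:
        users = list(labels)
        y = [labels[u] for u in users]
        model_a = fit_centroid([view_a[u] for u in users], y)
        model_b = fit_centroid([view_b[u] for u in users], y)
        scored = []
        for u in unlabelled:
            (label_a, conf_a), (label_b, conf_b) = model_a(view_a[u]), model_b(view_b[u])
            agree = label_a == label_b
            # Assumption: on disagreement, trust the more confident model.
            label = label_a if (agree or conf_a >= conf_b) else label_b
            confidence = conf_a + conf_b if agree else abs(conf_a - conf_b)
            scored.append((agree, confidence, u, label))
        # Users both models agree on come first, then by descending confidence.
        scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
        take = (len(unlabelled) * percent + 99) // 100  # percent share, rounded up
        moved = set()
        for _, _, u, label in scored[:take]:
            labels[u] = label
            moved.add(u)
        unlabelled = [u for u in unlabelled if u not in moved]
    return labels

# Hypothetical features: view A stands for tweet content, view B for profile info.
view_a = {"u1": [0.9, 0.1], "u2": [0.1, 0.8], "u3": [0.8, 0.2], "u4": [0.2, 0.9], "u5": [0.7, 0.3]}
view_b = {"u1": [1.0], "u2": [0.0], "u3": [0.9], "u4": [0.1], "u5": [0.8]}
seed = {"u1": "tourist", "u2": "non-tourist"}
result = co_train(view_a, view_b, seed)
```

Because the share is rounded up, at least one user moves into the training set on every pass, so the loop always terminates.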
Using co-training we manage to identify 31 267 tourists from 60% of the complete data, which almost doubles the initial number of tourists we found using only our analytical approach. Now, what can we use this information for?
How do tourists move around in the Netherlands?
We have our tourists, now it’s time to find out where they entered the Netherlands, how they moved around and where they left. The first step is to discretise the geo-tags and link them to a municipality or area in the Netherlands. Since we won’t track movements within a municipality, we define movement as a change in municipality between two consecutive tweets. As people might tweet from the same location multiple times, we need to reduce the data so that a run of consecutive tweets from the same location only gets a location label once. For example, in the following case there were 3 tweets from Amsterdam, but they should be treated as only one location label:
Raw form: Amsterdam – Amsterdam – Amsterdam – Rotterdam
Reduced form: Amsterdam – Rotterdam
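In Python, this reduction can be sketched with `itertools.groupby`, which groups consecutive equal items. The function name `reduce_chain` is made up for this example.

```python
from itertools import groupby

def reduce_chain(locations):
    """Collapse runs of consecutive identical locations into a single label."""
    return [location for location, _ in groupby(locations)]

raw = ["Amsterdam", "Amsterdam", "Amsterdam", "Rotterdam"]
reduced = reduce_chain(raw)

# A later return to an earlier location is kept, since it is a real movement.
round_trip = reduce_chain(["Amsterdam", "Utrecht", "Utrecht", "Amsterdam"])
```

Here `reduced` is `["Amsterdam", "Rotterdam"]`, and `round_trip` keeps both visits to Amsterdam.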
Some tourists also revisit locations they have already been to, meaning that of the 595 labels we find by concatenating the strings, only 361 are unique. The following table shows the most “returned to” municipalities.
Using this, we can gather a couple of insights. First of all, we can see which municipalities are the most popular for arrivals and departures amongst the tourists. Not surprisingly, Amsterdam is where most of the tourists tweet for the first and the last time, with over 60 percent of them entering and leaving the Netherlands through Amsterdam. Port cities like Rotterdam and Den Haag are also common amongst the tourists, with rates of 4 and 5 percent for Rotterdam and 2 and 2 percent for Den Haag, for arrivals and departures respectively.
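Arrival and departure shares like these can be computed by counting the first and last label of each tourist’s reduced chain. The sketch below uses made-up chains, so its percentages have nothing to do with the real figures above.

```python
from collections import Counter

# Hypothetical reduced location chains, one per tourist.
chains = [
    ["Amsterdam", "Rotterdam", "Amsterdam"],
    ["Amsterdam", "Utrecht"],
    ["Rotterdam", "Den Haag"],
    ["Amsterdam"],
]

def percentage_shares(counter):
    """Turn raw counts into percentages of the total."""
    total = sum(counter.values())
    return {place: 100 * count / total for place, count in counter.items()}

# First label = assumed point of arrival, last label = assumed point of departure.
arrivals = percentage_shares(Counter(chain[0] for chain in chains if chain))
departures = percentage_shares(Counter(chain[-1] for chain in chains if chain))
```

Keep in mind that the first tweet only approximates the point of entry: a tourist may well land at Schiphol and only start tweeting a day later in another city.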
We can also see how the whole of the Netherlands is traversed by tourists on a macro level if we define movement as a change of province instead of municipality. With this, we find that 13 000 tourists visited only one province and that tourists visit an average of 1.2 provinces during their stay.
We can also see how different holiday periods affect arrivals and departures. The following table shows the daily average arrivals to and departures from the Netherlands in 2016, during Carnival and during the Christmas period:
The last things that we should take a look at are the places where no tourists arrive, where none leave from, and where they don’t tweet at all.
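Switching from municipalities to provinces only needs a lookup table before the same reduction step. A small sketch, with a deliberately tiny mapping and hypothetical chains:

```python
from itertools import groupby

# Small excerpt of a municipality -> province lookup; a full table would cover every municipality.
PROVINCE_OF = {
    "Amsterdam": "Noord-Holland",
    "Haarlem": "Noord-Holland",
    "Rotterdam": "Zuid-Holland",
    "Den Haag": "Zuid-Holland",
    "Utrecht": "Utrecht",
    "Maastricht": "Limburg",
}

def province_chain(municipalities):
    """Map each municipality to its province, then collapse consecutive repeats."""
    provinces = [PROVINCE_OF[m] for m in municipalities]
    return [province for province, _ in groupby(provinces)]

# Hypothetical municipality chains, one per tourist.
chains = [
    ["Amsterdam", "Haarlem"],
    ["Amsterdam", "Rotterdam", "Den Haag"],
    ["Maastricht"],
    ["Utrecht", "Amsterdam", "Utrecht"],
]

# Number of distinct provinces each tourist tweeted from.
provinces_visited = [len(set(province_chain(chain))) for chain in chains]
single_province_tourists = sum(1 for n in provinces_visited if n == 1)
average_provinces = sum(provinces_visited) / len(provinces_visited)
```

A trip from Amsterdam to Haarlem counts as movement on the municipality level but not on the province level, as both lie in Noord-Holland.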
The green municipalities on this map mark the locations where no tourist sent their first tweet. This would make us assume that no tourists entered the Netherlands at those locations.
The red municipalities on this map mark the locations where no tweets were made last. This leads to the assumption that no tourists leave the Netherlands at those locations.
The yellow municipalities show where no tweets were made at all by tourists, leading to the assumption that no tourists visit these locations.
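The three groups of municipalities can be found with plain set differences. The sketch below uses a handful of municipalities and made-up chains; the variable names are only for illustration.

```python
# Hypothetical excerpt of the full list of municipalities.
all_municipalities = {"Amsterdam", "Rotterdam", "Utrecht", "Urk", "Vaals"}

# Hypothetical reduced location chains, one per tourist.
chains = [
    ["Amsterdam", "Utrecht", "Rotterdam"],
    ["Rotterdam", "Amsterdam"],
    ["Amsterdam", "Vaals", "Amsterdam"],
]

first_tweet_places = {chain[0] for chain in chains if chain}
last_tweet_places = {chain[-1] for chain in chains if chain}
visited_places = {place for chain in chains for place in chain}

no_arrivals = all_municipalities - first_tweet_places    # green on the map
no_departures = all_municipalities - last_tweet_places   # red on the map
never_visited = all_municipalities - visited_places      # yellow on the map
```

By construction, every municipality in `never_visited` also appears in the other two sets.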
Now that we have seen some insights that can be gathered from tweets that can be beneficial for countries, how can social media mining be applied to other areas and businesses? How would you combine this research with our previous article on sentiment analysis?
This research was conducted as part of a project in the MSc courses Data Science for Decision Making and Artificial Intelligence at the Department of Data Science and Knowledge Engineering of Maastricht University under the supervision of Dr. Mena Habib. The team members were Krist Shingjergji, Max Hort, Chong Zhang and Marcell Ignéczi.
For a more detailed overview and a review of more results, our paper can be downloaded here.