Learn Python: Data Cleaning Tweet Locations

Introduction

I recently scraped Twitter using its API, Tweepy and a script hosted on GitHub and discussed on WordPress. After scraping, I wanted to determine the location of the Tweets. The purpose was to help build a picture of market location demographics, whilst also learning how to plot data on a world map using GeoPandas. This article discusses the five different approaches I took to determine Tweet location, highlighting the pros and cons of each. The code used has been published in an accompanying Jupyter Notebook.

Determining Tweet Location

The five different approaches, each based on a different field returned by the API, were: geocode, location, country code, time zone and UTC offset.

Tweet Location by Geocode

Initially I attempted to plot location using geocode, the geographic (latitude and longitude) coordinates; however, of the 31386 tweets returned, only 184 had coordinates – at 0.6%, not a great percentage. Less than 1% would seem typical: a scrape by the script's author returned over half a million tweets, of which about four hundred had coordinates.
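As a rough sketch of that check (assuming the scraped tweets have been flattened into a pandas DataFrame with a 'coordinates' column; the file name is hypothetical):

```python
import pandas as pd

# Hypothetical file of scraped tweets, one JSON object per line.
df = pd.read_json('tweets.json', lines=True)

# 'coordinates' is empty when the user did not attach a geotag.
with_coords = df['coordinates'].notna().sum()
print(f"{with_coords} of {len(df)} tweets have coordinates "
      f"({100 * with_coords / len(df):.1f}%)")
```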

Testing the spread in the UK, I scraped Tweets within a 20km radius of London, which returned nothing. Expanding the radius to 100km also proved Tweetless. If I was going to determine where most tweets came from, I was clearly not going to be able to use the latitude and longitude coordinates of geocode. I needed another method.
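For reference, the radius test looked roughly like the sketch below, assuming Tweepy 3.x (where api.search accepts a geocode parameter of the form 'latitude,longitude,radius'); the credentials and search term are placeholders:

```python
import tweepy

# Placeholder credentials.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Search within a 20 km radius of central London (latitude,longitude,radius).
london_20km = tweepy.Cursor(api.search,
                            q='my search term',
                            geocode='51.5074,-0.1278,20km').items(1000)

print(sum(1 for _ in london_20km), 'tweets found within 20 km of London')
```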

Tweet Location by Location 

Secondly, I tried to use location data, which is user defined. Out of the 31386 results, 22815 had location data: 73% of Tweets, a much better hit rate than the 0.6% for geocode data! Those 22815 tweets contained 6692 unique locations from 15727 users. I started to break these 6692 locations down into countries, beginning with the USA, having noticed that several entries contained a city/town plus a US state code.
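The counts above come from simple pandas aggregations along these lines (the column names 'location' and 'screen_name' are assumptions about how the scrape was flattened; df is the DataFrame from the earlier sketch):

```python
import pandas as pd

# Treat empty strings as missing so they are not counted as locations.
locations = df['location'].replace('', pd.NA)

print(locations.notna().sum(), 'tweets with a location')
print(locations.nunique(), 'unique locations')
print(df.loc[locations.notna(), 'screen_name'].nunique(), 'users supplying a location')
```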

Downloading a .csv file of US state codes, I loaded the data into a pandas DataFrame and parsed the location column to detect any cells containing a state code, changing any that did to 'USA'. This reduced the unique locations from 6692 to 5312. Noticing that some entries already said 'USA', I added that to the csv, leaving 5193 unique values. There were also variations in letter case, for example 'Sacramento – CA' could also be written as 'Sacramento – California' or 'sacramento – california'. So I decided to add state names (starting with capital letters) to the list, reducing it to 4770, reduced by one more when including ALLCAPS variations. Finally, adding all-lower-case state names to the list reduced it down to 4726.
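A minimal sketch of that replacement step, assuming the codes and names live in a hypothetical us_states.csv with 'code' and 'name' columns, and that the user text is in a 'location' column:

```python
import pandas as pd

states = pd.read_csv('us_states.csv')   # hypothetical file: columns 'code' and 'name'

# Build a regex of state codes plus the case variants discussed above.
variants = pd.concat([states['code'],             # e.g. 'CA'
                      states['name'],             # 'California'
                      states['name'].str.lower(),   # 'california'
                      states['name'].str.upper()])  # 'CALIFORNIA'
pattern = r'\b(?:' + '|'.join(variants) + r')\b'

# Any location mentioning a state (in any of those forms) becomes 'USA'.
mask = df['location'].str.contains(pattern, regex=True, na=False)
df.loc[mask, 'location'] = 'USA'
print(df['location'].nunique(), 'unique locations remain')
```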

Locations in the USA without a state declaration remained. It soon became clear that this was going to be an onerous process, and one that would likely become more complicated for other countries. The widespread use of, and identification with, states in the US made it easy; here in the UK we are less likely to identify our county (the state equivalent) and more likely to use a city or town. Furthermore, some user-defined locations were unusable, such as 'the milk way', 'paradise' and the appealingly named 'FART'! The location data was messy and inconsistent, so I decided to look for another method.

Tweet Location by Country Code

Country code was the next obvious category. It wouldn't give such specific location data, but country-level data would suffice; I was planning on grouping location data by country in the second part of my analysis anyway. Unfortunately, the country code data was sparse, with only 777 tweets populated, giving a total of 30609 missing values. Another not-so-great percentage, at roughly 2.5%.
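The sparsity check is a one-liner in pandas (assuming the nested place country code has been flattened into a hypothetical 'country_code' column):

```python
# Count how many tweets carry a country code at all.
populated = df['country_code'].notna().sum()
print(f"{populated} tweets have a country code; {len(df) - populated} are missing "
      f"({100 * populated / len(df):.1f}% populated)")
```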

Tweet Location by Time Zone

Figure 1. Tweet Location Derived from City Timezone

Moving on, time zone seemed to be the next reasonable locator. 17203 posts contained time-zone data, 55% of Tweets. There are currently just under 200 different time zones, and analysis of the dataframe showed that it contained 183; however, looking at a list of unique values, many of the stated zones were not official, most often declared by city rather than the actual time zone the city was part of. This had its advantage: since some time zones span countries, stating by city made the location data more accurate. I decided to start cleaning the data, beginning with entries in the USA, in the same way as for the location data. This reduced the number to 173. Next, I made a dictionary of world cities and countries (trying several different lists and settling on one adapted from here; I will be writing a future article on this process). Mapping these onto the dataframe reduced the entries to 154. I reduced this further to 124, and finally 119, by altering or removing the formatting of some entries: 'continent/city' prefixes and underscores instead of spaces, respectively. I should have done this before mapping, but I wanted to see the difference each change made.
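The mapping and formatting steps looked roughly like the following sketch, where city_to_country stands in for the full world-cities dictionary mentioned above and 'time_zone' is the assumed column name:

```python
# Stand-in for the full world-cities dictionary discussed above.
city_to_country = {'London': 'United Kingdom',
                   'Amsterdam': 'Netherlands',
                   'Quito': 'Ecuador',
                   'Athens': 'Greece'}

tz = (df['time_zone']
        .str.split('/').str[-1]               # drop 'Continent/' prefixes
        .str.replace('_', ' ', regex=False))  # 'Buenos_Aires' -> 'Buenos Aires'

# Map cities to countries; anything not in the dictionary is left as it was.
df['country'] = tz.map(city_to_country).fillna(tz)
print(df['country'].nunique(), 'unique entries after mapping')
```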

The remaining, uncleaned entries were either cities absent from my city file or multi-country time zones. A more comprehensive city file would have performed better and is something I am working on. Manually cleaning the rest of the data reduced the 183 unique entries to 78 countries plus 16 time zones. A plot of the countries is shown in Figure 1; the multi-country time zones were not included, which removed a large majority of Tweets, particularly in North America, with 4 of the 16 time zones being US and Canada zones totalling 11044 Tweets, 35% of all tweets.
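Figure 1 was produced with GeoPandas along these lines. The sketch below assumes an older GeoPandas release where the bundled 'naturalearth_lowres' dataset is still available (newer versions require downloading the Natural Earth shapefile separately), and country names may need reconciling with the dataset's 'name' column:

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Tweets per country, taken from the cleaned 'country' column.
counts = df['country'].value_counts()

# World country outlines bundled with older GeoPandas releases.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world['tweets'] = world['name'].map(counts)

ax = world.plot(column='tweets', cmap='viridis', legend=True,
                missing_kwds={'color': 'lightgrey'}, figsize=(12, 6))
ax.set_title('Tweet location derived from city time zone')
plt.show()
```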

Tweet Location by Coordinated Universal Time (UTC)


Figure 2. Frequency vs Coordinated Universal Time (UTC) for Tweets

Finally, I decided to use the Coordinated Universal Time (UTC) offset, which Twitter records in seconds. There are 39 UTC time zones in total; the scraped results gave 32, and the only cleaning required was to change the units from seconds to hours. In total, 17203 results had UTC data, 55% of all posts, the same as time zone; essentially, Tweets have neither or both.

A plot of frequency against UTC time zone can be seen in Figure 2. Clearly this was much more manageable than any of the other location parameters, although conversely it is the least accurate in terms of geographical placement, since places such as the UK and Morocco, or Greenland and Brazil, share the same UTC time, as seen in Figure 3.
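The cleaning and the frequency count behind Figure 2 amount to a few lines (assuming the offset was captured in a hypothetical 'utc_offset' column, in seconds):

```python
import matplotlib.pyplot as plt

# Twitter reports the offset in seconds, e.g. -18000 for UTC-5.
df['utc_hours'] = df['utc_offset'] / 3600

freq = df['utc_hours'].value_counts().sort_index()
ax = freq.plot(kind='bar', figsize=(10, 4))
ax.set_xlabel('UTC offset (hours)')
ax.set_ylabel('Number of tweets')
plt.show()
```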

Wondering why the acronym is UTC and not CUT? It turns out to be a compromise between the English and French (TUC) acronyms, with neither winning.

Conclusion

In conclusion, we have looked at five different methods for determining the location of Tweets from Twitter, using the Twitter API, Tweepy and a script hosted on GitHub and discussed on WordPress.

The first method, geocode, gives the most accurate data but is very rarely reported. The second, user-defined location, is less accurate but still good enough for location determination; however, the data is messy and would take too long to clean, so it was abandoned. The third method, country code, is a less accurate location determinant but still appropriate depending on requirements; unfortunately, the data was severely lacking. The fourth method, user-defined time zone, varied in accuracy between city and actual time zone. Fortunately, most of the data was biased towards cities and countries, so cleaning was easier than for location and could be done in a reasonable time. The final method, Coordinated Universal Time, is the least accurate measure of location given its worldwide longitudinal spread, but requires little to no cleaning (only a conversion from seconds to hours). It would be useful for determining the best time to Tweet, but is less valuable for determining market location demographics.

With respect to location demographics, it would appear that the USA is by far the biggest market, followed by the UK, the Netherlands, Ecuador and Greece, although all four of these countries had substantially fewer tweets than the USA. Further work and more complex analysis are needed to assess potential market sizes and demographics. Consequently, this is part of a new project; get in contact if you would like to join in, more details below.

Further Work: Tweet Location by Comparison of All Fields & Detailed Analysis of Location Demographics

I plan on writing a script to compare all of the location data fields from each tweet so that the maximum number of tweets can be geographically placed. A second script will be written to analyse how many users the tweets reach, their locations, and how many of them interact. It would also be desirable to determine the age and sex of users to build up a fuller picture of demographics.

Final Thoughts

For me, both tasks were harder than I first envisaged; however, they became a fantastic learning experience that also taught me a lot of other things, such as: various approaches to cleaning data in pandas; the value of complementary data sets; the limitations of using Python on Windows, inspiring me to switch to Linux; Linux basics; some important lessons around workflow; and, finally, a better knowledge of world geography. A lot from just one exercise.

Signing Out

I hope that, as a beginner, you have found this article useful; any feedback is appreciated. You can find me on Twitter @DrSCharlesworth. The script used was written by Alexander Galea and is hosted on GitHub, many thanks to him. Alex has also written an excellent blog article on the script and the plotting of geo-coordinates. Be sure to check it out here, where you can also find other great and helpful articles.

Figure 3. Official Standard Time Zones of the world, hosted on Wikimedia Commons.
