Goals: To analyze the public's opinion on DNA fingerprinting using publicly available tweets.
As DNA and ancestry companies continue to compile millions of genetic samples into databases, controversy has grown over whether this practice is ethical. In fact, Ancestry and 23andMe have collectively obtained more than 26 million DNA samples in their databases (Source). To analyze the public's sentiment on DNA fingerprinting, one application of such DNA databases, a Python script was written in Google Colab to search the Twitter API for a list of keywords.
List of technologies used:
tweepy (to search for tweets)
gspread (to create the database)
colab-env (to set environment variables)
demoji (to process and clean emojis)
textblob (sentiment and subjectivity analysis)
pandas (to store data before adding it to the database)
re (text and data cleaning)
os (to get environment variables)
*Tweets were queried for each day from 10-09-2021 to 11-03-2021.
Figure 1: Diagram of steps taken in collecting publicly available tweets.
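The per-day querying described above can be sketched as a list of (keyword, day) search windows. This is only an illustration under stated assumptions: the dates are read as MM-DD-YYYY (October 9 to November 3, 2021), the three keywords are the ones named in the results (the full list lives in the keywords sheet), and the function names are hypothetical. How each window is actually passed to tweepy is not shown here.

```python
from datetime import date, timedelta

# Keywords named in the results; the project's full list may be longer.
KEYWORDS = ["dna fingerprinting", "genetic fingerprinting", "genetic profiling"]


def daily_windows(start, end):
    """Yield (day, next_day) pairs for every day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day, day + timedelta(days=1)
        day += timedelta(days=1)


def build_search_plan(keywords, start, end):
    """Return one search window per keyword per day (hypothetical helper)."""
    return [
        {"keyword": kw, "since": since.isoformat(), "until": until.isoformat()}
        for since, until in daily_windows(start, end)
        for kw in keywords
    ]


# Assumes the date range is MM-DD-YYYY, i.e. 2021-10-09 through 2021-11-03.
plan = build_search_plan(KEYWORDS, date(2021, 10, 9), date(2021, 11, 3))
print(len(plan), plan[0])
```

Each entry in the plan corresponds to one search request, so a longer date range or keyword list grows the number of requests proportionally.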
Emojis were replaced with their text descriptions (for example, 🙂 becomes "Slightly Smiling Face").
This preprocessing was done in order to prepare tweets for sentiment analysis.
Figure 2: Diagram of steps taken in cleaning tweets.
Example tweet before data cleaning: "We 💚love💚 these photos of some very impressive students learning gel electrophoresis and DNA profiling ... in first year! 🤯 Thank you for sharing the photos @GoreyEtss. We're looking forward to seeing what these scientists do next! #BiotechExperience @ABEProgOffice https://t.co/idow3wAkSd"
Example tweet after data cleaning: ": green heart : love : green heart : photos impressive students learning gel electrophoresis DNA profiling .. first year ! : exploding head : Thank sharing photos @ GoreyEtss . 're looking forward seeing scientists next ! #BiotechExperience @ ABEProgOffice _IMAGE"
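The before and after example suggests three cleaning steps: links become a placeholder, emojis become text descriptions, and common filler words are dropped. The following is a simplified sketch of those steps, not the project's code. It uses only the standard library: `unicodedata.name` stands in for demoji (so some descriptions will differ from demoji's), the `STOP_WORDS` set is a small hypothetical list, and punctuation is not split into separate tokens as it is in the example above.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")

# Hypothetical, deliberately short stop-word list for illustration only.
STOP_WORDS = {"we", "these", "of", "some", "very", "and", "in",
              "you", "for", "the", "to", "what", "do"}


def describe_emoji(text):
    """Replace each symbol character with ' : <unicode name> : '."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "So":  # "Symbol, other" covers most emoji
            parts.append(f" : {unicodedata.name(ch, 'emoji').lower()} : ")
        else:
            parts.append(ch)
    return "".join(parts)


def clean_tweet(text):
    """Apply the three steps: URL placeholder, emoji text, stop-word removal."""
    text = URL_RE.sub("_IMAGE", text)  # placeholder taken from the example above
    text = describe_emoji(text)
    tokens = text.split()
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)


print(clean_tweet("We 💚love💚 these photos https://t.co/abc"))
```

Replacing emojis with words rather than deleting them keeps their emotional signal available to the sentiment model.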
After data collection and cleaning, sentiment analysis was performed on each tweet, allowing us to determine an average sentiment for each keyword we searched for.
Note: sentiment scores range over [-1, 1], with -1 being the most negative and 1 being the most positive.
Through sentiment analysis, it is possible to determine the public's opinion on DNA fingerprinting.
The keyword "genetic profiling" had the tweets with the most positive sentiment (0.13), while "genetic fingerprinting" had the tweets with the most negative sentiment. The average sentiment across all keywords indicates that the public has a slightly positive sentiment when discussing DNA fingerprinting.
Subjectivity analysis (how objective or opinionated) was performed on each tweet.
Note: The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
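The sentiment and subjectivity scores described above can both be read from a single TextBlob call and then averaged per keyword. The sketch below assumes the input is a list of (keyword, cleaned tweet text) pairs; the function names are hypothetical, and the project itself may have done the averaging with pandas instead.

```python
from collections import defaultdict
from statistics import mean


def textblob_scores(text):
    """Return (polarity, subjectivity) for one tweet using TextBlob."""
    # Imported lazily so the averaging helper below can run without textblob.
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def average_by_keyword(rows, scorer=textblob_scores):
    """Average polarity and subjectivity per keyword.

    rows: iterable of (keyword, cleaned_text) pairs.
    Returns {keyword: (mean polarity, mean subjectivity)}.
    """
    grouped = defaultdict(list)
    for keyword, text in rows:
        grouped[keyword].append(scorer(text))
    return {
        keyword: (mean(p for p, _ in scores), mean(s for _, s in scores))
        for keyword, scores in grouped.items()
    }
```

Passing the scorer as a parameter makes it easy to swap in a different sentiment model, or a stub for testing, without touching the averaging logic.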
Through subjectivity analysis, it was possible to determine how objective or subjective the tweets about each keyword were. For example, "dna fingerprinting" had the most objective tweets (0.35), while "genetic profiling" had the most subjective tweets (0.45). The average subjectivity across all keywords is 0.398. This means that, on average, the public is more objective than subjective when talking about DNA fingerprinting.
The source code, raw data, cleaned data, and analysis results are all published on GitHub in order to promote further research on this topic.
combined_data.zip contains all raw tweets, both with and without data cleaning.
overview_and_keywords.zip contains two csv files that give a general overview of the project and detail which keywords the script used to query the Twitter API. Changing the keywords on the keywords sheet allows you to query for different keywords.
raw_data.zip contains all of the raw tweets returned by the Twitter API.
results.zip contains csv files with sentiment, subjectivity, and phrase analysis scores.
The difference in average sentiment between keywords was 0.2, which may be explained by the time span in which tweets were collected rather than by overall sentiment over time.
The difference in average subjectivity between keywords was 0.10, which may be small enough to fall within the margin of error of the subjectivity analysis model.
Further research could be done in the following aspects: