Goals: To analyze the public's Opinion on DNA fingerprinting using publicly available tweets.
As DNA and ancestry companies continue to compile millions of genetic information samples into a database, controversy surrounding whether or not this is ethical. In fact, Ancestry and 23andMe have collectively obtained more than 26 million DNA samples in their databases. (Source). In order to analyze the public’s sentiment on DNA fingerprinting, which is one such application of creating a DNA database, a script written in Python using Google Colab was created to search the Twitter API for a list of keywords
List of technologies used:
tweepy
(to search for tweets)gspread
(to create database)colab-env
(to set environment variables)demoji
(to process and clean emojis)textblob
(sentiment and subjectivity analysis)nltk
(text preprocessing)pandas
(to store data before adding to database)re
(text and data cleaning)os
(to get environment variables)*Tweets were queried for each day between 10-09-2021 to 11-03-2021
Figure 1: Diagram of steps taken in collecting publicly available tweets.
_IMG
_URL
Slightly Smiling Face
)
heeeeeeeello
→ heello
.This preprocessing was done in order to prepare tweets for sentiment analysis.
Figure 2: Diagram of steps taken in cleaning tweets.
Example tweet before data cleaning: "We 💚love💚 these photos of some very impressive students learning gel electrophoresis and DNA profiling ... in first year! 🤯 Thank you for sharing the photos @GoreyEtss. We're looking forward to seeing what these scientists do next! #BiotechExperience @ABEProgOffice https://t.co/idow3wAkSd"
Example tweet after data cleaning: ": green heart : love : green heart : photos impressive students learning gel electrophoresis DNA profiling .. first year ! : exploding head : Thank sharing photos @ GoreyEtss . 're looking forward seeing scientists next ! #BiotechExperience @ ABEProgOffice _IMAGE"
After data collection and cleaning, sentiment analysis was performed on each tweet, allowing us to determine an average sentiment for each keyword we searched for.
Keyword | Sentiment Score |
---|---|
genetic fingerprinting | -0.07 |
dna profiling | 0.02 |
dna identification | 0.04 |
dna typing | 0.05 |
dna fingerprint | 0.06 |
genetic profiling | 0.08 |
dna profile | 0.09 |
dna fingerprinting | 0.1 |
genetic profile | 0.1 |
genetic fingerprint | 0.13 |
Note: sentiment scores ranged from [-1,1] with -1 being the most negative and 1 being the most positive
Through sentiment analysis, it is possible to determine the public's opinion on DNA fingerprinting.
Genetic profiling
contained tweets with the most positive sentiment (0.13), while
genetic fingerprinting
contained tweets with the most negative sentiment. Taking the
average sentiment across all keywords reveals that the average sentiment is 0.06
This
reflects that the public has a slightly positive sentiment when it comes to discussing DNA
fingerprinting.
Subjectivity analysis (how objective or opinionated) was performed on each tweet.
Keyword | Sentiment Score |
---|---|
dna fingerprinting | 0.35 |
genetic fingerprinting | 0.37 |
dna profiling | 0.38 |
dna fingerprint | 0.39 |
genetic fingerprint | 0.40 |
dna identification | 0.40 |
genetic profile | 0.40 |
dna profile | 0.41 |
dna typing | 0.43 |
genetic profiling | 0.45 |
Note: The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
Through subjectivity analysis, it was possible to determine how objective or subjective tweets about a
keyword were. For example, dna fingerprinting
contained tweets that were the most
objective (0.35), while genetic profiling
contained tweets that were the most subjective
(0.45). Taking the average of all keywords in subjectivity analysis reveals that the average
subjectivity was 0.398
. This means that on average, the public typically is more objective
than subjective when it comes to talking about DNA fingerprinting
The source code, raw data, cleaned data, and analysis results are all published on GitHub in order to promote further research on this topic.
API_Search_and_Analysis.ipynb
file.combined_data.zip
contains all raw tweets both with and without data cleaning.overview_and_keywords.zip
are two csv files that give a general overview of the project
and detail which keywords the script used to query the Twitter API with. Changing the keywords on
the keywords
sheet allows you to query for different keywords. raw_data.zip
contains all of the raw tweets pulled back from the Twitter API.results.zip
contains csv files with sentiment, subjectivity, and phrase analysis
scores.0.2
, which may be explained
by just the time span in which tweets were collected, rather than the overall sentiment over time.
0.10
, which may be small enough to be
accounted for in confidence errors by the subjectivity analysis model.Further research could be done in the following aspects: