Reply to thread

Message: <blockquote data-quote="KingBlack" data-source="post: 15958363" data-attributes="member: 23148">please keep in mind is a heavily edit version of my report, as I only included the results of the Twitter analysis:Executive SummaryThis report provides an analysis and evaluation of the abilities of machine learning to identify the dissemination of misleading information by individuals on social media platforms.  Methods of analysis include text mining, sentiment analysis and neural network.  These methods were compared against each other to confirm results against expectations.  The results of the analysis show that while it is possible to detect tweets that attempt to spread misinformation, it is extremely difficult to identify them at their inception.Introduction and BackgroundDuring the 2016 presidential election of the United States, potential voters had access to more media outlets then previously possible due to technology.  The lowered cost of technology has allowed segments of the population to obtain devices that were not possible a decade earlier.In previous elections, information on the subject came from a limited number of credible media outlets and flowed in one direction.  Now that social media platforms exist online, the consumer of information can also be the provider.  In addition, the low cost of acquisition to obtain the devices required for participation on social media platforms has essentially made every individual a media provider.This has given individuals the ability to manipulate peoples perception on a range of topics.  Russia’s state-controlled media Rossiya Segodnya director general Dimitry Kiselev said “Objectivity is a myth which is proposed and imposed on us.” [1] These tools include text, images, and recorded media among others.The text mining of Twitter was a focal point of this report.  Trending topics on Twitter indicate that a subject is currently in the top 10 of tweets on the platform.  This is accomplished by averaging 250 tweets per hour over a 6-hour span, from 750 unique users.  With this in mind, a diverse dataset is readily available for mining.For the report, a trending topic of a divisive nature was selected for the study.  R Studio function “TwitteR” is used to gather tweets related to that topic by its hashtag.  These tweets were sanitized in R to remove punctuation, stop words, numbers and formatting.The data collected will be used and compared against other sources related to the same subject.  The baseline document will come from an unbiased, reputable source.  That document will be analyzed for sentiment, tone, and perceived attitude and assigned a score.  The recovered tweets follow the same protocol, then compared.Significance of the TopicWhile there are many mining techniques used to identify trolls, none exist to identify them fast enough to alert administrators so quickly, that the spread of the misinformation can be stopped.  In this project, there will be attempting to identify users that exhibit behavior consistent with trolls without the use of large datasets related to the Twitter subject in question.  The innovation is that machine learning will be incorporated in an attempt to do this in a minimal number of tweets.Existing methodology uses hashtags of previously trending topics to gather data for analysis and determine what accounts related to the topic can be identified as trolls.  While that method does return a high success rate of learning, the troll has succeeded because the information has already been distributed.  Current methods tend to be tools for conformation instead of tools for prevention.  The large datasets lend itself to be formatted and prepared for machine learning, with the user waiting for certain results.  Knowing what to expect before the analysis could very well lead to biased sampling, making the results less reliable.Term Project Objective/Problem StatementThe purpose of this project is to identify a method of analysis that can detect misleading and divisive post on Twitter within minutes of it occurring.  The basic idea is to target new accounts that focus on hashtags related to current events that are politically polarizing.The analysis uses different methods to reach a conclusion about the post.  First, the post will be screened for the usage of negative and positive words.  This will give a general idea of the direction the author is steering their potential audience.Second, the sentiment of the tweet will be analyzed.  This will help determine the tone of the author and gives a better overall picture of the statement.Literature ReviewAs this topic is dynamic in nature, most of the reading of the subject were from online articles.  An article from medium.com confirmed a previous suspicion – Twitter rarely gets involved with accounts, even if they are aware of its abuse [2].  Bot accounts tend to be web based while human owned accounts are likely to originate from a mobile device [3].  The usage of psychosocial attacks[5] are a primary tool used by trolls to evoke fear and anger in a population to promote civil disorder.Solution ApproachThe methodology used for this project includes text mining for the most common terms and to discover user sentiment on the topic.  An attempt will be made to better understand the pattern and behavior of an account that uses a hashtag of a powder keg topic.An unbiased documentation of an event will be used for a baseline reading on the topic.  The selected document will undergo a sentiment analysis to determine the overall tone of the report.  This value will be used and compared with the selected tweet.Tweets are collected by using the function TwitteR to retrieve users that have commented on the hashtag.  User data can be collected, as well as followers of the account, and other accounts that are followed.  The information can be moved into a data frame, and location information, user id, actual text of the tweet, and much more can be extracted from the frame.*technical details omitted*This process occurs for files housing positive words list, negative words list, baseline document and tweets.  The count of negative words in the document is saved into a variable, then divided by the over word count of the document.  This value determines how “negative” the baseline document and tweet are.A sentiment analysis is the next item to be performed.  For the sentiment analysis, the process tries to determine the mood of the writer and gives the documents an overall score based on the sum of negative words less positive words.  Accounts that have an alarming difference in sentiment compared to the baseline document will be flagged and added to a list for an analysis in rtweet and botornot.  The botrnot package is used to predict if an account is a robot or human.Results or Solution EvaluationOverall, the plan of action was a complete failure.  Baseline documents, that were from reputable sources, could not help in determining the sentiment of a tweet.  There are several reasons this occurred.  File sanitation included stemming words and removing hashtags.  Stemming the words took many of them down to their core.  During the sentiment analysis, manual comparisons were performed so that it could be sanity checked against the computational results.  In many cases, over half of the words did not register to the machine due to stemming.  Manual manipulation of files was required to correct this.Hashtags often contained multiple words that are grouped together so that they could reach a larger audience. In an example, the February 2018 school shooting in Parkland, Florida, was used to collect data.  Users often incorporate hashtags in the sentences to make a point and reach a larger audience.  For example, #leftwingterrorist would show up as leftwingterrorist after removing the hashtag.  The word terrorist is included in the negative words sentiment list, but the comparison function isn’t intelligent enough to understand that it is dealing with 3 different words.  This lead to miscalculations in the sentiment index.  Manual corrections would require inspection of thousands of tweets from multiple users.  The messages may contain a common theme but be as unique as the author in its delivery.  It simply wasn’t possible to manually sanitize each tweet for analysis.An issue that became clear during the analysis was the media of choice being delivered.  While most people think of “140 characters” when they hear Twitter, there have been many improvements and changes in the last 5 years on the platform.  Alternative forms of media delivery include image and video uploads.  Recently, prerecorded video and images are outpacing typed words.Several controversial tweets recovered contained multiple hashtags and an image. The hashtag simply served as means of spreading the message to a larger audience while the image contained the message with embedded text.  It was not possible to get a sentiment score for messages of this nature.  The embedded messages were often hateful, misleading information with numerous words that would register a hit in the sentiment analyzer.   During the analysis, it was discovered that coded analyzers were unable to determine the author's tone in a message.  *EDITED*The baseline document negative score was 0.04316.The following tweet received a negative score of 0.04545:*EDITED*There was almost no difference in the score, yet the user uses the embedded image and hashtags to reach a larger audience. The baseline document is factual and without bias while the tweet contains a symbol associated with hate groups, as well as an image that is emotionally disturbing.  While they are not polar opposites, they clearly are not targeting the same audience.Many tweets of this nature used the hashtag to engage the subject yet steered away from the topic to push a separate agenda.  This was confirmed by doing a word cloud of the tweets with #parkland ranging from February 15th, 2018 to March 15th, 2018.*WORD CLOUD I MADE FROM TEXT MINING THE MOST COMMON WORDS USED IN TWEETS WITH #PARKLAND*[ATTACH=full]1501607[/ATTACH]Most individuals would likely have difficulty identifying the topic correctly based on the word cloud.*skipping to the lesson learned*Any attempt at achieving success in this field would require a massive investment in several technologies.  Word based sentiment keywords would be required to not only identify words, but phrases using those words.  It may be possible to use an NLP (natural language processing) classifier to alter the score of a word based on the word(s) that follow it.-I'M NOT SAVING THAT DROWNING BABY-I'M NOT LETTING THAT BABY DROWN Two completely different sentiments in the above statements, yet they net the same score.  The usage of NLP may help researchers develop a method that allows them to develop a more concise list of words and the weight they carry by examining the words that proceed and follow them.Another missing technology that is needed is text recognition within images.  This needs to be incorporated to have any realistic chance at predicting the sentiment of a tweet.  Additionally, speech to text would be required, so that all forms of communications could be intercepted for analysis.  If any form of communication is excluded, the analysis should be considered unreliable and incomplete.While the methodology was a complete failure, many lessons were learned.  The biggest of which, is that no computer system can bypass human ingenuity.  Simply rearranging the order of your words can manipulate or defeat most analyzers.  Inversely, simply mandating a two-day waiting period before your first tweet would likely eliminate many trolling tweets.  In addition, require a tweet within 5 days of signing up before releasing the account.  This would eliminate an entity from making multiple accounts and storing them until they needed it.While the goal of the analysis was to detect accounts of a divisive nature, it does appear as if Twitter have tools to accomplish this.  One item of interest is that it appears foreign accounts are removed, making it impossible for someone outside of the organization to collect data for study.  American based accounts are left untouched. These accounts had similarities; a high number of followers, high number of tweets, and heavy usage of imagery.  Also, these accounts tend to have more retweets then original content.  The user “*EDIT*” could potentially reach over 18,500,000 users at two levels based on the data collect on that account.</blockquote>

Verification