Find:\s* \s
replace: ,
Explanation: My goal for the regex in this question was to condense the columns from a regular dataframe into csv format by locating all the white spaces within the text and then condensing them by separating them with a comma. And so I searched for all the white spaces using the expression “” and then replaced them with the “,” sign.
Find:(\w+), (\w+), ((.+)$)
replace:\2 \1 \((\3)\)
Explanation: In this case, my regex captures three groups: two words using the expression “(+)” and the remainder of the line using “((.+)$)”. The replacement rearranges these groups and then adds parentheses around the third group.
Find:(\d{4})
replace:\n(\1)
My Thought Process: In my regex, I search for each match of 4 consecutive digits, and then replace that sequence with a newline, followed by the 4 digits enclosed in parentheses.
Find:(\d{4})\s+(.*?)(\.mp3)
replace:\2_ (\1\3)
My Thought Process: In my regex, I searched for and removed the leading 4-digit characters and whitespace from the filename using the the expression “()+(.*?)(.mp3)“, I then keep the song title at the front and add the 4-digit number to the end of the filename before .mp3 using the expression \2_(\1\3)” in the replace section.
Find:(\w)(?:\w*),(\w+),(\d+(?:\.\d+)?),(\d+)
replace:\1_\2,\4
My Thought Process: My regex for the find is designed to match lines with four comma-separated fields, where the first field is a word, the second field is a word, the third is a number (which has a decimal), and the fourth is an integer. I then replace the remove all the letters of the first word leaving only the first letters and an underscore separating both words. I also take out the number with a decimal.
Find:(\w)(?:\w*),(\w{4})\w*,\d+(?:\.\d+)?,(\d+)$$
replace:\1_ \2,\3
My Thought Process: My regex extracts the first letter of the first word, the first four letters of the second word, and the last numeric value from each line using the expression:“()(?:),(),(?:.)?,()”. I then insert an underscore infront of the first extracted ltter and then drop the numbers with the decimals.
Find:(\w{3})\w*,(\w{3})\w*,\d+(?:\.\d+)?,(\d+)
replace:\1\2,\4,\3
My Thought Process: My regex first extracts the first three letters of the genus and species names using the expressions “(())” and then extracts both numeric values (decimal and non-decimal). I then do the swaps of their order using the expression “\1\2,\4,\3”.
In the “pathogen_binary” column, I observe a lot of NAs instead of the regular binary inputs (i.e. 0s and 1s). So I decided to replace the NAs with 0s by using the regex: “ to search and then replaced them with 0s.
I realized several irregularities in the naming convention of the species. There were several additional characters added to the names of the species. I decided to correct them by searching through the columns to identify all characters that are not spaces, single characters, or line breaks. And then removed the underscores since they are not special characters using the regex: “[^\w\s\r\n]” and the finding all underscores and then replacing them with nothing.
Here the I realized that drones are referred to as “male”. I searched them with the term “male” and then replaced them with “drone”.