Regular expression Puzzles

Question One

Find:\s* \s

replace: ,

Explanation: My goal for the regex in this question was to condense the columns from a regular dataframe into csv format by locating all the white spaces within the text and then condensing them by separating them with a comma. And so I searched for all the white spaces using the expression “” and then replaced them with the “,” sign.

Question Two

Find:(\w+), (\w+), ((.+)$)

replace:\2 \1 \((\3)\)

Explanation: In this case, my regex captures three groups: two words using the expression “(+)” and the remainder of the line using “((.+)$)”. The replacement rearranges these groups and then adds parentheses around the third group.

Question Three

Find:(\d{4}) 

replace:\n(\1)

My Thought Process: In my regex, I search for each match of 4 consecutive digits, and then replace that sequence with a newline, followed by the 4 digits enclosed in parentheses.

Question four

Find:(\d{4})\s+(.*?)(\.mp3)

replace:\2_ (\1\3)

My Thought Process: In my regex, I searched for and removed the leading 4-digit characters and whitespace from the filename using the the expression “()+(.*?)(.mp3)“, I then keep the song title at the front and add the 4-digit number to the end of the filename before .mp3 using the expression \2_(\1\3)” in the replace section.

Question five

Find:(\w)(?:\w*),(\w+),(\d+(?:\.\d+)?),(\d+)

replace:\1_\2,\4

My Thought Process: My regex for the find is designed to match lines with four comma-separated fields, where the first field is a word, the second field is a word, the third is a number (which has a decimal), and the fourth is an integer. I then replace the remove all the letters of the first word leaving only the first letters and an underscore separating both words. I also take out the number with a decimal.

Question Six

Find:(\w)(?:\w*),(\w{4})\w*,\d+(?:\.\d+)?,(\d+)$$  

replace:\1_ \2,\3

My Thought Process: My regex extracts the first letter of the first word, the first four letters of the second word, and the last numeric value from each line using the expression:“()(?:),(),(?:.)?,()”. I then insert an underscore infront of the first extracted ltter and then drop the numbers with the decimals.

Question Seven

Find:(\w{3})\w*,(\w{3})\w*,\d+(?:\.\d+)?,(\d+)

replace:\1\2,\4,\3

My Thought Process: My regex first extracts the first three letters of the genus and species names using the expressions “(())” and then extracts both numeric values (decimal and non-decimal). I then do the swaps of their order using the expression “\1\2,\4,\3”.

Data Curation Using REGEX

The pathogen_binary Column

In the “pathogen_binary” column, I observe a lot of NAs instead of the regular binary inputs (i.e. 0s and 1s). So I decided to replace the NAs with 0s by using the regex: “ to search and then replaced them with 0s.

The Bombus_spp and Host_spp Columns

I realized several irregularities in the naming convention of the species. There were several additional characters added to the names of the species. I decided to correct them by searching through the columns to identify all characters that are not spaces, single characters, or line breaks. And then removed the underscores since they are not special characters using the regex: “[^\w\s\r\n]” and the finding all underscores and then replacing them with nothing.

The bee_caste column

Here the I realized that drones are referred to as “male”. I searched them with the term “male” and then replaced them with “drone”.