Last Updated 2018-10-09
The purpose of this review is simply to list common data analysis procedures used in quantitative methods research and to outline the SPSS point-and-click procedures for accomplishing them. This document will be updated over time. The commands here are based on SPSS Version 24. I would not recommend starting with this document if you are just beginning with SPSS. Often there are multiple ways to conduct the same analysis; I present only one for each item.
Data Cleaning
Counting missing data
- Analyze > Descriptive Statistics > Frequencies
- Select the variable(s)
- Click “Continue” and then “OK”
Missing data counts will be at the top of the resulting output.
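If you save your syntax, these steps correspond roughly to a FREQUENCIES command. A minimal sketch, assuming hypothetical variables named age and income:

```spss
* age and income are hypothetical variable names.
FREQUENCIES VARIABLES=age income
  /ORDER=ANALYSIS.
```

The Statistics table at the top of the output reports the valid and missing N for each variable.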
Edit variable values
- Transform > Recode into Same Variables…
- Select the variable to transform and move it into the right column.
- Click “Old and New Values…”
- Under “Old Value”, enter either a specific value you would like to replace or a set of values you would like to replace.
- Under “New Value”, enter what the replacement value should be.
- Click “Add” under “New Value”.
- Click “Continue” and then “OK”.
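In syntax, the same kind of edit is a RECODE command. The sketch below assumes a hypothetical numeric variable named satisfaction, where 99 is a placeholder code to be set to system-missing and the values 1 to 5 are collapsed into two groups:

```spss
* satisfaction and its codes are hypothetical.
* Each parenthesis pairs old value(s) with a new value.
RECODE satisfaction (99=SYSMIS) (1 thru 2=0) (3 thru 5=1).
EXECUTE.
```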
Create a variable
- Transform > Compute Variable…
- Click “Type and Label…” to set the variable type, then click “Continue”.
- Enter the value for the variable. If it is a string, include the value in quotes.
- OR enter a formula for the variable based on the existing variables.
- Click “OK”.
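A rough syntax equivalent uses COMPUTE. The variable names below (total_score, math_score, reading_score, cohort) are hypothetical; note that a new string variable has to be declared with STRING before a value is assigned to it:

```spss
* Numeric variable built from a formula on existing variables.
COMPUTE total_score = math_score + reading_score.
EXECUTE.

* String variable: declare the width first, then assign a quoted value.
STRING cohort (A8).
COMPUTE cohort = 'Fall2018'.
EXECUTE.
```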
Create dummy variables from categorical variables
- Transform > Create Dummy Variables
- Move the categorical variable into “Create Dummy Variables for.”
- Under “Root names (one per selected variable)”, type whatever you want to be the prefix for the dummy variables. (Suggestion: Use the name of the original variable, followed by an underscore.)
- Click “OK”.
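If the dialog is not available, dummy variables can also be built by hand. The sketch below is a manual alternative, not the syntax the dialog generates; it assumes a hypothetical numeric variable named region coded 1, 2, and 3:

```spss
* region and its codes are hypothetical.
* A logical expression evaluates to 1 when true and 0 when false.
* Cases with missing region stay missing on each dummy.
COMPUTE region_1 = (region = 1).
COMPUTE region_2 = (region = 2).
COMPUTE region_3 = (region = 3).
EXECUTE.
```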
Delete a variable
- Right-click on the column header
- Click “Clear”.
This does not produce syntax in the Output window. If you are saving your syntax, the command for deleting a variable is DELETE VARIABLES, followed by the name(s) of the variable(s) to drop.
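For instance, assuming a hypothetical variable named temp_score that is no longer needed:

```spss
* temp_score is a hypothetical variable name.
DELETE VARIABLES temp_score.
```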
Drop observations based on some condition (KEEP observations meeting the opposite)
- Data > Select Cases… > Select “If condition is satisfied” > If…
- Enter the condition based on which observations you would like to keep, then click “Continue”. (Remember that when a condition checks whether a string variable – one that uses letters instead of numbers – is equal to some value, you must put that value in quotes.)
- Select “Delete unselected cases”.
- Click “OK”.
You can specify multiple conditions at the same time by separating them with AND or OR.
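As a sketch of what this looks like in syntax, assuming a hypothetical numeric variable age and a hypothetical string variable gender, the following keeps only adult women and permanently drops everyone else:

```spss
* age and gender are hypothetical variable names.
* String values go in quotes and numeric values do not.
SELECT IF (age >= 18 AND gender = 'Female').
EXECUTE.
```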
Merging datasets
- Data > Merge Files > Add Variables…
- Note that the datasets you are merging must already be saved as SPSS (.sav) format files. In addition, the variables you are matching on must have the same name across datasets.
- Select “An external SPSS statistics data file”, browse for your file, and select it.
- Select “Match cases on key variables”, click on the matching variable, and add it to “Key Variables”.
- Click “OK”.
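One way to write a keyed merge in syntax is MATCH FILES; the syntax the dialog pastes may look different depending on the options chosen. This sketch assumes a hypothetical key variable student_id and a hypothetical file scores.sav:

```spss
* student_id and scores.sav are hypothetical.
* MATCH FILES with BY expects both files to be sorted on the key,
* so scores.sav should already be saved sorted by student_id.
SORT CASES BY student_id.
MATCH FILES /FILE=*
  /FILE='scores.sav'
  /BY student_id.
EXECUTE.
```

The asterisk in /FILE=* refers to the active dataset.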
Appending datasets
- Data > Merge Files > Add Cases…
- Note that the datasets you are appending must already be saved as SPSS (.sav) format files. In addition, variables that should be stacked together must have the same name across datasets.
- Select “An external SPSS statistics data file”, browse for your file, and select it.
- All variables already in both datasets will appear in “Variables in New Active Dataset”, and variables not in both datasets will be in “Unpaired Variables”. Move all unpaired variables you want into the right column.
- Click “OK”.
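In syntax, appending is done with ADD FILES. A minimal sketch, assuming a hypothetical second file named wave2.sav:

```spss
* wave2.sav is a hypothetical file name.
* Cases from wave2.sav are stacked below the cases in the active dataset (*).
ADD FILES /FILE=*
  /FILE='wave2.sav'.
EXECUTE.
```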
Reshaping datasets
From long to wide format:
- Data > Restructure…
- Select “Restructure selected cases into variables”.
- Move all the variables that are not to be reshaped (those that are consistent across rows for a unit) into “Identifier variable(s)”, then click “Next”.
- Select “Yes – data will be sorted by the Identifier and Index variables”, then click “Next”.
- Select “Group by original variable”, then click “Next”.
- Select “Restructure the data now”, then click “Finish”.
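The long-to-wide restructure can be sketched in syntax with CASESTOVARS. The names student_id (identifier) and wave (index) are hypothetical:

```spss
* student_id and wave are hypothetical variable names.
* Sort by the identifier and index before restructuring.
SORT CASES BY student_id wave.
CASESTOVARS
  /ID=student_id
  /INDEX=wave
  /GROUPBY=VARIABLE.
```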
From wide to long format (if you have only one variable that needs to be changed):
- Data > Restructure…
- Select “Restructure selected variables into cases”.
- Move all the variables that are not to be reshaped (those that are consistent across rows for a unit) into “Identifier variable(s)”, then click “Next”.
- Select “One” for the number of variable groups, then click “Next”.
- Move the identification variable (e.g., student ID) into the slot in “Case Group Identification”.
- Move the wide variables to be transposed into the slot in “Variables to be Transposed”.
- Move all other variables that should be the same for all rows of a case into the slot in “Fixed Variable(s)”.
- Select “One” for the number of index variables, then click “Next”.
- Choose how you want the different rows for each case identified, either by sequential numbers or by the wide variable names, then click “Next”.
- Specify what to do with variables that you didn’t include and what to do with missing data in the wide variables, then click “Next”.
- Select “Restructure the data now”, then click “Finish”.
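A rough syntax equivalent is VARSTOCASES. This sketch assumes three hypothetical wide variables (score1 to score3) that become one long variable named score, with student_id and gender carried along unchanged:

```spss
* All variable names here are hypothetical.
* MAKE stacks the wide variables into one new variable.
* INDEX creates a sequential counter (1 to 3) identifying each row.
* KEEP lists the variables that stay the same for every row of a case.
* NULL=KEEP retains rows where the wide variable was missing.
VARSTOCASES
  /MAKE score FROM score1 score2 score3
  /INDEX=wave(3)
  /KEEP=student_id gender
  /NULL=KEEP.
```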
Descriptive Statistics
Central tendency: mean, median, and mode (for continuous variable)
- Analyze > Descriptive Statistics > Frequencies
- Select the continuous variable(s)
- Uncheck “Display frequency tables”
- Click “Statistics…” and check the desired central tendency measures
- Click “Continue” and then “OK”
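In syntax this is a FREQUENCIES command with the table suppressed. A sketch assuming a hypothetical continuous variable named income:

```spss
* income is a hypothetical variable name.
* NOTABLE corresponds to unchecking the frequency table option.
FREQUENCIES VARIABLES=income
  /FORMAT=NOTABLE
  /STATISTICS=MEAN MEDIAN MODE
  /ORDER=ANALYSIS.
```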
Central tendency: mode and frequency table (for categorical variable)
- Analyze > Descriptive Statistics > Frequencies
- Select the categorical variable(s)
- Check “Display frequency tables”
- Click “Format” and select “Descending counts”
- Click “Continue” and then “OK”
The top item in the frequency table is the mode. Note that if multiple categorical variables are selected, a separate frequency table will be created for each variable.
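The descending-count table can be sketched in syntax as follows, assuming a hypothetical categorical variable named major:

```spss
* major is a hypothetical variable name.
* DFREQ sorts the table by descending frequency, so the mode is listed first.
FREQUENCIES VARIABLES=major
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.
```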
Variability: Standard deviation, variance, and range (for continuous variable)
- Analyze > Descriptive Statistics > Descriptives
- Select the continuous variable(s)
- Click “Options” and select the desired measures of spread
- Click “Continue” and then “OK”
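A minimal syntax sketch of the same request, again assuming a hypothetical variable named income:

```spss
* income is a hypothetical variable name.
DESCRIPTIVES VARIABLES=income
  /STATISTICS=STDDEV VARIANCE RANGE.
```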
Crosstabulation
- Analyze > Descriptive Statistics > Crosstabs…
- Put one of the categorical variables in the Row(s) box
- Put the other categorical variable in the Column(s) box
- Use the “Cells” menu to indicate if you want row or column percentages
- Click “OK”
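In syntax, a crosstabulation with row percentages might look like this, assuming hypothetical categorical variables gender and major:

```spss
* gender and major are hypothetical variable names.
* Use COLUMN in place of ROW for column percentages.
CROSSTABS
  /TABLES=gender BY major
  /CELLS=COUNT ROW.
```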
Conditional Means
- Analyze > Compare Means > Means…
- Put the continuous variable in the Dependent List box
- Put the categorical variable in the Layer 1 of 1, Independent List box
- Click “OK”
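A sketch of the corresponding MEANS command, assuming a hypothetical continuous variable income and a hypothetical categorical variable major:

```spss
* income and major are hypothetical variable names.
* Reports the mean, N, and standard deviation of income within each major.
MEANS TABLES=income BY major
  /CELLS=MEAN COUNT STDDEV.
```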
Correlation
- Analyze > Correlate > Bivariate
- Select all variables that you wish to correlate
- Click “OK”
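The equivalent syntax is a CORRELATIONS command. The variable names below (income, age, gpa) are hypothetical:

```spss
* income, age, and gpa are hypothetical variable names.
* TWOTAIL requests two-tailed significance tests.
* PAIRWISE uses every case available for each pair of variables.
CORRELATIONS
  /VARIABLES=income age gpa
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.
```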
Bivariate Hypothesis Testing
One-Sample T Test
- Analyze > Compare Means > One-Sample T Test…
- Select the variable
- Use the “Options” menu to set the confidence interval level
- Set the population mean in “Test Value”
- Click “OK”
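As a syntax sketch, assuming a hypothetical variable named score tested against a hypothetical population mean of 50 with a 95% confidence interval:

```spss
* score and the test value of 50 are hypothetical.
T-TEST
  /TESTVAL=50
  /VARIABLES=score
  /CRITERIA=CI(.95).
```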
Two-Sample Independent T Test
- Data must be organized such that the continuous variable is one variable and the categorical grouping variable is the other variable.
- Analyze > Compare Means > Independent-Samples T Test…
- Select the continuous variable and move it to “Test Variable(s)” selection
- Select the categorical grouping variable and move it to the “Grouping Variable” selection
- Click “Define Groups…”
- Enter the two values for the two groups that will be compared (e.g., 1 and 0, or “Male” and “Female”)
- Click “Continue” and then “OK”
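In syntax, the group values entered under “Define Groups…” appear in parentheses after the grouping variable. This sketch assumes a hypothetical 0/1 variable named treated and a hypothetical outcome named score:

```spss
* treated and score are hypothetical variable names.
* For a string grouping variable, quote the values, e.g. gender('Male' 'Female').
T-TEST GROUPS=treated(0 1)
  /VARIABLES=score
  /CRITERIA=CI(.95).
```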
Two-Sample Dependent T Test
- Data must be organized such that the continuous variable is in two separate variables, one for each time period/half of the paired sample.
- Analyze > Compare Means > Paired-Samples T Test…
- Select the two continuous variables and move them over to the right side – they should be under “Pair 1”
- Click “OK”
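A minimal syntax sketch for the paired test, assuming hypothetical variables named pretest and posttest:

```spss
* pretest and posttest are hypothetical variable names.
T-TEST PAIRS=pretest WITH posttest (PAIRED)
  /CRITERIA=CI(.95).
```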
Correlation
- Analyze > Correlate > Bivariate
- Select all variables that you wish to correlate
- Click “OK”
Chi-squared test of independence
- Analyze > Descriptive Statistics > Crosstabs…
- Move one categorical variable to the “Row(s)” box
- Move the other categorical variable to the “Column(s)” box
- Click “Statistics…”
- Check “Chi-square” and click “Continue”
- Click “OK”
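In syntax, the chi-squared test is an extra subcommand on CROSSTABS. A sketch with hypothetical variables gender and major:

```spss
* gender and major are hypothetical variable names.
CROSSTABS
  /TABLES=gender BY major
  /STATISTICS=CHISQ
  /CELLS=COUNT.
```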
One-way ANOVA
- Analyze > Compare Means > One-Way ANOVA…
- Move the continuous variable to the “Dependent List” box
- Move the categorical variable to the “Factor” box
- Click “Post Hoc…”
- Check “Tukey” and click “Continue”
- Click “OK”
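A syntax sketch of the same analysis, assuming a hypothetical continuous variable score and a hypothetical numeric factor named program:

```spss
* score and program are hypothetical variable names.
* POSTHOC requests Tukey comparisons at the 0.05 level.
ONEWAY score BY program
  /POSTHOC=TUKEY ALPHA(0.05).
```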
Regression Methods
Ordinary least squares regression
- Analyze > Regression > Linear
- Move your dependent variable into the spot for “Dependent”
- Move your independent variable(s) into the spot for “Block 1 of 1”
- If you want VIF statistics, click the “Statistics” button, select “Collinearity diagnostics”, then click “Continue”.
- Click “OK”
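In syntax, a linear regression with collinearity diagnostics can be sketched as below. The variables income (dependent), age, and education (independent) are hypothetical:

```spss
* income, age, and education are hypothetical variable names.
* COLLIN and TOL add the collinearity diagnostics, including VIF.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
  /DEPENDENT income
  /METHOD=ENTER age education.
```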
Binary logistic regression
- Analyze > Regression > Binary Logistic
- Move your dependent variable into the spot for “Dependent”
- Move your independent variable(s) into the spot for “Block 1 of 1”
- Click “Save”, select “Probabilities”, then click “Continue” (not important for the modeling itself, but the predicted probabilities are useful for other steps later)
- Click “OK”
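A rough syntax equivalent, assuming a hypothetical 0/1 outcome named graduated and hypothetical predictors gpa and attendance:

```spss
* graduated, gpa, and attendance are hypothetical variable names.
* SAVE=PRED stores the predicted probabilities as a new variable.
LOGISTIC REGRESSION VARIABLES graduated
  /METHOD=ENTER gpa attendance
  /SAVE=PRED.
```

The saved variable is typically named PRE_1 on the first run; check the Data Editor for the actual name.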
Getting the ROC curve for a logistic model
- Run the logistic regression model as described above
- Analyze > ROC Curve…
- Move your predicted probabilities variable to “Test Variable”
- Move your binary outcome variable to “State Variable”
- Assuming your binary outcome is a 0/1 variable, type “1” in “Value of State Variable”
- Make sure “ROC Curve”, “With diagonal reference line”, and “Standard error and confidence interval” are checked
- Click “OK”
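As a syntax sketch, assuming the predicted probabilities were saved in a variable named PRE_1 and the binary outcome is a hypothetical 0/1 variable named graduated:

```spss
* PRE_1 and graduated are assumed names.
* (1) is the value of the state variable that counts as a positive case.
* CURVE(REFERENCE) draws the curve with the diagonal reference line.
* PRINT=SE adds the standard error and confidence interval.
ROC PRE_1 BY graduated (1)
  /PLOT=CURVE(REFERENCE)
  /PRINT=SE
  /CRITERIA=CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95)
  /MISSING=EXCLUDE.
```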
Ordinal logistic regression
- Analyze > Regression > Ordinal…
- Move your dependent variable into the spot for “Dependent”
- Move your independent variable(s) into the spot for “Covariate(s)” (It is suggested that you convert all of your categorical independent variables into dummy variables and include the dummy variables instead of the original categorical variables.)
- Click “Output”, select “Test of parallel lines”, then click “Continue”
- Click “OK”
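The ordinal model is run by the PLUM command in syntax. This sketch assumes a hypothetical ordinal outcome named satisfaction and hypothetical covariates age and income:

```spss
* satisfaction, age, and income are hypothetical variable names.
* WITH introduces covariates.
* TPARALLEL requests the test of parallel lines.
PLUM satisfaction WITH age income
  /LINK=LOGIT
  /PRINT=FIT PARAMETER SUMMARY TPARALLEL.
```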
Multinomial logistic regression
- Analyze > Regression > Multinomial Logistic
- Move your dependent variable into the spot for “Dependent”
- You can set the reference category using the “Reference Category…” menu.
- Move your independent variable(s) into the spot for “Covariate(s)” (It is suggested that you convert all of your categorical independent variables into dummy variables and include the dummy variables instead of the original categorical variables.)
- Click “OK”
(As of September 2016, SPSS does not support a test for independence of irrelevant alternatives.)
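A minimal syntax sketch of the multinomial model, assuming a hypothetical nominal outcome named major and hypothetical covariates gpa and age:

```spss
* major, gpa, and age are hypothetical variable names.
* BASE sets the reference category, here the last one.
NOMREG major (BASE=LAST ORDER=ASCENDING) WITH gpa age
  /PRINT=PARAMETER SUMMARY LRT.
```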
Miscellaneous Analysis Tools
Open a non-SPSS format file in SPSS
To open a non-SPSS format file in SPSS, you must open SPSS first. Once SPSS is open…
- In the “Recent Files” pane, click “Open another file…”
- Navigate to the location of the file on your computer.
- In the “Files of type” section, change the option to “All Files (*.*).”
- Select your file and click “Open.”
Note: SPSS has a weird quirk where sometimes, on some computers, when you go through the above steps, the opened file will appear to be empty if you concurrently have said file opened in Microsoft Excel. To be safe, when trying to open data in SPSS that is non-SPSS format, close the data in Microsoft Excel first.
Specify the “working” dataset
Logic: When using point-and-click, SPSS can only refer to one dataset at a time, even though it is possible to have multiple datasets open at once.
You can specify which dataset you are working from using the DATASET ACTIVATE command followed by the name of the dataset.
You can figure out the name of the dataset by finding the syntax line that opened the dataset in your output window (it starts with GET FILE) and looking for where it says DATASET NAME.
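As a minimal sketch, assuming the dataset you want is named DataSet1 (substitute the name you found after DATASET NAME):

```spss
* DataSet1 is an assumed dataset name.
DATASET ACTIVATE DataSet1.
```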
Conduct analysis for subset of observations
Logic: Rather than attaching the condition to the specific command as is the case in other languages (“Do X if Y”), SPSS workflow requires you to “filter” your data, which temporarily allows you to run commands on a subset of the data. When you are done, you can restore the full set of data.
- Data > Select Cases… > Select “If condition is satisfied” > If…
- Enter the condition based on which observations you would like to keep, then click “Continue”.
- Select “Filter out unselected cases”.
- Click “OK”.
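In syntax, filtering is done by computing a 0/1 variable and filtering on it. The sketch below assumes a hypothetical condition on a variable named age; filter_$ is the name SPSS conventionally gives the filter variable:

```spss
* age and the condition are hypothetical.
* Cases where filter_$ is 0 or missing are excluded from analyses but not deleted.
USE ALL.
COMPUTE filter_$ = (age >= 18).
FILTER BY filter_$.
EXECUTE.
```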
When you are done analyzing your filtered subset, you can restore the full set of data by running FILTER OFF followed by USE ALL in a syntax window.
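A minimal sketch of the restore step:

```spss
* Turns the filter off and makes all cases available again.
FILTER OFF.
USE ALL.
EXECUTE.
```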