Taking the sum of a TRUE/FALSE vector in R
I am working on analyzing SNP data from a fungus, and I am trying to impute missing data by changing the n calls to the genotype of the more frequent allele (see below). newdata is a matrix of SNPs (rows) and fungal isolates (columns). The genotype of each SNP is in 0, 1, and n format, which is why I am trying to impute the missing genotypes.
    newdata_imputed = newdata
    for (k in 1:nrow(newdata)) {
      u = newdata[k,]
      x <- sum(u == 0)
      y <- sum(u == 1)
      all_freq = y/(x + y)
      if (all_freq < 0.5) {
        newdata_imputed[k,] = gsub("n", 0, u)
      } else {
        newdata_imputed[k,] = gsub("n", 1, u)
      }
      print(k)
    }

However, I keep getting this error:
    [1] 295
    [1] 296
    Error in if (all_freq < 0.5) { : missing value where TRUE/FALSE needed

It is obvious that the code runs and then stops after encountering a problem. Please, can anyone tell me what I am doing wrong? I am a newbie to R, and any advice is appreciated.
@akrun, the reason why I used a loop is that this is nested in another loop. So, after using your code:
    newdata = as.data.frame(newdata)
    u = newdata
    all_freq <- rowSums(u == 1)/rowSums((u == 1) | (u == 0))
    indx <- all_freq < 0.5
    indx1 <- indx & !is.na(indx)
    indx2 <- !indx & !is.na(indx)
    newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='n', replacement=0)
    newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='n', replacement=1)
    newdata[] <- lapply(newdata, as.numeric)

I got weird values:
    newdata[1:10,1:10]
      V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
    1  3  3  3  3  3  3  3  3  3   3
    2  2  2  2  2  2  2  2  2  2   2
    3  3  3  3  3  3  3  3  3  3   3
    4  1  1  1  1  1  1  1  1  1   1

Please, where is the "3" coming from? I should have only 0 or 1.
We can do this using rowSums. As @bergant and @matthewlundberg mentioned in the comments, if there are rows with no 0 or 1 elements, we get NaN from the calculation (0/0). One way to handle this is to modify the logical condition by including !is.na, i.e. keep only the elements that are not NA along with the previous condition.
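To see why the loop in the question stops, here is a minimal sketch for a single row that contains neither 0 nor 1. The vector `u` and the name `is_minor` are made up for illustration and are not part of the original code.

```r
# Hypothetical row with no 0 or 1 calls (every genotype is missing)
u <- c("n", "n", "n")

x <- sum(u == 0)          # number of 0 calls: none
y <- sum(u == 1)          # number of 1 calls: none
all_freq <- y / (x + y)   # 0/0, which R evaluates to NaN

cmp <- all_freq < 0.5     # comparing NaN gives NA, and if (NA) raises
                          # "missing value where TRUE/FALSE needed"

# Guarding the comparison yields a plain FALSE instead of an error
is_minor <- !is.na(cmp) && cmp
```

Inside a loop, a guard of this kind (or `isTRUE(all_freq < 0.5)`) is one way to skip such rows; what to do with them afterwards is a separate decision.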
    # Use `rowSums` to create the all_freq vector
    all_freq <- rowSums(newdata == 1)/rowSums((newdata == 1) | (newdata == 0))
    # Create a logical index for the elements that are less than 0.5
    indx <- all_freq < 0.5
    # The NA elements can be changed to FALSE by adding another condition
    indx1 <- indx & !is.na(indx)
    # Similarly for the elements that are >= 0.5
    indx2 <- !indx & !is.na(indx)

Now, subset the rows of 'newdata' with 'indx1', loop through the columns (lapply), use gsub with the pattern and replacement arguments, and assign the output back to the same subset of 'newdata'.
    newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='n', replacement=0)

Similarly, we can do the replacement for the rows where 'all_freq' is 0.5 or greater:
    newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='n', replacement=1)

The columns returned by gsub are of character class; they can be converted to numeric (if needed).
    newdata[] <- lapply(newdata, as.numeric)

Data
    set.seed(24)
    newdata <- as.data.frame(matrix(sample(c(0:1, "n"), 10*4, replace=TRUE), ncol=4), stringsAsFactors=FALSE)
    newdata[7,] <- 2
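The stringsAsFactors=FALSE in the data above may be relevant to the "3" values reported in the question. This is only a guess, since the question does not show how newdata was built. If the columns are factors rather than character vectors, as.numeric returns the internal level codes instead of the labels. A small sketch with a made-up factor `f`:

```r
# Hypothetical genotype column stored as a factor (levels: "0", "1", "n")
f <- factor(c("0", "1", "n", "1"))

codes <- as.numeric(f)   # level codes 1 2 3 2, not the genotypes

# Going through character first recovers the labels; "n" cannot be
# parsed as a number, so it becomes NA (the warning is suppressed here)
values <- suppressWarnings(as.numeric(as.character(f)))
```

Under this assumption, a 3 would correspond to an "n" that was never replaced, and 1 and 2 to the genotypes 0 and 1.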