Taking the sum of a TRUE/FALSE vector in R
I am working on analyzing SNP data from a fungus, and I am trying to impute missing data by changing the "n"s to the genotype of the more frequent allele (see below). newdata is a matrix of SNPs (rows) and fungal isolates (columns). The genotype of each SNP is in 0, 1, or n format, which is why I am trying to impute the missing genotypes.
    newdata_imputed=newdata
    for (k in 1:nrow(newdata)){
      u=newdata[k,]
      x<-sum(u==0)
      y<-sum(u==1)
      all_freq=y/(x+y)
      if (all_freq<0.5){
        newdata_imputed[k,]=gsub("n",0,u)
      } else {
        newdata_imputed[k,]=gsub("n",1,u)
      }
      print(k)
    }
However, I keep getting this error:
    [1] 295
    [1] 296
    Error in if (all_freq < 0.5) { : missing value where TRUE/FALSE needed
It is obvious that the code runs and then stops after encountering a problem. Please, can anyone tell me what I am doing wrong? I am a newbie to R, and any advice is appreciated.
@akrun, the reason why I used a loop is because it is nested in another loop. So, after using your code:
    newdata=as.data.frame(newdata)
    u=newdata
    all_freq <- rowSums(u==1)/rowSums((u==1)|(u==0))
    indx <- all_freq < 0.5
    indx1 <- indx & !is.na(indx)
    indx2 <- !indx & !is.na(indx)
    newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='n', replacement=0)
    newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='n', replacement=1)
    newdata[] <- lapply(newdata, as.numeric)
I got weird values:
    newdata[1:10,1:10]
      V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
    1  3  3  3  3  3  3  3  3  3   3
    2  2  2  2  2  2  2  2  2  2   2
    3  3  3  3  3  3  3  3  3  3   3
    4  1  1  1  1  1  1  1  1  1   1
Please, where is the "3" coming from? I should have only 0 or 1.
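We can do this using rowSums. As @bergant and @matthewlundberg mentioned in the comments, if there are rows with no 0 or 1 elements, we get NaN from the calculation (0/0), and comparing NaN with 0.5 gives NA, which if() cannot evaluate. One way to handle this is to modify the logical condition by including !is.na, i.e. keeping only the elements that are not NA, along with the previous condition.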
    # using rowSums, create the 'all_freq' vector
    all_freq <- rowSums(newdata==1)/rowSums((newdata==1)|(newdata==0))
    # create a logical index based on the elements less than 0.5
    indx <- all_freq < 0.5
    # the NA elements can be changed to FALSE by adding another condition
    indx1 <- indx & !is.na(indx)
    # similarly for the elements >= 0.5
    indx2 <- !indx & !is.na(indx)
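To see why the original loop stops, here is a minimal sketch using a made-up row that contains only "n" (the vector u below is hypothetical, not taken from the question's data). The ratio becomes 0/0, which is NaN; comparing NaN with 0.5 yields NA, and if() cannot branch on NA.

```r
# Hypothetical row with no 0 or 1 genotypes at all
u <- c("n", "n", "n")

x <- sum(u == 0)         # 0
y <- sum(u == 1)         # 0
all_freq <- y / (x + y)  # 0/0 gives NaN

is.nan(all_freq)         # TRUE
is.na(all_freq < 0.5)    # TRUE: the comparison yields NA, not TRUE/FALSE

# Calling if () on that NA raises the error from the question;
# tryCatch captures the message instead of stopping the script
msg <- tryCatch(
  if (all_freq < 0.5) "minor" else "major",
  error = function(e) conditionMessage(e)
)
msg
```

If the loop has to stay (for example because it is nested in another loop), one option is to skip such rows with `if (is.na(all_freq)) next` before the comparison.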
Now, we subset the rows of 'newdata' with 'indx1', loop through the columns (lapply), use gsub with the pattern and replacement arguments, and assign the output back to the subset of 'newdata'.

    newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='n', replacement=0)
Similarly, we can do the replacement for the rows where 'all_freq' is 0.5 or greater.

    newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='n', replacement=1)
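As a small worked illustration of the two replacements, here is a sketch on a made-up data frame (the name toy and its values are assumptions for demonstration only). Row 1 has allele 0 more frequent, row 2 has allele 1 more frequent, row 3 is a tie at 0.5, and row 4 has no 0 or 1 at all.

```r
# Made-up genotypes: 4 SNPs (rows) x 3 isolates (columns)
toy <- data.frame(
  V1 = c("0", "1", "n", "n"),
  V2 = c("0", "1", "1", "n"),
  V3 = c("n", "n", "0", "n"),
  stringsAsFactors = FALSE
)

all_freq <- rowSums(toy == 1) / rowSums((toy == 1) | (toy == 0))
indx  <- all_freq < 0.5
indx1 <- indx & !is.na(indx)    # rows where allele 0 is more frequent
indx2 <- !indx & !is.na(indx)   # rows where allele 1 is at least as frequent

toy[indx1, ] <- lapply(toy[indx1, ], gsub, pattern = "n", replacement = 0)
toy[indx2, ] <- lapply(toy[indx2, ], gsub, pattern = "n", replacement = 1)
toy
```

Rows 1 to 3 end up as 0 0 0, 1 1 1 and 1 1 0. Row 4 keeps its "n" values, because its frequency is NaN and neither index selects it.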
The columns in the gsub output are of character class; they can be converted to numeric (if needed).

    newdata[] <- lapply(newdata, as.numeric)
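One caveat, which may explain values such as 3 after the conversion: if the columns are factors rather than character vectors (as could happen by default with as.data.frame in R versions before 4.0), as.numeric returns the internal level codes instead of the printed labels. A minimal sketch with a made-up factor f (not the question's data); going through as.character first avoids this.

```r
# Made-up factor column with the three genotype labels
f <- factor(c("0", "1", "n", "1"))

levels(f)                     # "0" "1" "n"
as.numeric(f)                 # level codes: 1 2 3 2

# Going through character keeps the labels; "n" cannot be parsed
# and becomes NA (with a coercion warning, suppressed here)
vals <- suppressWarnings(as.numeric(as.character(f)))
vals                          # 0 1 NA 1
```

This is also why the example data are created with stringsAsFactors=FALSE.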
Data
    set.seed(24)
    newdata <- as.data.frame(matrix(sample(c(0:1, "n"), 10*4, replace=TRUE), ncol=4), stringsAsFactors=FALSE)
    newdata[7,] <- 2